Another issue is that after the last error, the test run goes into a loop instead of exiting:
terminate called recursively
tt_forge_signal_handler - signal: 6 (abort)
terminate called recursively
tt_forge_signal_handler - signal: 6 (abort)
terminate called recursively
tt_forge_signal_handler - signal: 6 (abort)
terminate called recursively
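For context on the loop: a fatal-signal handler generally has to restore the default disposition and re-raise before anything else can abort again, otherwise each abort re-enters the handler. A minimal sketch of that pattern (hypothetical; the real tt_forge_signal_handler may differ):

```cpp
#include <csignal>
#include <cstdio>
#include <cstdlib>

// Hypothetical handler sketch: log once, then restore the default
// disposition and re-raise so the process terminates instead of
// re-entering the handler on the next abort. (A production handler
// should also restrict itself to async-signal-safe calls.)
extern "C" void example_signal_handler(int sig) {
    std::fprintf(stderr, "signal: %d (abort)\n", sig);
    std::signal(sig, SIG_DFL);  // drop back to the default handler
    std::raise(sig);            // re-deliver; default action kills the process
}

int main() {
    std::signal(SIGABRT, example_signal_handler);
    std::abort();  // handler runs exactly once, then the process dies
}
```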
Looks like dual conv2d fails during runtime:
...
2024-09-18 09:09:45.831 | INFO | forge.compiled_graph_state:__call__:330 - Running model DualConv2d on device...
Always | FATAL | Unsupported data type
terminate called after throwing an instance of 'std::runtime_error'
Metal | INFO | Closing device 0
Metal | INFO | Closing device 0
what(): TT_THROW @ /__w/tt-forge-fe/tt-forge-fe/third_party/tt-mlir/third_party/tt-metal/src/tt-metal/ttnn/cpp/ttnn/tensor/tensor_impl.cpp:42: tt::exception
info:
Unsupported data type
backtrace:
--- tt::tt_metal::tensor_impl::element_size_bytes(tt::tt_metal::DataType)
--- tt::tt_metal::create_device_tensor(tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&)
...
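For reference, this kind of fatal usually comes from a switch over the dtype enum hitting a value it doesn't handle. A hedged sketch of that pattern, with a hypothetical stand-in enum (not the actual tt-metal source):

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for tt::tt_metal::DataType (illustrative only).
enum class DataType { BFLOAT16, FLOAT32, UINT32, INVALID };

// Sketch of the element_size_bytes() pattern: any enum value without an
// explicit case falls through to the throw, which is what surfaces as
// "Unsupported data type" in the log above.
uint32_t element_size_bytes(DataType dtype) {
    switch (dtype) {
        case DataType::BFLOAT16: return 2;
        case DataType::FLOAT32:  return 4;
        case DataType::UINT32:   return 4;
        default: throw std::runtime_error("Unsupported data type");
    }
}
```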
@pilkicTT do you happen to have some context around this failure? It fails on one of the conv tests. @vmilosevic did the pipeline fail on the same test after the re-run?
Yes, this is reproducible. @dgolubovicTT also noticed the same issue
The following commit: https://github.com/tenstorrent/tt-mlir/commit/2e170337a092a3bff05494b3b7b16a375261daf4 is triggering the issue.
We should probably revert it...
@vmilosevic can you open up an issue for @jnie-TT to disable this by default?
@nsmithtt @jnie-TT are there any functional benefits at the moment that we're utilizing with async mode? Do we have enough context on how and when it works on tt-metal?
@nvukobratTT Sorry, I don't have enough context on what needs to change and why. It's best if the component owner creates this ticket. I would like to see some basic RCA on this: what was broken and why, and how we can cover this type of error with a unit test.
Even though this change might have just exposed a bug, IMO we should refrain from changing defaults (with respect to tt-metal) at this point in time. We are not mature enough, don't have enough testing mechanisms in place, etc. Also, I would assume there is a reason why it's not enabled by default in tt-metal.
Even if we decide that we want to go down that path, there should be mechanisms for controlling the settings if things go south...
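One such mechanism could be an environment-variable override; a minimal sketch, assuming a hypothetical TT_METAL_ASYNC_MODE variable (the real flag name and plumbing would live in tt-metal/tt-mlir):

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical escape hatch: async mode stays off unless explicitly
// requested, so a bad default can always be worked around in the field.
bool async_mode_enabled() {
    const char* flag = std::getenv("TT_METAL_ASYNC_MODE");  // assumed name
    return flag != nullptr && std::strcmp(flag, "1") == 0;
}
```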
I think @jnie-TT has a real fix for this, that said, @jnie-TT let's disable async mode for now. We can re-enable it at some point in the future, but it's not needed for V0.
I think we should also establish a policy:
- tt-mlir uplifts metal, gets green PR
- Does not land
- tt-forge-fe uplifts tt-mlir and points to PR branch, gets green
- Only after green forge-fe can we land tt-mlir
- forge-fe uplifts at its convenience
I think we should do this until we have the mono-repo setup.
Yup, this makes sense.
To add to this, FFE vs. MLIR should be uplifted more often. To do so, we can have 2 processes in place:
If we see value in more frequent Metal uplifts, we can merge the two into a single process.
Discussion continued on the PR as well:
@vmilosevic we managed to find the root cause for a good chunk of the mentioned failures. Here is the first issue on the MLIR side that describes which change causes FFE regressions:
As @sdjordjevicTT is on vacation, can you @mtopalovicTT take a look at this regression, or find someone who can check this change out? :))
@sdjordjevicTT seems like the latest MLIR uplift solves this. I didn't have time to check the commit diff, but one of the latest changes resolved the existing issues.
In sum, there is a chance that we just hid the bug. For now, we're going to uplift MLIR as all tests are passing, in order to unblock Llama 3B. But if you have time, please double-check what happened and confirm whether this fix is intentional and whether it solves the issue at hand :)) Once you confirm it, feel free to close this issue :))
P.S. Moving priority from P0 -> P1
@svuckovicTT will look at it next week when he comes back from vacation.
@sdjordjevicTT @svuckovicTT can we create a follow-up issue for this one on the MLIR repo? Just to close it on FFE as it's not a blocker anymore :))
I'm finding it a bit hard to follow what's going on in this issue... Is there an outstanding problem that needs to be dealt with, and is it on the forge or MLIR side? @nvukobratTT
This isn't a blocker anymore for FFE.
This commit was causing issues for a specific group of tests; however, one of the newer commits at the time resolved them. That said, we're not sure whether we just hid the underlying bug or actually fixed it. All in all, feel free to sync with @sdjordjevicTT regarding your priorities and whether this is worth exploring atm :))
Running tests with latest tt-mlir fails with the following errors:
For the full log, see the CI run https://github.com/tenstorrent/tt-forge-fe/actions/runs/10918549924 and the PR https://github.com/tenstorrent/tt-forge-fe/pull/307