Another issue is that after the last error, the test run goes into a loop instead of exiting:
terminate called recursively
tt_forge_signal_handler - signal: 6 (abort)
terminate called recursively
tt_forge_signal_handler - signal: 6 (abort)
terminate called recursively
tt_forge_signal_handler - signal: 6 (abort)
terminate called recursively
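For context on the loop: a fatal-signal handler generally has to restore the default disposition and re-raise before anything else can abort again, otherwise each abort re-enters the handler. A minimal sketch of that pattern (hypothetical; the real tt_forge_signal_handler may differ):

```cpp
#include <csignal>
#include <cstdio>
#include <cstdlib>

// Hypothetical handler sketch: log once, then restore the default
// disposition and re-raise so the process terminates instead of
// re-entering the handler on the next abort. (A production handler
// should also restrict itself to async-signal-safe calls.)
extern "C" void example_signal_handler(int sig) {
    std::fprintf(stderr, "signal: %d (abort)\n", sig);
    std::signal(sig, SIG_DFL);  // drop back to the default handler
    std::raise(sig);            // re-deliver; default action kills the process
}

int main() {
    std::signal(SIGABRT, example_signal_handler);
    std::abort();  // handler runs exactly once, then the process dies
}
```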
Looks like dual conv2d fails during runtime:
...
2024-09-18 09:09:45.831 | INFO | forge.compiled_graph_state:__call__:330 - Running model DualConv2d on device...
Always | FATAL | Unsupported data type
terminate called after throwing an instance of 'std::runtime_error'
Metal | INFO | Closing device 0
Metal | INFO | Closing device 0
what(): TT_THROW @ /__w/tt-forge-fe/tt-forge-fe/third_party/tt-mlir/third_party/tt-metal/src/tt-metal/ttnn/cpp/ttnn/tensor/tensor_impl.cpp:42: tt::exception
info:
Unsupported data type
backtrace:
--- tt::tt_metal::tensor_impl::element_size_bytes(tt::tt_metal::DataType)
--- tt::tt_metal::create_device_tensor(tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&)
...
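For reference, this kind of fatal usually comes from a switch over the dtype enum hitting a value it doesn't handle. A hedged sketch of that pattern, with a hypothetical stand-in enum (not the actual tt-metal source):

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for tt::tt_metal::DataType (illustrative only).
enum class DataType { BFLOAT16, FLOAT32, UINT32, INVALID };

// Sketch of the element_size_bytes() pattern: any enum value without an
// explicit case falls through to the throw, which is what surfaces as
// "Unsupported data type" in the log above.
uint32_t element_size_bytes(DataType dtype) {
    switch (dtype) {
        case DataType::BFLOAT16: return 2;
        case DataType::FLOAT32:  return 4;
        case DataType::UINT32:   return 4;
        default: throw std::runtime_error("Unsupported data type");
    }
}
```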
@pilkicTT do you happen to have some context around this failure? It fails on one of the conv tests. @vmilosevic did the pipeline fail on the same test after the re-run?
Yes, this is reproducible. @dgolubovicTT also noticed the same issue
The following commit: https://github.com/tenstorrent/tt-mlir/commit/2e170337a092a3bff05494b3b7b16a375261daf4 is triggering the issue.
We should probably revert it...
@vmilosevic can you open up an issue for @jnie-TT to disable this by default?
@nsmithtt @jnie-TT are there any functional benefits at the moment that we're utilizing with async mode? Do we have enough context on how and when it works on tt-metal?
@nvukobratTT Sorry, I don't have enough context on what needs to change and why. It's best if the component owner creates this ticket. I would like to see some basic RCA on this: what was broken and why, and how we can cover this type of error with a unit test.
Even though this change might have just exposed a bug, IMO we should refrain from changing defaults (with respect to tt-metal) at this point in time. We are not mature enough, don't have enough testing mechanisms in place, etc. Also, I would assume there is a reason why it's not enabled by default in tt-metal.
Even if we decide that we want to go down that path, there should be mechanisms for controlling the settings if things go south...
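One such mechanism could be an environment-variable override; a minimal sketch, assuming a hypothetical TT_METAL_ASYNC_MODE variable (the real flag name and plumbing would live in tt-metal/tt-mlir):

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical escape hatch: async mode stays off unless explicitly
// requested, so a bad default can always be worked around in the field.
bool async_mode_enabled() {
    const char* flag = std::getenv("TT_METAL_ASYNC_MODE");  // assumed name
    return flag != nullptr && std::strcmp(flag, "1") == 0;
}
```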
I think @jnie-TT has a real fix for this, that said, @jnie-TT let's disable async mode for now. We can re-enable it at some point in the future, but it's not needed for V0.
I think we should also establish a policy:
- tt-mlir uplifts metal, gets green PR
- Does not land
- tt-forge-fe uplifts tt-mlir and points to PR branch, gets green
- Only after green forge-fe can we land tt-mlir
- forge-fe uplifts at its convenience
I think we should do this until we have the mono-repo setup.
Yup, this makes sense.
To add to this, FFE vs. MLIR should be uplifted more often. To do so, we can have 2 processes in place:
If we see value in more frequent Metal uplifts, we can merge the two into a single process.
Discussion continued on the PR as well:
@vmilosevic we managed to find the root cause for a good chunk of the mentioned failures. Here is the first issue on the MLIR side that describes which change causes FFE regressions:
As @sdjordjevicTT is on vacation, can you @mtopalovicTT take a look at this regression, or find someone who can check this change out? :))
@sdjordjevicTT seems like the latest MLIR uplift solves this. I didn't have time to check the commit diff, but one of the latest changes resolved the existing issues.
In sum, there is a chance that we just hid the bug. For now, we're going to uplift MLIR as all tests are passing, in order to unblock Llama 3B. But if you have time, please double-check what happened and confirm whether this fix is intentional and whether it solves the issue at hand :)) Once you confirm it, feel free to close this issue :))
P.S. Moving priority from P0 -> P1
@svuckovicTT will look at it next week when he comes back from vacation.
@sdjordjevicTT @svuckovicTT can we create a follow-up issue for this one on the MLIR repo? Just to close it on FFE as it's not a blocker anymore :))
I'm finding it a bit hard to follow what's going on in this issue... Is there an outstanding problem that needs to be dealt with, and is it on the forge or MLIR side? @nvukobratTT
This isn't a blocker anymore for FFE.
This commit was causing issues for a specific group of tests; however, one of the newer commits at the time resolved them. That said, we're not sure whether we just hid the underlying bug or actually fixed it. All in all, feel free to sync with @sdjordjevicTT regarding your priorities and whether this is worth exploring atm :))
Running tests with latest tt-mlir fails with the following errors:
For the full log, see the CI run https://github.com/tenstorrent/tt-forge-fe/actions/runs/10918549924 and the PR https://github.com/tenstorrent/tt-forge-fe/pull/307