tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0
399 stars 50 forks source link

Incorrect Output in Multi-Tile Processing When Using Mixed uint8 and bfloat16 Formats in Compute Kernel #11962

Open dongjin-na opened 3 weeks ago

dongjin-na commented 3 weeks ago

Describe the bug We are experiencing issues with using different types of CBs (bfloat16 and uint8) in the compute kernel. When processing a single tile in a loop, everything works as expected, but when iterating and handling two or more tiles, specific faces in the resulting tiles are incorrectly stored as 0.

The strange part is that the issue doesn’t occur when I add a dprint in the reader. Because of this, I suspect something is missing in the kernel or host code that I wrote. Could you please help check for any potential issues?

To Reproduce Steps to reproduce the behavior: This was detected in this bug https://github.com/tenstorrent/tt-metal/issues/11000 and this example can be used.

Expected behavior The expected output should be produced.

razorback3 commented 3 weeks ago

Note: Currently the highest priority issue among LLK related issues.

rdjogoTT commented 3 weeks ago

@dongjin-na I managed to solve the issue locally by adding a stall on the unpack_reconfig call. Can you please test it out on your branch to confirm the fix works. The fix is found on the rd/stall_unpack_reconfig branch for the wh-b0 submodule, so you can just check that commit out.

dongjin-na commented 2 weeks ago

Dear @rdjogoTT,

I tested the script and other related implementations using the patch you provided.

Based on these results, I have the following requests and questions:

  1. Could you please proceed with merging this patch into the main branch?
  2. Is this patch considered a complete solution? As mentioned earlier, it seems that when a kernel uses multiple SFPU and FPU compute APIs, the issue can still occur even after applying the patch.
dongjin-na commented 2 weeks ago

Dear @rdjogoTT,

Additionally, I'd like to inform you that I’ve updated the test kernel, which still fails even after applying the patch you provided. Could you please check if there might be an issue with the kernel implementation or if the patch might need further improvement?

Thank you in advance for your insights, and I look forward to your response.

test kernel

Here are the test results:

rdjogoTT commented 2 weeks ago

@dongjin-na I did expect the patch to be a complete solution, I will investigate this failing case now.

rdjogoTT commented 2 weeks ago

Progress update: Adding a UNPACK(( tensix_sync() )); at the end of the loop fixes the issue with the 4th face being all 0's. I am investigating deeper as to why, as this is a more extreme measure than should be necessary.

rdjogoTT commented 2 weeks ago

Looks like we've found the race, working formulating a solution and on a PR. It's a bit of a bigger one and will require substantial testing

rdjogoTT commented 5 days ago

@dongjin-na The fix is on the updated rd/stall_unpack_reconfig branch. Please try to run the test again to see if it works for you, I was able to get all tests to pass locally.

ilkoo-lee commented 1 day ago

@rdjogoTT Hello, I have confirmed that the previously failing tests now pass with the provided patch. However, I found a new issue related to this in another use case. Since this issue became too lengthy, I have separated it into a new issue. Please refer to https://github.com/tenstorrent/tt-metal/issues/12963. Thank you.