tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

[Bug Report] Bilinear Upsampling seems to cause a problem #13039

Open athul-bos-semi opened 6 days ago

athul-bos-semi commented 6 days ago

Describe the bug:
The program works up to and including the Bilinear Upsampling stage. The very next convolution then gets stuck in the _HWCommandQueue_writebuffer stage according to Tracy. I’m also attaching the Tracy results for reference.

To Reproduce:
Steps to reproduce the behavior:

  1. tt-smi -r 2
  2. python test_unet.py (You can find the files here)

Expected behavior:
The program should run to completion.
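Since the failure mode here is a hang rather than a crash, it can help to run the repro under a time limit so that "did not run to completion" is reported explicitly. The following is only a minimal sketch using the Python standard library; the helper name `run_with_timeout` and the 600-second budget are assumptions for illustration, not part of tt-metal.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and report whether it finished within timeout_s seconds.

    Returns ("completed", returncode) if the process exited in time,
    or ("hung", None) if it had to be killed after the timeout.
    """
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return ("hung", None)
    return ("completed", result.returncode)


# Hypothetical usage (script name taken from this thread, budget is a guess):
#   status, code = run_with_timeout([sys.executable, "test_unet.py"], 600)
#   print(status, code)
```

A "hung" result only says the script exceeded the chosen budget; Tracy is still needed to see which stage it was stuck in.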

Screenshots:
[Image: What a Convolution is supposed to look like]
[Image: What this Convolution looks like]
[Image: More details on what this Convolution looks like]

Environment Information:

Additional context:
Use Tracy when checking. Also use the L1 Buffer analyzer if available.

mywoodstock commented 5 days ago

@athul-bos-semi The example seems to use the deprecated tt_lib library. Which version of tt-metal are you using? It would be better to port it to the latest ttnn API.

athul-bos-semi commented 4 days ago

I ported to v0.51.0, but ttnn.reshard does not work there. Then I ported it to the main branch, where reshard works, but convolution throws an L1 memory error for code that was previously working.

mywoodstock commented 3 days ago

> but convolution throws L1 memory error for code that was previously working.

Can you please provide repro details on this? Since you say this regression is with the latest main version, we need to look into that issue first.

athul-bos-semi commented 3 days ago

You can download both the files from here and run the test_unet.py file to reproduce the errors.

mywoodstock commented 2 days ago

I mean: how do we repro the conv L1 memory error that you get with the latest main version of the repo, not the old version?

athul-bos-semi commented 7 hours ago

The main branch has already changed, and now the error is different. Can you try to run these two files? I am now running on specific versions to avoid confusion, instead of running on main.
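Because the behavior in this thread changes between v0.51.0 and successive states of main, it may help to attach the exact commit and interpreter details to each repro. Below is a minimal sketch, assuming the repro is run from inside a git checkout; the function names `git_commit` and `environment_report` are hypothetical helpers, not tt-metal APIs.

```python
import platform
import subprocess
import sys


def git_commit(repo_dir="."):
    """Return the checked-out commit hash of repo_dir, or "unknown" on failure."""
    try:
        out = subprocess.run(
            ["git", "-C", repo_dir, "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not a git checkout.
        return "unknown"
    return out.stdout.strip()


def environment_report(repo_dir="."):
    """Collect the details a maintainer needs to check out the same state."""
    return {
        "commit": git_commit(repo_dir),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }


# Hypothetical usage: paste the printed lines into the issue comment.
#   for key, value in environment_report(".").items():
#       print(f"{key}: {value}")
```

Posting the commit hash alongside the error message lets the error be reproduced against that exact state even after main moves on.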