tenstorrent / tt-metal

TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

can't run all convs in Vanilla UNet (brain MRI model) on device #5857

Open dvartaniansTT opened 6 months ago

dvartaniansTT commented 6 months ago

Describe the bug

We are adding 2 variants of vanilla UNet to run on device via TTNN. Because these variants have trained weights, we can build a demo of our UNet efforts and track PCC robustly per op. We are currently blocked on this by the issues below with running the conv ops on device. In short, the last conv (the classifier) won't run on device, and I can't find a single sharding strategy (either height or block sharding) under which all the convs in the model run on device. Please read on for instructions to reproduce the issues and for more details.

This issue is regarding the BRAIN MRI SEGMENTATION variant

To Reproduce

Steps to reproduce the behavior:

  1. Check out my branch at commit e110ed62fd2ef1cc8953fd1d8c62b600cf53e8be
  2. Run my unit test for the convs in the model: pytest tests/ttnn/unit_tests/operations/test_conv2d.py::test_unet_brain_conv
  3. As of now, I have downsized the unit tests to cover only math fidelity LoFi and data type BFLOAT8_B. There are two issues:
     1. The last convolution (the classifier) in UNet fails with block sharding and with height sharding as well. I cannot find an act_block_h config override that passes in the height-sharding case.
     2. I can't find a configuration in which all convs pass end to end with either block sharding or height sharding alone. This means we would need to change the sharding strategy between convs in the end-to-end model, which can be expensive.
  4. Please see the attached end-to-end graph of the model for reference. I have commented every test case with the corresponding op from the graph for your convenience, as in here
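
To make the second issue in step 3 concrete, here is a minimal sketch of the check I am effectively doing by hand. It is plain Python and does not call TTNN. The function names and the pass/fail table are hypothetical and for illustration only; they are not the actual results of test_unet_brain_conv. Given one pass/fail entry per conv and strategy, it reports which strategies work for every conv and, failing that, a per-conv plan with the fewest sharding-strategy switches.

```python
from typing import Dict, List, Optional, Tuple

STRATEGIES = ("height", "block")  # the two sharding strategies discussed in this issue


def common_strategies(results: List[Dict[str, bool]]) -> List[str]:
    """Strategies under which every conv passes (empty if there is none)."""
    return [s for s in STRATEGIES if all(r.get(s, False) for r in results)]


def plan_with_fewest_switches(
    results: List[Dict[str, bool]],
) -> Optional[Tuple[int, List[str]]]:
    """Return (switch_count, strategy per conv), or None if some conv
    passes under no strategy at all."""
    if not results:
        return 0, []
    inf = float("inf")
    # cost[s] = fewest switches so far for a plan whose latest conv uses s
    cost = {s: (0 if results[0].get(s, False) else inf) for s in STRATEGIES}
    back = []  # back[i][s] = strategy chosen for conv i when conv i+1 uses s
    for r in results[1:]:
        new_cost, choice = {}, {}
        for s in STRATEGIES:
            if not r.get(s, False):
                new_cost[s], choice[s] = inf, None
                continue
            # Staying on the same strategy is free; switching costs one reshard.
            prev = min(STRATEGIES, key=lambda p: cost[p] + (p != s))
            new_cost[s] = cost[prev] + (prev != s)
            choice[s] = prev
        back.append(choice)
        cost = new_cost
    last = min(STRATEGIES, key=lambda s: cost[s])
    if cost[last] == inf:
        return None
    plan = [last]
    for choice in reversed(back):
        plan.append(choice[plan[-1]])
    plan.reverse()
    return int(cost[last]), plan


# Hypothetical pass/fail table, one entry per conv in model order.
example = [
    {"height": True, "block": True},
    {"height": True, "block": False},
    {"height": False, "block": True},
]
print(common_strategies(example))
print(plan_with_fewest_switches(example))
```

A conv that fails under both strategies, which is what I see for the final classifier, makes plan_with_fewest_switches return None, so no end-to-end plan exists until that conv is fixed.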

Expected behavior

Enable the last 1x1 conv to run on device. Find the optimal configurations for block vs height sharding to run the model end to end.

environment information:

jliangTT commented 5 months ago

Assigning to @nsmithtt to triage - putting this as p2 while we discuss the priority of these items offline.

dvartaniansTT commented 4 months ago

@nsmithtt just wondering if you've had a chance to look at this one? cc: @mbahnasTT