tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Bad UNet Shallow model PCC when running multiple iterations with program cache #12510

Open esmalTT opened 2 months ago

esmalTT commented 2 months ago

Summary

UNet Shallow gives bad PCC after two iterations when program cache is enabled.

At a high level, we can reproduce this issue by doing the following steps:

  1. Run UNet Shallow end-to-end
  2. Run UNet Shallow end-to-end again
  3. Run UNet Shallow's Upblock4 and check PCC against the reference model

The PCC between the model output and the reference model output will be bad. The PCC becomes good (0.99) when program cache is disabled.
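
For readers unfamiliar with the metric, here is a minimal sketch of a PCC check, assuming PCC means the Pearson correlation coefficient between the flattened device output and the flattened reference output. The helper name `pcc` and the synthetic arrays are hypothetical and are not taken from the test file; the 0.99 threshold is the "good" value quoted above.

```python
import numpy as np


def pcc(expected, actual):
    """Pearson correlation coefficient between two tensors, flattened to 1-D.

    Hypothetical helper for illustration; not the repository's comparison code.
    """
    a = np.asarray(expected, dtype=np.float64).ravel()
    b = np.asarray(actual, dtype=np.float64).ravel()
    # np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
    # entry is the correlation between a and b.
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic stand-ins for the reference output and two device outputs.
rng = np.random.default_rng(0)
reference = rng.standard_normal(1024)
close_output = reference + 1e-3 * rng.standard_normal(1024)  # small numerical noise
unrelated_output = rng.standard_normal(1024)                 # e.g. corrupted data

print("close:", pcc(reference, close_output) > 0.99)
print("unrelated:", pcc(reference, unrelated_output) > 0.99)
```

An output that differs from the reference only by small numerical noise stays above the threshold, while an output that is corrupted or read from the wrong buffer drops well below it.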

Reproducing the error

Check out and build commit 14cf8346e42c0f6b67b8b38860764f68d5a3bce8 on N150/N300. Enable the 8x8 grid if using N300.

Run the following command:

pytest models/experimental/functional_unet/tests/test_unet_upblock.py::test_unet_upblock[device_params0-upblock4-16-528-80-16-2-1]

This should fail. If you edit models/experimental/functional_unet/tests/test_unet_upblock.py to disable program cache in test_unet_upblock, it should pass.

Investigation

Device 0 worker core(x= 0,y= 0) phys(x= 1,y= 2): ncrisc using noc1 tried to access Tensix core w/ physical coords (x=8,y=9) L1[addr=0x00178e80,len=192]
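
To make the fields in the log line above easier to compare across runs, here is a small sketch that pulls them out with a regular expression. The pattern is written against this one message only, and the function and key names are hypothetical; other messages from the same tool may be formatted differently.

```python
import re

# Pattern fitted to the single log line quoted in this issue (an assumption:
# other log lines may not follow exactly this layout).
ACCESS_RE = re.compile(
    r"phys\(x=\s*(\d+),y=\s*(\d+)\): (\w+) using (noc\d) tried to access "
    r"(\w+) core w/ physical coords \(x=(\d+),y=(\d+)\) "
    r"L1\[addr=(0x[0-9a-fA-F]+),len=(\d+)\]"
)


def parse_access(line):
    """Return the fields of the access report as a dict, or None if no match."""
    m = ACCESS_RE.search(line)
    if m is None:
        return None
    addr = int(m.group(8), 16)
    length = int(m.group(9))
    return {
        "source_phys": (int(m.group(1)), int(m.group(2))),
        "risc": m.group(3),
        "noc": m.group(4),
        "target_type": m.group(5),
        "target_phys": (int(m.group(6)), int(m.group(7))),
        "addr": addr,
        "len": length,
        "end": addr + length,  # first byte past the accessed range
    }


line = (
    "Device 0 worker core(x= 0,y= 0) phys(x= 1,y= 2): ncrisc using noc1 "
    "tried to access Tensix core w/ physical coords (x=8,y=9) "
    "L1[addr=0x00178e80,len=192]"
)
info = parse_access(line)
print(info)
```

For this message the accessed range starts at 0x178e80 (1543808) and covers 192 bytes on the core at physical coordinates (8, 9).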

esmalTT commented 2 months ago

Downgrading to P1 since there is currently a workaround.

Increasing the L1_SMALL space by 8192 B eliminated the issue.