Bad UNet Shallow model PCC when running multiple iterations with program cache

Summary

UNet Shallow gives bad PCC after two iterations when program cache is enabled.

At a high level, we can reproduce this issue by doing the following steps:

1. Run UNet Shallow end-to-end
2. Run UNet Shallow end-to-end a second time
3. Run UNet Shallow's Upblock4 and check PCC against the reference model

The PCC between the model and the reference model will be bad. The PCC becomes good (0.99) when program cache is disabled.
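For readers unfamiliar with the metric, here is a minimal sketch of a PCC check, assuming PCC means the Pearson correlation coefficient between the flattened device output and the flattened reference output. The function name `pcc` and the NumPy implementation are illustrative assumptions; this is not the comparison helper the test actually uses.

```python
import numpy as np


def pcc(expected, actual):
    """Pearson correlation coefficient between two tensors, compared flattened.

    Hypothetical helper for illustration only; not the repository's own check.
    """
    x = np.asarray(expected, dtype=np.float64).ravel()
    y = np.asarray(actual, dtype=np.float64).ravel()
    if x.shape != y.shape:
        raise ValueError("expected and actual must have the same number of elements")

    # Center both signals, then normalize the covariance by the product of norms.
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc * xc).sum() * (yc * yc).sum())
    if denom == 0.0:
        # Correlation is undefined for constant inputs; treat equal tensors as a match.
        return 1.0 if np.array_equal(x, y) else 0.0
    return float((xc * yc).sum() / denom)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.standard_normal(1024)
    close = reference + 1e-3 * rng.standard_normal(1024)  # small numerical noise
    garbage = rng.standard_normal(1024)                   # unrelated values
    print("close  :", pcc(reference, close))
    print("garbage:", pcc(reference, garbage))
```

A value near 1 means the outputs track the reference closely, while an output corrupted by unrelated data drifts toward 0.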

Reproducing the error

Check out and build commit 14cf8346e42c0f6b67b8b38860764f68d5a3bce8 on N150/N300. Enable the 8x8 grid if using N300.

Run the following command:

pytest models/experimental/functional_unet/tests/test_unet_upblock.py::test_unet_upblock[device_params0-upblock4-16-528-80-16-2-1]

This should fail. If you open models/experimental/functional_unet/tests/test_unet_upblock.py to disable program cache in test_unet_upblock, it should pass.

Investigation

Looking at the PCC of each individual layer in Upblock4 shows the PCC drops off after the second convolution.
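The per-layer comparison described above can be sketched as a simple scan that reports the first layer whose PCC falls below a threshold. Everything here is an assumption made for illustration: the function name `first_bad_layer`, the layer names, the 0.99 threshold, and the use of `np.corrcoef` in place of the test's own PCC check. The data in the example is synthetic, not output from the model.

```python
import numpy as np


def first_bad_layer(layers, threshold=0.99):
    """Return (name, pcc) for the first layer below threshold, or None.

    `layers` is an ordered list of (name, reference_output, device_output).
    Hypothetical helper for illustration only.
    """
    for name, reference, output in layers:
        ref = np.asarray(reference, dtype=np.float64).ravel()
        out = np.asarray(output, dtype=np.float64).ravel()
        # Off-diagonal entry of the 2x2 correlation matrix is the Pearson coefficient.
        value = float(np.corrcoef(ref, out)[0, 1])
        if value < threshold:
            return name, value
    return None


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref1, ref2, ref3 = (rng.standard_normal(256) for _ in range(3))
    layers = [
        ("conv1", ref1, ref1 + 1e-4 * rng.standard_normal(256)),  # matches reference
        ("conv2", ref2, rng.standard_normal(256)),                # synthetic corruption
        ("conv3", ref3, rng.standard_normal(256)),                # downstream of corruption
    ]
    print(first_bad_layer(layers))
```

Scanning in execution order matters because every layer downstream of the first corrupted one will also show bad PCC.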
Turning on Watcher detects a bad NOC transaction on the second model iteration, but only when program cache is enabled:

Device 0 worker core(x= 0,y= 0) phys(x= 1,y= 2): ncrisc using noc1 tried to access Tensix core w/ physical coords (x=8,y=9) L1[addr=0x00178e80,len=192]

Specifically, the Watcher error occurs in conv3 of upblock4. The bad read happens in the reader kernel when it reads the activations. If one of these reads is uncommented, the Watcher error is triggered: https://github.com/tenstorrent/tt-metal/blob/6d3424caf2597a00a4815a897c0ab4efc566f[…]els/reader_conv_activations_padded_with_halo_3x3_weights_v2.cpp
- Looking at the reader indices config buffer before launching conv2 of upblock4 shows garbage values. It seems like some op is writing into this L1 small space.

Note that you may need to disable the bias to enable Watcher because the convolution code size is too large.
This issue surfaced after I modified how we send inputs to the device. Previously we would first go from host to L1 interleaved, and the first convolution would then shard it. Now we go directly from host to L1 height sharded.
