tenstorrent / tt-metal

:metal: TT-NN operator library and TT-Metalium low-level kernel programming model.

Sharded tensor move to host `.cpu()` hangs for a particular config in Fast Dispatch mode #4319

Closed. mywoodstock closed this issue 10 months ago.

mywoodstock commented 10 months ago

Describe the bug

The following unit test hangs at the .cpu() call when run in Fast Dispatch mode:

pytest tests/tt_eager/python_api_testing/unit_testing/test_untilize_with_halo_v2.py::test_generate_all_configs_and_references[False-conv_params22-20-input_chw_shape22-98-grid_size22-False]

There are no issues in Slow Dispatch mode.

Performing a sharded_to_interleaved first and then moving to host with .cpu() also works fine, so the hang is specific to moving a sharded tensor directly to host.
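For context, a minimal sketch of that workaround, assuming the tt_lib Python API of this era (sharded_to_interleaved and .cpu() are named in this issue; the specific MemoryConfig arguments are my assumption):

```python
import tt_lib as ttl

def move_sharded_tensor_to_host(sharded_tensor):
    # Workaround: reshard to an interleaved layout on device first,
    # then move to host. Calling .cpu() directly on the sharded
    # tensor is what hangs under Fast Dispatch.
    # The MemoryConfig arguments here are illustrative assumptions.
    interleaved = ttl.tensor.sharded_to_interleaved(
        sharded_tensor,
        ttl.tensor.MemoryConfig(
            ttl.tensor.TensorMemoryLayout.INTERLEAVED,
            ttl.tensor.BufferType.DRAM,
        ),
    )
    return interleaved.cpu()
```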

To Reproduce

Branch: bliu/issue-4319

Test: pytest tests/tt_eager/python_api_testing/unit_testing/test_untilize_with_halo_v2.py::test_generate_all_configs_and_references[False-conv_params23-20-input_chw_shape23-98-grid_size23-False]

The input tensor shape for this unit test (UTWHv2) is [1, 1, 62720, 128], and the constructed output has shape [1, 1, 98 * 913, 128], with each shard sized [913, 128]. The datatype is BFLOAT16.
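As a quick sanity check on those numbers (plain Python arithmetic, no tt-metal calls; the variable names are mine):

```python
num_shards = 98            # one shard per core on the 98-core grid
shard_shape = (913, 128)   # per-shard [height, width]
total_height = num_shards * shard_shape[0]
print(total_height)        # 89474, i.e. the output shape [1, 1, 98 * 913, 128]
```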

TT-BrianLiu commented 10 months ago

Will merge a fix soon.