tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0
421 stars 54 forks source link

PCC error for `mul` with sharded input #9773

Closed yieldthought closed 3 months ago

yieldthought commented 3 months ago

Painfully discovered and debugged during Grok-1 bringup. The mul operator (and inline * operator) do not support sharded inputs. They silently produce bad results for some of the cores. Oh, and their output is interleaved instead of respecting the input memory config and remaining sharded.

Added a unit test for this to /home/moconnor/main-metal/tests/ttnn/unit_tests/operations/test_mul.py (test_multiply_with_scalar_sharded) which fails on main. Available in branch yieldthought/mul_sharded_scalar.

TT-BrianLiu commented 3 months ago

The output memory_config is forced to interleaved always for scalar mul and it's related to this issue: https://github.com/tenstorrent/tt-metal/issues/7637

The default should be the memory_config of the input tensor, which is HEIGHT_SHARDED in this case. The following fix will get your test to pass and also have the output be sharded:

image

There are two issues:

TT-BrianLiu commented 3 months ago

I identified what was causing the interleaved PCC issue. For sharded inputs, we need to be mindful of the shard orientation when distributing output work. What was happening was input shards are row-major and the work was being distributed col-major (which is the default).

This PR fixes that and also properly defaults output memory config to the input memory config when nothing is specified: https://github.com/tenstorrent/tt-metal/pull/9939

Side note: I picked up your test case and added it as well. I used ttnn.mul instead of * because I wanted to pass in the output memory config for testing.

TT-BrianLiu commented 3 months ago

Merged: https://github.com/tenstorrent/tt-metal/pull/9939