tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
https://docs.tenstorrent.com/ttnn/latest/index.html
Apache License 2.0

Using block sharded config with height < 32 leads to low PCC on multiple ops #15259

Open nemanjagrujic opened 5 days ago

nemanjagrujic commented 5 days ago

Using a block-sharded config with height < 32 leads to low PCC on isfinite and similar ops. The provided unit test gives multiple configurations where we observe the problem. Some of them are:

input_shape: [32, 4, 8, 768]
ttnn.bfloat16,
ttnn.TILE_LAYOUT,
ttnn.ShardStrategy.BLOCK,
ttnn.ShardOrientation.COL_MAJOR,
hw_as_shard_shape: False

input_shape: [32, 4, 8, 768]
ttnn.bfloat16,
ttnn.TILE_LAYOUT,
ttnn.ShardStrategy.BLOCK,
ttnn.ShardOrientation.ROW_MAJOR,
hw_as_shard_shape: False

input_shape: [32, 4, 8, 768]
ttnn.bfloat8_b,
ttnn.TILE_LAYOUT,
ttnn.ShardStrategy.BLOCK,
ttnn.ShardOrientation.ROW_MAJOR,
hw_as_shard_shape: False

The grid size used is (8, 8).

Ops found affected:

ttnn.exp
ttnn.cos
ttnn.sin
ttnn.abs
ttnn.isfinite
ttnn.isnan
ttnn.isinf
ttnn.isneginf
ttnn.isposinf

The problem is also observed when no op is called (just copying the tensor back and forth).
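
For readers unfamiliar with the metric, PCC here presumably means the Pearson correlation coefficient between the expected and the actual output. The sketch below is plain Python and does not use ttnn or the repository's own comparison helpers; the function name `pcc` and the sample values are hypothetical and only illustrate why a lossless round trip should score close to 1.0 while corrupted data scores lower.

```python
import math


def pcc(expected, actual):
    """Pearson correlation coefficient of two equal-length sequences.

    Hypothetical helper for illustration; not the comparison function
    used by the unit test in this issue.
    """
    if len(expected) != len(actual):
        raise ValueError("inputs must have the same length")
    n = len(expected)
    mean_e = sum(expected) / n
    mean_a = sum(actual) / n
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(expected, actual))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_e == 0 or var_a == 0:
        # Correlation is undefined for constant data; fall back to equality.
        return 1.0 if list(expected) == list(actual) else 0.0
    return cov / math.sqrt(var_e * var_a)


# Made-up sample data, not taken from the failing test.
reference = [0.5, -1.25, 2.0, 3.75, -0.5, 1.0]
round_trip_ok = list(reference)                      # lossless copy
round_trip_bad = [0.5, -1.25, 0.0, 0.0, -0.5, 1.0]   # two values zeroed out

print("lossless copy PCC:", pcc(reference, round_trip_ok))
print("corrupted copy PCC:", pcc(reference, round_trip_bad))
```

A copy with no op in between is expected to behave like `round_trip_ok`, so a low PCC in that case points at the data movement itself and not at any particular eltwise op.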

Example shapes that lead to the problem:

[256, 2, 5, 1536]
[1, 256, 2, 2304]
[32, 4, 8, 768]

pytest tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_eltwise_block_sharded.py

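
As a rough illustration of what the example shapes have in common, the sketch below rounds the last two dimensions up to a 32x32 tile. It assumes that "height" in the title refers to the second-to-last dimension and that TILE_LAYOUT uses 32x32 tiles; the helper names are hypothetical and nothing here calls ttnn. It only shows that each listed shape needs height padding and does not claim that padding is the root cause.

```python
TILE_HEIGHT = 32  # assumed tile height for TILE_LAYOUT
TILE_WIDTH = 32   # assumed tile width for TILE_LAYOUT


def round_up(value, multiple):
    """Round value up to the nearest multiple."""
    return ((value + multiple - 1) // multiple) * multiple


def tile_padded_shape(shape):
    """Pad the last two dims of shape to full tiles (hypothetical helper)."""
    *outer, height, width = shape
    return [*outer, round_up(height, TILE_HEIGHT), round_up(width, TILE_WIDTH)]


# Shapes copied from the issue description.
example_shapes = [
    [256, 2, 5, 1536],
    [1, 256, 2, 2304],
    [32, 4, 8, 768],
]

for shape in example_shapes:
    padded = tile_padded_shape(shape)
    needs_height_padding = shape[-2] < TILE_HEIGHT
    print(shape, "->", padded, "height padded:", needs_height_padding)
```

In all three shapes the height (5, 2 and 8) is below 32, while the width is already a multiple of 32.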
umadevimcw commented 3 days ago

We need more information on this issue, such as grid size, row_major or col_major, etc.

nemanjagrujic commented 4 hours ago

@umadevimcw I updated the ticket with the required information. You can find it in the unit test as well.