tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
https://docs.tenstorrent.com/ttnn/latest/index.html
Apache License 2.0

Enable python unit tests to run on wormhole_b0 #5258

Closed acejkov closed 9 months ago

acejkov commented 9 months ago

Describe the bug: A large number of unit tests are skipped on b0:

tests/tt_eager/python_api_testing/unit_testing/test_attn_matmul.py:from models.utility_functions import comp_pcc, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_average_pool.py:from models.utility_functions import skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_average_pool.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_bert_ops.py:from models.utility_functions import is_wormhole_b0, is_grayskull, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_bert_ops.py:# # @skip_for_wormhole_b0("WH ND hang, see issue #4392")
tests/tt_eager/python_api_testing/unit_testing/test_bert_sharded.py:from models.utility_functions import skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_bert_sharded.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_concat.py:from models.utility_functions import skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_concat.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_concat.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_concat.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_downsample.py:from models.utility_functions import skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_downsample.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_embedding.py:from models.utility_functions import skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_embedding.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_eps.py:from models.utility_functions import is_wormhole_b0, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_fully_connected.py:from models.utility_functions import is_wormhole_b0, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_fully_connected.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_groupnorm_sharded.py:from models.utility_functions import torch2tt_tensor, tt2torch_tensor, pad_by_zero, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_groupnorm_sharded.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_layernorm.py:from models.utility_functions import is_wormhole_b0, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_layernorm.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_layernorm.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_layernorm_sharded.py:from models.utility_functions import is_wormhole_b0, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_layernorm_sharded.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_layernorm_sharded.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_layernorm_sharded.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_max_pool.py:from models.utility_functions import skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_max_pool.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_moreh_clip_grad_norm.py:from models.utility_functions import comp_allclose_and_pcc, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_moreh_clip_grad_norm.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_moreh_clip_grad_norm.py:# @skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_moreh_layernorm.py:from models.utility_functions import comp_allclose_and_pcc, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_moreh_layernorm.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_moreh_matmul.py:from models.utility_functions import comp_allclose_and_pcc, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_moreh_matmul.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_moreh_matmul.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_moreh_matmul.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_moreh_matmul.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_moreh_matmul.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_moreh_sum.py:from models.utility_functions import comp_allclose_and_pcc, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_moreh_sum.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_move_sharded.py:from models.utility_functions import skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_move_sharded.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_nlp_concat_heads.py:from models.utility_functions import is_wormhole_b0, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_nlp_concat_heads.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_nlp_concat_heads.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_optimized_conv.py:from models.utility_functions import print_diff_argmax, is_close, comp_pcc, comp_allclose_and_pcc, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_optimized_conv.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_optimized_conv_multi_core.py:from models.utility_functions import print_diff_argmax, is_close, comp_pcc, comp_allclose_and_pcc, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_optimized_conv_multi_core.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_optimized_conv_v2.py:from models.utility_functions import skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_optimized_conv_v2.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_pow_fractional.py:from models.utility_functions import skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_pow_fractional.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_resnet50_first_conv.py:from models.utility_functions import is_wormhole_b0, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_resnet50_first_conv.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_resnet50_first_conv_folding_on_host.py: skip_for_wormhole_b0,
tests/tt_eager/python_api_testing/unit_testing/test_resnet50_first_conv_folding_on_host.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_resnet50_untilize_with_halo_and_conv_v2.py:from models.utility_functions import skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_resnet50_untilize_with_halo_and_conv_v2.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_rmsnorm.py:from models.utility_functions import is_wormhole_b0, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_rmsnorm.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_rotate_half.py:from models.utility_functions import skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_rotate_half.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_sfpu_chain.py:from models.utility_functions import is_wormhole_b0, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_sfpu_chain.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_sfpu_chain.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_single_core_fused_ops.py:from models.utility_functions import skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_untilize_with_halo_and_max_pool_v2.py:from models.utility_functions import is_wormhole_b0, skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_untilize_with_halo_and_max_pool_v2.py:@skip_for_wormhole_b0()
tests/tt_eager/python_api_testing/unit_testing/test_untilize_with_halo_v2.py:from models.utility_functions import skip_for_wormhole_b0
tests/tt_eager/python_api_testing/unit_testing/test_untilize_with_halo_v2.py:@skip_for_wormhole_b0()
skip_for_wormhole_b0 tests/tt_eager/python_api_testing/unit_testing/test_single_core_fused_ops.py
skip_for_wormhole_b0 tests/tt_eager/python_api_testing/unit_testing/test_softmax_sharded.py

To Reproduce: Run any of the aforementioned tests on wh_b0 and look at the list of skipped tests,

e.g. pytest -svv tests/tt_eager/python_api_testing/unit_testing/test_softmax_sharded.py

Expected behavior: Tests need to run and pass on the wh_b0 arch. If a test is not applicable to a specific arch, there has to be a clear message explaining why that test can't run (e.g. unsupported feature, not enough resources, etc.).
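
One way to meet the "clear message" expectation is to make the reason mandatory wherever a skip is declared. The following is a minimal sketch, not the repository's actual skip_for_wormhole_b0 implementation: the helpers skip_on_arch and current_arch, the test name, and reading the architecture from an ARCH_NAME environment variable are all assumptions for illustration. It uses the standard library's unittest.skipIf, whose skips pytest should also report together with the reason.

```python
import os
import unittest


def current_arch() -> str:
    # Assumption: the target architecture is exposed via ARCH_NAME.
    return os.environ.get("ARCH_NAME", "unknown")


def skip_on_arch(arch: str, reason: str):
    """Skip the decorated test on `arch`, and refuse skips without an explanation."""
    if not reason.strip():
        raise ValueError("a skip must say why the test cannot run on this arch")
    # The condition is evaluated once, when the decorator is applied.
    return unittest.skipIf(current_arch() == arch, f"{arch}: {reason}")


# Hypothetical test: the reason is shown in the skip summary (e.g. pytest -rs).
@skip_on_arch("wormhole_b0", "unsupported feature: not enough resources for this shape")
def test_example_op():
    assert 1 + 1 == 2
```

With this shape, a bare skip without a reason fails immediately at collection time instead of silently hiding a test.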

muthutt commented 9 months ago

33 test files. I suspect most of the skip_for_wormhole_b0() decorators can be removed:

     1  tests/tt_eager/python_api_testing/unit_testing/test_attn_matmul.py
     2  tests/tt_eager/python_api_testing/unit_testing/test_average_pool.py
     3  tests/tt_eager/python_api_testing/unit_testing/test_bert_ops.py
     4  tests/tt_eager/python_api_testing/unit_testing/test_bert_sharded.py
     5  tests/tt_eager/python_api_testing/unit_testing/test_concat.py
     6  tests/tt_eager/python_api_testing/unit_testing/test_downsample.py
     7  tests/tt_eager/python_api_testing/unit_testing/test_embedding.py
     8  tests/tt_eager/python_api_testing/unit_testing/test_eps.py
     9  tests/tt_eager/python_api_testing/unit_testing/test_fully_connected.py
    10  tests/tt_eager/python_api_testing/unit_testing/test_groupnorm_sharded.py
    11  tests/tt_eager/python_api_testing/unit_testing/test_layernorm.py
    12  tests/tt_eager/python_api_testing/unit_testing/test_layernorm_sharded.py
    13  tests/tt_eager/python_api_testing/unit_testing/test_max_pool.py
    14  tests/tt_eager/python_api_testing/unit_testing/test_moreh_clip_grad_norm.py
    15  tests/tt_eager/python_api_testing/unit_testing/test_moreh_layernorm.py
    16  tests/tt_eager/python_api_testing/unit_testing/test_moreh_matmul.py
    17  tests/tt_eager/python_api_testing/unit_testing/test_moreh_sum.py
    18  tests/tt_eager/python_api_testing/unit_testing/test_move_sharded.py
    19  tests/tt_eager/python_api_testing/unit_testing/test_nlp_concat_heads.py
    20  tests/tt_eager/python_api_testing/unit_testing/test_optimized_conv.py
    21  tests/tt_eager/python_api_testing/unit_testing/test_optimized_conv_multi_core.py
    22  tests/tt_eager/python_api_testing/unit_testing/test_optimized_conv_v2.py
    23  tests/tt_eager/python_api_testing/unit_testing/test_pow_fractional.py
    24  tests/tt_eager/python_api_testing/unit_testing/test_resnet50_first_conv.py
    25  tests/tt_eager/python_api_testing/unit_testing/test_resnet50_first_conv_folding_on_host.py
    26  tests/tt_eager/python_api_testing/unit_testing/test_resnet50_untilize_with_halo_and_conv_v2.py
    27  tests/tt_eager/python_api_testing/unit_testing/test_rmsnorm.py
    28  tests/tt_eager/python_api_testing/unit_testing/test_rotate_half.py
    29  tests/tt_eager/python_api_testing/unit_testing/test_sfpu_chain.py
    30  tests/tt_eager/python_api_testing/unit_testing/test_single_core_fused_ops.py
    31  tests/tt_eager/python_api_testing/unit_testing/test_softmax_sharded.py
    32  tests/tt_eager/python_api_testing/unit_testing/test_untilize_with_halo_and_max_pool_v2.py
    33  tests/tt_eager/python_api_testing/unit_testing/test_untilize_with_halo_v2.py
acejkov commented 9 months ago

test_softmax_sharded requires a fix on Yu's branch, yugao/gs_wh_block_matmul_hang.

Hopefully the others will work as well.

acejkov commented 9 months ago

@muthutt all resnet tests fail with this error

             Always | FATAL    | No L1 bank exists for core (x=8,y=0)

FAILED

do you know if anybody is looking into this?

muthutt commented 9 months ago

Not sure; it may be an issue with WHB0 n150 vs. n300, one of which has fewer rows on tile than the other (reclaimed cores?). Which machine did you use? Can you try the other one as well before filing the bug? What you have is a bug: the operator support should have raised an error saying that the implementation of this operator (whatever it is) is not supported on this particular type of wormhole arch.

Thanks

acejkov commented 9 months ago

No worries, it's a hardcoded grid size in some of the tests. I'm adjusting the grid size based on the device we run on and the available cores.
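
The adjustment described here can be sketched as clamping the requested grid to whatever the device reports. This is only an illustration: the helper names clamp_grid and core_in_grid are hypothetical, grids are plain (x, y) tuples rather than any tt-metal type, and the example sizes are made up. How the available grid is queried from the device is deliberately left out.

```python
from typing import Tuple

Grid = Tuple[int, int]  # (number of core columns, number of core rows)


def core_in_grid(core: Grid, grid: Grid) -> bool:
    """True if the zero-based core coordinate (x, y) lies inside `grid`."""
    return 0 <= core[0] < grid[0] and 0 <= core[1] < grid[1]


def clamp_grid(requested: Grid, available: Grid) -> Grid:
    """Shrink a hardcoded grid so it never exceeds what the device offers."""
    return (min(requested[0], available[0]), min(requested[1], available[1]))


# Illustrative numbers only: a test hardcodes a 9-column grid, the device has 8.
requested = (9, 4)
available = (8, 7)
grid = clamp_grid(requested, available)

# Core (x=8, y=0) exists in the hardcoded grid but not in the clamped one,
# which is the kind of coordinate the FATAL message above complains about.
assert core_in_grid((8, 0), requested)
assert not core_in_grid((8, 0), grid)
```

A test that derives its grid this way may need its shapes or sharding adjusted too, since fewer cores change how the work is split.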

muthutt commented 9 months ago

@acejkov - most of the tests for simple eltwise ops should have parity on WHB0 and work with just a PCC change, if anything needs changing.

ttmtrajkovic commented 9 months ago

> @acejkov - most of the tests for simple eltwise ops should have parity on WHB0 and work with just a PCC change, if anything needs changing.

@muthutt, what kind of PCC change do you expect? The PCC target or something else?

muthutt commented 9 months ago

Just by a few percentage points, e.g. 0.99 -> 0.98; something small.
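
To make the threshold change concrete: PCC here is the Pearson correlation coefficient between the golden output and the device output, and a test passes when it meets a target. The sketch below is a plain-Python illustration of that check, not the repository's comp_pcc helper; the function names pcc and passes_pcc and the sample values are assumptions.

```python
import math
from typing import Sequence


def pcc(expected: Sequence[float], actual: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(expected) != len(actual) or len(expected) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(expected)
    mean_e = sum(expected) / n
    mean_a = sum(actual) / n
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(expected, actual))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_e == 0 or var_a == 0:
        raise ValueError("PCC is undefined for constant input")
    return cov / math.sqrt(var_e * var_a)


def passes_pcc(expected: Sequence[float], actual: Sequence[float],
               target: float = 0.99) -> bool:
    """Pass when the correlation meets the target; relaxing means lowering `target`."""
    return pcc(expected, actual) >= target


golden = [1.0, 2.0, 3.0, 4.0]
device = [1.0, 2.1, 2.9, 4.2]  # made-up values with small numeric drift
print(pcc(golden, device), passes_pcc(golden, device, target=0.98))
```

Relaxing the target per architecture (for example 0.99 on one and 0.98 on another) keeps the test running everywhere while still catching outputs that are genuinely wrong.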

acejkov commented 9 months ago

The following tests remain to be enabled:

tests/tt_eager/python_api_testing/unit_testing/test_moreh_*.py
tests/tt_eager/python_api_testing/unit_testing/test_move_sharded.py
tests/tt_eager/python_api_testing/unit_testing/test_groupnorm_sharded.py
tests/tt_eager/python_api_testing/unit_testing/test_optimized_conv.py
tests/tt_eager/python_api_testing/unit_testing/test_optimized_conv_multi_core.py
tests/tt_eager/python_api_testing/unit_testing/test_optimized_conv_v2.py
tests/tt_eager/python_api_testing/unit_testing/test_resnet50_first_conv.py
tests/tt_eager/python_api_testing/unit_testing/test_resnet50_first_conv_folding_on_host.py
tests/tt_eager/python_api_testing/unit_testing/test_resnet50_untilize_with_halo_and_conv_v2.py
tests/tt_eager/python_api_testing/unit_testing/test_untilize_with_halo_and_max_pool_v2.py
tests/tt_eager/python_api_testing/unit_testing/test_untilize_with_halo_v2.py

acejkov commented 9 months ago

The latest merge, https://github.com/tenstorrent-metal/tt-metal/pull/5425, enables the following tests:

tests/tt_eager/python_api_testing/unit_testing/test_optimized_conv.py
tests/tt_eager/python_api_testing/unit_testing/test_optimized_conv_multi_core.py
tests/tt_eager/python_api_testing/unit_testing/test_resnet50_first_conv.py
tests/tt_eager/python_api_testing/unit_testing/test_resnet50_first_conv_folding_on_host.py
tests/tt_eager/python_api_testing/unit_testing/test_resnet50_untilize_with_halo_and_conv_v2.py
tests/tt_eager/python_api_testing/unit_testing/test_untilize_with_halo_v2.py

The remaining tests will be enabled as part of the ttnn unit tests.