tenstorrent / pytorch2.0_ttnn

⭐️ TTNN Compiler for PyTorch 2.0 ⭐️ It enables running PyTorch 2.0 models on Tenstorrent hardware.
https://tenstorrent.github.io/tt-metal/latest/ttnn/

ttnn.moreh_cumsum raises a runtime error #367

Open pschang-phy opened 2 weeks ago

pschang-phy commented 2 weeks ago

Description

When lowering torch.ops.aten.cumsum.default to ttnn.moreh_cumsum, a runtime error is raised.

Error message

self = FastOperation(python_fully_qualified_name='ttnn.moreh_cumsum', function=<ttnn._ttnn.operations.moreh.moreh_cumsum_t ob...uts=None, golden_function=None, postprocess_golden_function_outputs=None, is_cpp_operation=True, is_experimental=False)
function_args = (<[RuntimeError("TT_FATAL @ /tmp/build-via-sdist-n2_feo7u/metal_libs-0.53.0rc28+wormhole.b0/tt_metal/impl/device/devic...t_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]\n") raised in repr()] Tensor object at 0x7f0ddac3c0f0>, 1)
function_kwargs = {}

    def __call__(self, *function_args, **function_kwargs):
>       return self.function(*function_args, **function_kwargs)
E       RuntimeError: TT_FATAL @ /tmp/build-via-sdist-n2_feo7u/metal_libs-0.53.0rc28+wormhole.b0/ttnn/cpp/ttnn/tensor/types.cpp:239: normalized_index >= 0 and normalized_index < rank
E       info:
E       Index is out of bounds for the rank, should be between 0 and 1 however is 2
E       backtrace:
E        --- /home/ubuntu/tt_env/lib/python3.8/site-packages/ttnn/_ttnn.cpython-38-x86_64-linux-gnu.so(+0x1052769) [0x7f0e44ca1769]
E        --- /home/ubuntu/tt_env/lib/python3.8/site-packages/ttnn/_ttnn.cpython-38-x86_64-linux-gnu.so(+0x11be26a) [0x7f0e44e0d26a]
E        --- ttnn::operations::moreh::moreh_cumsum::MorehCumsumDeviceOperation::ProgramFactory::create(ttnn::operations::moreh::moreh_cumsum::MorehCumsumDeviceOperation::operation_attributes_t const&, ttnn::operations::moreh::moreh_cumsum::MorehCumsumDeviceOperation::tensor_args_t const&, tt::tt_metal::Tensor&)
E        --- void ttnn::device_operation::detail::launch_on_worker_thread<ttnn::operations::moreh::moreh_cumsum::MorehCumsumDeviceOperation, unsigned char, long, ttnn::operations::moreh::moreh_cumsum::MorehCumsumDeviceOperation::operation_attributes_t, ttnn::operations::moreh::moreh_cumsum::MorehCumsumDeviceOperation::tensor_args_t, tt::tt_metal::Tensor, tt::tt_metal::v0::Device*>(unsigned char, long, ttnn::operations::moreh::moreh_cumsum::MorehCumsumDeviceOperation::operation_attributes_t const&, ttnn::operations::moreh::moreh_cumsum::MorehCumsumDeviceOperation::tensor_args_t const&, tt::tt_metal::Tensor&, tt::tt_metal::v0::Device*&)
E        --- ttnn::operations::moreh::moreh_cumsum::MorehCumsumDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::launch_on_single_device<ttnn::operations::moreh::moreh_cumsum::MorehCumsumDeviceOperation>(unsigned char, ttnn::operations::moreh::moreh_cumsum::MorehCumsumDeviceOperation::operation_attributes_t const&, ttnn::operations::moreh::moreh_cumsum::MorehCumsumDeviceOperation::tensor_args_t const&)
E        --- /home/ubuntu/tt_env/lib/python3.8/site-packages/ttnn/_ttnn.cpython-38-x86_64-linux-gnu.so(+0xfc90ea) [0x7f0e44c180ea]
E        --- ttnn::operations::moreh::moreh_cumsum::MorehCumsumDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::invoke<ttnn::operations::moreh::moreh_cumsum::MorehCumsumDeviceOperation>(unsigned char, ttnn::operations::moreh::moreh_cumsum::MorehCumsumDeviceOperation::operation_attributes_t const&, ttnn::operations::moreh::moreh_cumsum::MorehCumsumDeviceOperation::tensor_args_t const&)
E        --- /home/ubuntu/tt_env/lib/python3.8/site-packages/ttnn/_ttnn.cpython-38-x86_64-linux-gnu.so(+0xfc88c4) [0x7f0e44c178c4]
E        --- ttnn::operations::moreh::moreh_cumsum::MorehCumsum::invoke(tt::tt_metal::Tensor const&, long, std::__1::optional<tt::tt_metal::Tensor> const&, std::__1::optional<tt::tt_metal::MemoryConfig> const&, std::__1::optional<std::__1::variant<ttnn::GrayskullComputeKernelConfig, ttnn::WormholeComputeKernelConfig>> const&)
E        --- /home/ubuntu/tt_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]

Reproduce

In pytorch2.0_ttnn, check out the branch origin/fix-e2e-model/bloom and run the cumsum lowering test:

$ git checkout origin/fix-e2e-model/bloom -b fix-e2e-model/bloom
$ pytest -s tests/lowering/misc/test_cumsum.py --capture=no
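Calling the op directly should hit the same TT_FATAL without going through the compiler (a minimal sketch, untested; the device id and layout choice are assumptions):

    import torch
    import ttnn

    device = ttnn.open_device(device_id=0)
    torch_input = torch.ones(1, 32, dtype=torch.bfloat16)
    ttnn_input = ttnn.from_torch(torch_input, layout=ttnn.TILE_LAYOUT, device=device)
    # mirrors the lowered call: aten.cumsum.default(input, 1) -> ttnn.moreh_cumsum(input, 1)
    ttnn.moreh_cumsum(ttnn_input, 1)
    ttnn.close_device(device)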

cc @ayerofieiev-tt @jerrysky3 @jdh8

jerrysky3 commented 2 weeks ago

I suspect this is a case where ttnn.moreh_cumsum expects the input tensor to be 4D. So during conversion we will need to unsqueeze the input tensor to 4D first (not sure why this isn't done in the ttnn library) and then squeeze the result back. For example, an input of shape (1, 1, 1, 32) with dim=3 is accepted by the cumsum.

But I found that the output is still incorrect. For example, with a (1, 1, 1, 32) tensor filled with 1.0, it outputs:

TorchTensor([[[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]]],
            dtype=torch.bfloat16)

while the expected output should be:

tensor([[[[ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14.,
           15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28.,
           29., 30., 31., 32.]]]], dtype=torch.bfloat16)
jerrysky3 commented 2 weeks ago

I see the problem. The current moreh_cumsum doesn't support reduction along any dims other than the first two (dim = 0, 1): msum_program_factory.cpp

Here we need dim = 3.

ayerofieiev-tt commented 2 weeks ago

@jerrysky3, noted. I will discuss with the maintainer and follow up. In the meantime, do you plan to work around this with a permute, or will you blocklist this input?

Also, I can't find where we need dim = 3; it is not reflected in https://github.com/tenstorrent/pytorch2.0_ttnn/blob/main/docs/operations/aten.cumsum.default.md

jerrysky3 commented 2 weeks ago

> @jerrysky3, noted. I will discuss with the maintainer and follow up. In the meantime, do you plan to work around this with a permute, or will you blocklist this input?
>
> Also, I can't find where we need dim = 3; it is not reflected in https://github.com/tenstorrent/pytorch2.0_ttnn/blob/main/docs/operations/aten.cumsum.default.md

Thanks. Bloom needs cumsum on a (1, 32) tensor with dim = 1, and moreh_cumsum only works with 4D tensors. So ideally we would unsqueeze the tensor into (1, 1, 1, 32), cumsum on dim = 3, and squeeze it back to (1, 32). Sorry I didn't explain this clearly enough in the comments.

I also filed the feature request https://github.com/tenstorrent/tt-metal/issues/14549 for cumsum on dim = 2, 3 (H, W). Meanwhile, I think it's possible to unsqueeze the tensor into (1, 32, 1, 1), cumsum on dim = 1, and then squeeze back. It's not ideal because the tensor becomes 32x32 times larger (with tile layout), but it might be a workaround for now, as these tensors are small.
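A rough torch-level sketch of that workaround with the Bloom shapes (illustrative only; on device, the cumsum on dim 1 would be the ttnn.moreh_cumsum call):

    import torch

    x = torch.ones(1, 32, dtype=torch.bfloat16)    # Bloom's cumsum input, dim=1
    x4d = x.reshape(1, 32, 1, 1)                   # move the cumsum axis onto dim 1 of a 4D tensor
    y4d = torch.cumsum(x4d, dim=1)                 # dim=1 is within moreh_cumsum's supported dims (0, 1)
    y = y4d.reshape(1, 32)                         # squeeze back to the original shape
    assert torch.equal(y, torch.cumsum(x, dim=1))  # same result as the original cumsum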

ayerofieiev-tt commented 5 days ago

The maintainer won't be able to update the implementation, so I see three options:

  1. A proper implementation in the op/kernel;
  2. The workaround described above, but in the TT-NN code;
  3. The workaround described above in the compiler (in this code base).

@jerrysky3 wdyt?

jerrysky3 commented 5 days ago

> The maintainer won't be able to update the implementation, so I see three options:
>
>   1. A proper implementation in the op/kernel;
>   2. The workaround described above, but in the TT-NN code;
>   3. The workaround described above in the compiler (in this code base).
>
> @jerrysky3 wdyt?

The workaround has been implemented in the compiler (this code base) and merged in https://github.com/tenstorrent/pytorch2.0_ttnn/pull/370. Personally I think it is fine for now, and we should aim for a proper fix in the op/kernel when there are cycles for it.
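For illustration only (this is not the code merged in #370), the lowered sequence for the Bloom case could look roughly like the following, assuming ttnn.reshape handles the rank changes:

    import ttnn

    def cumsum_via_moreh(ttnn_input):
        # hypothetical helper: cumsum of a (1, 32) tensor on dim=1 via a 4D round trip
        reshaped = ttnn.reshape(ttnn_input, (1, 32, 1, 1))  # cumsum axis moved to dim 1
        summed = ttnn.moreh_cumsum(reshaped, 1)             # dim 1 is supported by moreh_cumsum
        return ttnn.reshape(summed, (1, 32))                # restore the original shape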