tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low-level kernel programming model.
Apache License 2.0

Need for Tensor class to have a rank of original Tensor #5335

Closed: seunghwan100 closed this issue 5 months ago

seunghwan100 commented 8 months ago

In tt-metal, all tensors must currently be aligned to a 4D shape. For example, if the original shape of the tensor is (4, 8, 32), it should be transformed into (1, 4, 8, 32) to be compatible with tt-tensor.
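For clarity, a minimal sketch of that 4-D alignment (a hypothetical helper, not tt-metal's actual code): prepend singleton dimensions until the rank is 4.

#include <cstdint>
#include <vector>

// (4, 8, 32) -> (1, 4, 8, 32); shapes that are already 4-D pass through.
std::vector<uint32_t> pad_shape_to_4d(std::vector<uint32_t> dims) {
    while (dims.size() < 4) {
        dims.insert(dims.begin(), 1u);  // prepend a singleton dimension
    }
    return dims;
}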

In GroupNorm, both input and output can be of shape (N, C, *), where * can be any number of additional dimensions (>= 0).

[Image: GroupNorm shape documentation]

So, consider the case where the tt-tensor shape is (1, 4, 4, 4) and num_groups is 4. If the original tensor is 4D, then N=1 and C=4; if it is 3D, then N=4 and C=4. Similarly, if the tt-tensor shape is (1, 1, 4, 4) and num_groups is 4: if the original tensor is 3D, then N=1 and C=4; if it is 2D, then N=4 and C=4. For tensors of 5D or higher, N and C cannot be determined at all. (That problem needs to be resolved in a separate issue.)

In other words, the rank of the original Tensor cannot be inferred from the shape of the tt-tensor and the num_groups argument alone. Therefore, the Tensor class should carry information about the rank of the original Tensor.
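To illustrate why the rank is sufficient, here is a sketch (a hypothetical helper, assuming the original rank is between 2 and 4): the 4-D shape prepends (4 - rank) singleton dimensions, so N sits at index 4 - rank and C immediately after it.

#include <array>
#include <cstdint>
#include <utility>

// For tt-tensor shape (1, 4, 4, 4): rank 4 gives N=1, C=4; rank 3 gives N=4, C=4.
std::pair<uint32_t, uint32_t> infer_n_and_c(
    const std::array<uint32_t, 4>& padded_shape, uint32_t original_rank) {
    const uint32_t n_index = 4 - original_rank;  // position of N after padding
    return {padded_shape[n_index], padded_shape[n_index + 1]};
}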

arakhmati commented 8 months ago

Hey guys, this is currently supported in our Python API. So, at the very least, we know we can do it.

There are some constraints such as:

We are currently in the process of migrating this logic into C++

@chekangliang what is the priority of this task?

razorback3 commented 8 months ago

@arakhmati Yes. We need this feature at the C++ level. I assume this has pretty high priority because it is required for GroupNorm, which is used in GPT-2 training.

arakhmati commented 8 months ago

@razorback3 can you please give me a list of operations where you need this supported right now?

Obviously, group norm, but I am assuming the operations feeding it and consuming from it also need to be updated. So, I would like to have a list to unblock you right away.

jliangTT commented 8 months ago

@arakhmati, the priority is P1, but we need to scope this properly in terms of ops.

@razorback3, can you please answer the questions from @arakhmati?

So, I would like to have a list of [ops] to unblock you right away

razorback3 commented 8 months ago

@arakhmati @jliangTT Currently, to the best of my knowledge, groupnorm is the only OP that requires this feature.

However, it is hard for me to understand why you are asking for the list of ops that require this feature. Can you explain more about how you are going to support this feature? Then, I can give you whatever additional information you need. I initially thought the change could easily be made in the Tensor class to store the user-given shape information.

arakhmati commented 8 months ago

Can you explain more about how you are going to support this feature?

@razorback3 the Shape class can support shapes of rank 1 to 8 already. We just never updated the ops to be N-D in C++

But as of now you can use this constructor: Shape(const std::vector<uint32_t>& dimensions)

So the feature is there. We just need to update ops. We will eventually update all of them.
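A minimal usage sketch of that constructor (assumes the tt-eager Shape type is in scope; exact namespaces may differ):

Shape shape_2d(std::vector<uint32_t>{32, 32});           // rank 2
Shape shape_5d(std::vector<uint32_t>{2, 3, 2, 32, 32});  // rank 5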

But what I am asking is what ops do you need updated right away?

razorback3 commented 8 months ago

OK. Thanks @arakhmati. Let me summarize.

The Shape class in tt-eager supports up to 8-D, and its constructor accepts shapes with fewer than 8 dimensions. This is good. However, some host functions assume the tensor is always 4-D.

For example, the code below causes a runtime error because the input shape is 3-D. To make it work, the user has to write the shape as {1, 3, 32, 32}.

Shape shape = {3, 32, 32};                   // 3-D shape
Tensor a = tt::numpy::random::random(shape);
a = a.to(Layout::TILE); // <----- ERROR: this path assumes a 4-D tensor
a = a.to(device);

So, I think making the to(...) utility functions work for non-4-D tensors will solve all the problems. Hope this information is enough for you.

Thanks.

seunghwan100 commented 8 months ago

convert_layout_row_major_to_tile, convert_layout_tile_to_row_major, Tensor::pad_to_tile and Tensor::unpad_from_tile assume that the tensor shape is 4D.
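For illustration, a minimal sketch of how the 4-D assumption in such functions could be relaxed (an assumption-based illustration, not the actual tt-metal code): tile padding only touches the last two dimensions, so any leading dimensions can pass through unchanged regardless of rank.

#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t TILE_DIM = 32;  // tiles are 32x32

// Round the last two dimensions up to tile multiples; leading dims untouched.
std::vector<uint32_t> tile_padded_shape(std::vector<uint32_t> dims) {
    auto round_up = [](uint32_t x) { return ((x + TILE_DIM - 1) / TILE_DIM) * TILE_DIM; };
    const std::size_t rank = dims.size();  // works for any rank >= 2
    dims[rank - 2] = round_up(dims[rank - 2]);
    dims[rank - 1] = round_up(dims[rank - 1]);
    return dims;
}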

arakhmati commented 8 months ago

So, I think making the to(...) utility functions work for non-4-D tensors will solve all the problems.

Yes, that's exactly right. We simply need to update the logic in the functions. The types already support N-D.

FYI, we support negative indexing into the shape like in Python, i.e. shape[-1] gives you the last dimension.
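As a side note, a sketch of what that negative indexing amounts to (a hypothetical free function; the real Shape class implements this behavior in its own operator[]):

#include <cstddef>
#include <cstdint>
#include <vector>

// Python-style indexing: dim_at(shape, -1) returns the last dimension.
uint32_t dim_at(const std::vector<uint32_t>& shape, int index) {
    if (index < 0) index += static_cast<int>(shape.size());  // wrap negatives
    return shape[static_cast<std::size_t>(index)];
}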

seunghwan100 commented 8 months ago

We simply need to update the logic in the functions.

Do you have a plan for this?

arakhmati commented 8 months ago

We simply need to update the logic in the functions.

Do you have a plan for this?

It's on the roadmap, but we don't have concrete dates for it yet.

seunghwan100 commented 7 months ago

I created issue https://github.com/tenstorrent-metal/tt-metal/issues/6480, so I closed this one.

jliangTT commented 7 months ago

This was marked as blocking groupNorm. Are you unblocked to implement groupNorm?

razorback3 commented 7 months ago

The problem is summarized as follows: https://github.com/tenstorrent-metal/tt-metal/issues/5335#issuecomment-1959126844

and the answer from Akhmed: https://github.com/tenstorrent-metal/tt-metal/issues/5335#issuecomment-1965307078

Answer to the question from Jason:

This was marked as blocking groupNorm. Are you unblocked to implement groupNorm?

No. We are waiting for the implementation that Akhmed mentioned above.

jliangTT commented 7 months ago

@razorback3 yes that is my understanding here. It confused me when this got closed.

jliangTT commented 6 months ago

@razorback3 , can you add a unit test that @arakhmati can implement against?

razorback3 commented 6 months ago

Here is the branch that contains the test code (@jliangTT, @arakhmati): https://github.com/tenstorrent-metal/tt-metal/tree/jaehoon/test-tensor-ranks

jliangTT commented 6 months ago

Assigning to @arakhmati as discussed in our previous conversation, now that we have the repro. What is the t-shirt size? Can we turn this around fairly fast, Akhmed?

jliangTT commented 5 months ago

Promoting this to P0 following a discussion with @davorchap / @jvasilje. @arakhmati, can you prioritize accordingly?

arakhmati commented 5 months ago

Promoting this to P0 following a discussion with @davorchap / @jvasilje. @arakhmati, can you prioritize accordingly?

Yes, will do

arakhmati commented 5 months ago

PR with the fixes

arakhmati commented 5 months ago

I merged to main. pad_to_tile now works for any rank between 1 and 8 (inclusive on both ends). I am going to close this issue.
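For reference, a minimal usage sketch against the merged change (assuming pad_to_tile takes a pad value and that Shape/Tensor are in scope as in the earlier repro; the exact signature may differ):

Shape shape = {2, 3, 2, 30, 30};        // rank 5, last two dims not tile-aligned
Tensor a = tt::numpy::random::random(shape);
Tensor padded = a.pad_to_tile(0.0f);    // pads the last two dims up to 32x32 tiles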

razorback3 commented 5 months ago

Thanks @arakhmati. I can now create 5D~8D tensors at the C++ level, but I encountered an error in the Python code.

import torch
import tt_lib as ttl

torch.manual_seed(10)

# assumes a device has been opened beforehand,
# e.g. device = ttl.device.CreateDevice(0)
shape = (3, 2, 2, 32, 32)
x = torch.randint(low=0, high=4, size=shape).to(torch.bfloat16)

dev_x = ttl.tensor.Tensor(x, ttl.tensor.DataType.BFLOAT16).to(ttl.tensor.Layout.TILE).to(device)

E       RuntimeError: TT_THROW @ tt_eager/tensor/tensor_impl.hpp:177: tt::exception
E       info:
E       Rank {} is not supported!
E       5

jliangTT commented 5 months ago

Ok, I synced with @arakhmati offline on this, and I think the initial scope was a little fuzzy, so misalignment happened.

I feel like we have a gap between what was written and what was expected. To help us going forward, we probably need an explicit enumeration of the functions that require support.

In this case, can Moreh open up another issue to enumerate the methods so we can knock them out with unit tests? I can continue to prioritize them as P0 if necessary.

seunghwan100 commented 5 months ago

There is already an issue:

https://github.com/tenstorrent/tt-metal/issues/6480

jliangTT commented 5 months ago

@seunghwan100, sorry, I read through the issue and it is not sufficiently descriptive to help us capture the detailed requirements. It is missing the information we requested in my last message, i.e., can Moreh open up another issue to enumerate the methods so we can knock them out with unit tests?

I am sure it is as frustrating for Moreh to receive deliverables that don't meet your needs as it is for TT to deliver on this without getting clear requirements. Getting better at writing explicit requirements is an area that will help the partnership serve both sides better.

@razorback3.

razorback3 commented 5 months ago

Absolutely no problem. I understand the situation. Will open a new issue soon. Thanks.

@jliangTT @arakhmati

razorback3 commented 5 months ago

Opened a new issue with more concrete requirements and unit tests: https://github.com/tenstorrent/tt-metal/issues/8149