chandrasekaranpradeep opened this issue 1 month ago
@ntarafdar @sjameelTT can you please help me to find owners for this issue?
Hey @sdjordjevicTT, asking around. It's a reduction op that doesn't have an owner; will ask the ttnn folks and get back to you.
@sdjordjevicTT Asked around, and since there is no other owner for this, the TMs team will have to take it. We cannot get to it until the end of next week.
Thanks @ntarafdar for picking this up. Great, I believe that should work for us.
moving to a P1 issue. @sdjordjevicTT pls comment if you believe the P0 is justified.
@nvukobratTT can comment more about priority, but I think this issue blocks Llama 3B bring-up on the Forge side.
Confirming what @sdjordjevicTT mentioned, this one is a blocker for the Open Llama 3B model.
Additional details can be found on the MLIR issue as well:
Spoke to Jasmine, and @bbradelTT is taking over reductions for now. I'm reassigning this to him.
I tried to find out if there's any point at which there's a big drop-off. It seemed like it might be somewhere between 1200 and 1400, but the PCC value goes up and down a fair amount:
#input_shape = (1, 12, 3200) # .72
#input_shape = (1, 12, 1600) # .76
#input_shape = (1, 12, 1400) # .78
#input_shape = (1, 12, 1376) # .81
#input_shape = (1, 12, 1363) # .93
#input_shape = (1, 12, 1369) # .83
#input_shape = (1, 12, 1368) # .85
#input_shape = (1, 12, 1367) # .71
#input_shape = (1, 12, 1366) # .96
#input_shape = (1, 12, 1350) # .95
#input_shape = (1, 12, 1344) # .87
#input_shape = (1, 12, 1300) # .91
#input_shape = (1, 12, 1200) # .93
#input_shape = (1, 12, 800) # .92
#input_shape = (1, 12, 320) # .99
Hi @bbradelTT, do we have any updates regarding this mismatch problem?
@sdjordjevicTT Unfortunately we need to overhaul reduce. I won't have concrete updates for a while.
@bbradelTT thanks for the details. Can you clarify the following:
To be certain that this issue is properly tracked, I'm re-adding the P0 label once again. Please correct me if I'm missing some context as to why this one should still be a P1 issue.
Thanks!
@nvukobratTT
Update for today:
Thanks for pushing this one further @bbradelTT! Much appreciated 🙌
Update for today:
Update for today:
- I spent today in meetings and on my other currently active P0, which I didn't get a chance to look at yesterday.
Thanks for the update and for letting us know, Borys :))
It's valuable for us to know the state of the issues, and when we expect it to be resolved, so that we can plan accordingly on our side as well. Thanks once again!
Update for today:
Looking at different shapes:
Next steps: As far as I can tell, the floating point errors from reduce_tile are cumulative. Therefore the next step is to try removing the scaling factor from reduce_tile (leaving it as 1) and applying it at the end instead. I will need to analyze how that helps PCC. That will have a performance hit, but hopefully it will improve PCC.
cc @rtawfik01 for LLK visibility
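For illustration, here is a rough Python-level sketch of that plan (the actual change is inside the reduce op itself, so the helper name is hypothetical, and it assumes ttnn.sum and a scalar second operand to ttnn.multiply behave as expected):

import torch
import ttnn

def mean_via_unscaled_sum(device, torch_input, dim=-1):
    # Sum with an effective scaling factor of 1, then apply the 1/N factor at
    # the end, instead of folding it into the reduce_tile accumulation.
    n = torch_input.shape[dim]
    x = ttnn.from_torch(torch_input, dtype=ttnn.float32, layout=ttnn.TILE_LAYOUT, device=device)
    summed = ttnn.sum(x, dim=dim)
    return ttnn.to_torch(ttnn.multiply(summed, 1.0 / n))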
Removing the scaling factor and then calling ttnn::multiply resulted in the following.
With all ones, this increased the accuracy:
However, with random inputs, it lowered the PCC:
expected_pytorch_result = tensor([[[0.4979],
[0.4969],
[0.5080],
[0.5029],
[0.5012],
[0.5046],
[0.4993],
[0.5034],
[0.5109],
[0.4984],
[0.4972],
[0.4963]]])
actual_pytorch_result = TorchTensor([[[0.4961],
[0.4980],
[0.5000],
[0.4961],
[0.5039]... [0.5039],
[0.5078],
[0.5000],
[0.5078],
[0.4961]]])
pcc = 0.9999
def assert_with_pcc(expected_pytorch_result, actual_pytorch_result, pcc=0.9999):
    assert list(expected_pytorch_result.shape) == list(
        actual_pytorch_result.shape
    ), f"list(expected_pytorch_result.shape)={list(expected_pytorch_result.shape)} vs list(actual_pytorch_result.shape)={list(actual_pytorch_result.shape)}"
    pcc_passed, pcc_message = comp_pcc(expected_pytorch_result, actual_pytorch_result, pcc)
>   assert pcc_passed, construct_pcc_assert_message(pcc_message, expected_pytorch_result, actual_pytorch_result)
E   AssertionError: 0.3941431853119553
Will look at CBs to see what happened.
Looking at the CBs, it looks like the final output values are integers. That led me to think there is an incremental factor at play as well.
FAILED test_mean2.py::test_reduce[32] - AssertionError: 0.9996477335026296
FAILED test_mean2.py::test_reduce[50] - AssertionError: 0.9990662794549972
FAILED test_mean2.py::test_reduce[128] - AssertionError: 0.9983768209716934
FAILED test_mean2.py::test_reduce[200] - AssertionError: 0.9971359814011025
FAILED test_mean2.py::test_reduce[512] - AssertionError: 0.9640920022989882
FAILED test_mean2.py::test_reduce[800] - AssertionError: 0.9159602157408702
FAILED test_mean2.py::test_reduce[1600] - AssertionError: 0.8340448198726398
FAILED test_mean2.py::test_reduce[3200] - AssertionError: 0.3941431853119553
FAILED test_mean2.py::test_reduce[4096] - AssertionError: 0.752128086868872
There is a definite drop-off, although powers of 2 have different behaviour.
Looking at the code, the mean reduce on the width (last dimension) is a matmul_tiles call, not a reduce_tile call.
I swapped dimensions, and verified that this uses reduce_tile, and that PCC behaviour is similar:
def test_reduce_h(device, width):
    torch.manual_seed(0)
    input_shape = (width, 12)
    reduce_dim = 0
    torch_input_tensor = torch.rand(input_shape, dtype=torch.float32)
    torch_output_tensor = torch.mean(torch_input_tensor, dim=reduce_dim, keepdim=True, dtype=torch.float32)
    input_tensor = ttnn.from_torch(torch_input_tensor, dtype=ttnn.float32, layout=ttnn.TILE_LAYOUT, device=device)
    output_tensor = ttnn.mean(input_tensor, dim=reduce_dim)
    output_tensor = ttnn.to_torch(output_tensor)
    assert_with_pcc(torch_output_tensor, output_tensor)
resulted in
FAILED test_mean2.py::test_reduce_h[32] - AssertionError: 0.9998316111876924
FAILED test_mean2.py::test_reduce_h[50] - AssertionError: 0.999309887656723
FAILED test_mean2.py::test_reduce_h[128] - AssertionError: 0.9949648376563645
FAILED test_mean2.py::test_reduce_h[200] - AssertionError: 0.9940303932238288
FAILED test_mean2.py::test_reduce_h[512] - AssertionError: 0.9832944977683854
FAILED test_mean2.py::test_reduce_h[800] - AssertionError: 0.9470958968139239
FAILED test_mean2.py::test_reduce_h[1600] - AssertionError: 0.7357217229341707
FAILED test_mean2.py::test_reduce_h[3200] - AssertionError: 0.7696898216846108
FAILED test_mean2.py::test_reduce_h[4096] - AssertionError: 0.6391112138824813
I think what might be happening is that the logic is processing one tile at a time, accumulating the numbers.
On average each value would be around .5. By the time you get to a width of 1600, the sum would be around 800. Adding .5 to 800 loses low-order bits of the new value, so there is more precision loss. That would also explain why the PCC went down faster when the multiplication is done at the end: the accumulated values are larger.
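A small host-side sketch of why that hurts (assuming, for illustration, that the on-device accumulation behaves roughly like bfloat16 addition; the actual kernel precision may differ):

import torch

torch.manual_seed(0)
values = torch.rand(3200, dtype=torch.float32)

# Reference sum in float32.
ref = values.sum().item()

# Sequential accumulation in bfloat16: once the running sum is large, the
# low-order bits of each new ~0.5 addend are rounded away.
acc = torch.tensor(0.0, dtype=torch.bfloat16)
for v in values:
    acc = acc + v.to(torch.bfloat16)

print(ref, acc.item())  # the running bfloat16 sum drifts noticeably from ~1600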
Next step: try reducing in more of an n*log(n) manner in the pytest to see if that helps PCC.
Because of fatals in padding, I had to switch to reduce sum, which is independent of any additional 0s being padded in.
PCCs follow a similar trend:
FAILED test_mean2.py::test_reduce[32] - AssertionError: 0.9996477335026296
FAILED test_mean2.py::test_reduce[50] - AssertionError: 0.9992098215952638
FAILED test_mean2.py::test_reduce[128] - AssertionError: 0.9983768209716934
FAILED test_mean2.py::test_reduce[200] - AssertionError: 0.9980993887635425
FAILED test_mean2.py::test_reduce[512] - AssertionError: 0.9640920022989882
FAILED test_mean2.py::test_reduce[800] - AssertionError: 0.9242553618875884
FAILED test_mean2.py::test_reduce[1600] - AssertionError: 0.8777743499163675
FAILED test_mean2.py::test_reduce[3200] - AssertionError: 0.3992921039572661
FAILED test_mean2.py::test_reduce[4096] - AssertionError: 0.752128086868872
FAILED test_mean2.py::test_reduce[32768] - AssertionError: 0.03179110579730596
Adding in chunks does help PCC. For example, the following test:
def test_reduce_part(device):
    torch.manual_seed(0)
    # 32, 1024, 32768
    width = 1024
    input_shape = (12, width)
    reduce_dim = -1
    n = 32
    torch_input_tensors = [torch.rand(input_shape, dtype=torch.float32) for i in range(n)]
    torch_input_tensor = torch.cat(torch_input_tensors, reduce_dim)
    torch_output_tensor = torch.sum(torch_input_tensor, dim=reduce_dim, keepdim=True, dtype=torch.float32)
    input_tensors = [ttnn.from_torch(torch_input_tensors[i], dtype=ttnn.float32, layout=ttnn.TILE_LAYOUT, device=device) for i in range(n)]
    intermediate_tensors = [ttnn.sum(input_tensors[i], dim=reduce_dim) for i in range(n)]
    intermediate_torch_tensors = [ttnn.to_torch(intermediate_tensors[i]) for i in range(n)]
    intermediate_torch_tensor = torch.concat(intermediate_torch_tensors, reduce_dim)
    intermediate_tensor = ttnn.from_torch(intermediate_torch_tensor, dtype=ttnn.float32, layout=ttnn.TILE_LAYOUT, device=device)
    output_tensor = ttnn.sum(intermediate_tensor, dim=reduce_dim)
    output_tensor = ttnn.to_torch(output_tensor)
    assert_with_pcc(torch_output_tensor, output_tensor)
improves PCC compared to the unchunked runs above:
Total width 32768 (chunk width = 1024):
FAILED test_mean2.py::test_reduce_part - AssertionError: 0.7728428816004724
Total width 8192 (chunk width = 256):
FAILED test_mean2.py::test_reduce_part - AssertionError: 0.9844695593902897
I also looked online.
Loss of precision is a known problem for adding long sequences of floating point numbers, and there are algorithms to help with that: https://en.wikipedia.org/wiki/Kahan_summation_algorithm
Next step: I'll see how complex it would be to try to incorporate such an algorithm here.
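For reference, a minimal pure-Python sketch of Kahan (compensated) summation; how feasible it is to express this in the compute kernel is exactly the open question:

def kahan_sum(values):
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation            # re-inject the previously lost bits
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was just lost
        total = t
    return total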
Another algorithm is https://en.wikipedia.org/wiki/Pairwise_summation which is what I tried to approximate with the adjusted test.
I don't think there's anything in hardware that would help with accuracy. The pairwise summation algorithm is probably going to be easier to implement.
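And a minimal sketch of pairwise summation, which is what the chunked pytest above approximates; the rounding error grows roughly as O(log n) instead of O(n):

def pairwise_sum(values, base=8):
    # Recursively sum halves of the sequence so no single accumulator has to
    # absorb thousands of small addends.
    n = len(values)
    if n <= base:
        total = 0.0
        for v in values:
            total += v
        return total
    mid = n // 2
    return pairwise_sum(values[:mid], base) + pairwise_sum(values[mid:], base)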
I think the easiest would be to change
output_tensor = tt::tt_metal::reduce(input_tensor, tt::tt_metal::ReduceOpMath::SUM...
calls in ttnn/cpp/ttnn/operations/reduction/generic/generic_reductions.cpp
for reductions along a single dimension so that the scaling factor is 1, then split the tensor via something like the below, and then apply reduce across the sub-tensors.
y = torch.ones((1, 1, 1024, 8192), dtype=torch.float32)
y_tt = ttnn.from_torch(y, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
a = ttnn.slice(y_tt, [0, 0, 0, 0], [1, 1, 1024, 4096])
b = ttnn.slice(y_tt, [0, 0, 0, 4096], [1, 1, 1024, 8192])
Note: The split has to occur at tile boundaries, so something like the following will not work:
z = torch.ones((1, 1, 50, 70), dtype=torch.float32)
z_tt = ttnn.from_torch(z, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
c = ttnn.slice(z_tt, [0, 0, 0, 0], [1, 1, 50, 35])
d = ttnn.slice(z_tt, [0, 0, 0, 35], [1, 1, 50, 70])
You would need to split it at 32 or 64 instead of 35.
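A hypothetical Python-level sketch of that split-and-reduce idea for a (1, 1, h, w) input (the real change would live inside generic_reductions.cpp; the helper name and the single split point are only illustrative):

import torch
import ttnn

def split_sum_last_dim(device, torch_input):
    # torch_input assumed to have shape (1, 1, h, w) with w a multiple of 64.
    _, _, h, w = torch_input.shape
    half = (w // 2 // 32) * 32  # keep the split point on a 32-wide tile boundary
    x = ttnn.from_torch(torch_input, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    left = ttnn.slice(x, [0, 0, 0, 0], [1, 1, h, half])
    right = ttnn.slice(x, [0, 0, 0, half], [1, 1, h, w])
    # Reduce each half with a scaling factor of 1 (a plain sum), then combine,
    # so each accumulator sees only half as many addends.
    partial = ttnn.add(ttnn.sum(left, dim=-1), ttnn.sum(right, dim=-1))
    return ttnn.to_torch(partial)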
If the split is across two dimensions, then the PCCs will continue to be as they currently are. However, that would take longer and is probably best addressed in the future. This will take at least several days to implement. I'll continue daily updates, although they will be short.
@bbradelTT, thanks for pushing this further and keeping us posted! 🙌
Today's update:
Today's update:
Today's update:
Today's update:
There is still one issue with pad where the inputs are not multiples of tiles. However, I got the following PCCs for inputs of size (1, 1, 12, n), where n is passed in:
FAILED test_mean2.py::test_reduce[32] - AssertionError: 0.9996477335026296
FAILED test_mean2.py::test_reduce[50] - AssertionError: 0.9992098215952638
FAILED test_mean2.py::test_reduce[128] - AssertionError: 0.9990462253048942
FAILED test_mean2.py::test_reduce[512] - AssertionError: 0.9965267965250721
FAILED test_mean2.py::test_reduce[1024] - AssertionError: 0.9889842589831356
FAILED test_mean2.py::test_reduce[1600] - AssertionError: 0.9852341708352339
FAILED test_mean2.py::test_reduce[3200] - AssertionError: 0.9786824431459795
FAILED test_mean2.py::test_reduce[4096] - AssertionError: 0.963620489697217
FAILED test_mean2.py::test_reduce[32768] - AssertionError: 0.7968057244607337
Would this be sufficient?
Thanks for the update @bbradelTT!
@dgolubovicTT can you confirm whether we need confirmation for other shapes besides (1, 1, 12, n) for the reduce mean?
For the current shapes, PCC is okayish, but it'll be good to fix the reduce ops, as PCC issues are common on Forge models (other models are not P0 atm).
@bbradelTT do you have a separate issue for fixing these PCC problems more generally on the TTNN side? Even if the fix for this case is currently OK, I assume we will hit many more of them pretty soon.
Thanks once again for helping us out with this one!
@nvukobratTT 12 probably comes from the initial number of tokens in Llama. @chandrasekaranpradeep correct me if I am wrong, but since the number of input tokens can be any number less than the max input tokens, we need ttnn.mean to work generally (not just for 12).
@dgolubovicTT That's right. The reduce op input shape is (1, 12, 3200). Here, 1 refers to the batch size, 12 refers to the number of tokens in the sequence (it can vary to any number less than the max sequence length), and 3200 is the embedding dimension (fixed for Open Llama). As @dgolubovicTT mentioned, we need ttnn.mean to work generally.
@chandrasekaranpradeep I would suggest you add a test for this on our side that covers all input shapes from (1,1,3200) to (1,max_num_tokens,3200). That way, we will know for sure which shapes fail. This overlaps with unique ops testing, and I think this is a good example of how to use it. We have to support all ops in Llama that have a variable sequence length for all sequence lengths, right? So we should add those tests covering all sequence lengths as soon as possible, for two reasons:
I tried running the reduce mean test for shape (1,32,3200) with dim = -1, and it also gives a data mismatch. Since I generated the input using torch.rand, the mean should be around 0.5 for all rows. That is the case in the torch output:
fw_out is tensor([[[0.4979],
[0.4969],
[0.5080],
[0.5029],
[0.5012],
[0.5046],
[0.4993],
[0.5034],
[0.5109],
[0.4984],
[0.4972],
[0.4963],
[0.5007],
[0.5020],
[0.4910],
[0.4976],
[0.5074],
[0.4978],
[0.4947],
[0.5001],
[0.4986],
[0.4956],
[0.5075],
[0.4962],
[0.4956],
[0.5050],
[0.5009],
[0.5055],
[0.5023],
[0.5025],
[0.4936],
[0.4976]]])
However, the ttnn output is consistently less than 0.5 for all 32 rows:
co_out is tensor([[[0.4648],
[0.4707],
[0.4844],
[0.4727],
[0.4570],
[0.4766],
[0.4727],
[0.4785],
[0.4883],
[0.4707],
[0.4766],
[0.4648],
[0.4668],
[0.4668],
[0.4629],
[0.4648],
[0.4688],
[0.4688],
[0.4590],
[0.4805],
[0.4766],
[0.4668],
[0.4824],
[0.4648],
[0.4668],
[0.4844],
[0.4746],
[0.4766],
[0.4668],
[0.4707],
[0.4668],
[0.4766]]])
Interestingly, I tried this for shapes (1,32,x) where x goes from 10 to 400 (exclusive) with a step of 10.
These shapes are failing with data mismatch:
(1,32,210)
(1,32,220)
(1,32,250)
(1,32,330)
(1,32,340)
(1,32,350)
(1,32,360)
(1,32,370)
(1,32,380)
(1,32,390)
For shapes with x less than 210 there is no data mismatch.
Since the inputs are from torch.rand (which is a uniform distribution over [0,1)), it is really odd that the mean over 3200 values is 0.46-0.48. It seems that mean doesn't work on bigger dimensions at all...
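For reference, a sketch of that sweep at the ttnn level (the Forge test itself isn't quoted here; this just parametrizes the same (1, 32, x) shapes and reuses the assert_with_pcc helper from earlier, with illustrative dtype/layout choices):

import pytest
import torch
import ttnn

@pytest.mark.parametrize("x", range(10, 400, 10))
def test_mean_width_sweep(device, x):
    torch.manual_seed(0)
    torch_input = torch.rand((1, 32, x), dtype=torch.float32)
    torch_output = torch.mean(torch_input, dim=-1, keepdim=True)
    tt_input = ttnn.from_torch(torch_input, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    tt_output = ttnn.to_torch(ttnn.mean(tt_input, dim=-1))
    assert_with_pcc(torch_output, tt_output)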
Today's update:
Hey @bbradelTT, I've talked to @nvukobratTT offline to scope requirements for this issue.
For the first phase we need ttnn.mean to work for shapes (1,12,3200), (1,128,3200) and (1,1,3200). Fixing this is P0.
For the second phase, we will need ttnn.mean to work for all shapes (1,x,3200) for x in [1,2048]. Only then can we get Llama inference for any number of input tokens. This second phase can be P1.
So after you finish this first phase, we will move this issue to P1, and go from there.
Describe the bug
ttnn.mean throws an assertion error because of a data mismatch between the PyTorch and TTNN outputs, and the PCC drops to 0.72 when an input tensor of shape (1, 12, 3200) with dim = -1 is passed to the ttnn.mean op. For more context, here is the exact error message:
To Reproduce
Run the following test:
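(The original test isn't quoted here; a minimal reproduction along the lines of the tests in this thread, for the failing (1, 12, 3200) case, would look roughly like the following, with illustrative dtype/layout choices:)

import torch
import ttnn

def test_mean_mismatch(device):
    torch.manual_seed(0)
    torch_input = torch.rand((1, 12, 3200), dtype=torch.float32)
    torch_output = torch.mean(torch_input, dim=-1, keepdim=True)
    tt_input = ttnn.from_torch(torch_input, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    tt_output = ttnn.to_torch(ttnn.mean(tt_input, dim=-1))
    assert_with_pcc(torch_output, tt_output)  # observed PCC is about 0.72, well below the threshold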
Expected behavior
The data mismatch between PyTorch and TTNN output should be resolved.