[Bug Report] ttnn.mean op - Data Mismatch

chandrasekaranpradeep commented 1 month ago

Describe the bug The ttnn.mean throws assertion error because of data mismatch between PyTorch and TTNN output and the pcc is dropped to 0.72 when the input tensor of (1, 12, 3200) and dim = -1 is passed to ttnn.mean op. For more context, here is the exact error message

def assert_with_pcc(expected_pytorch_result, actual_pytorch_result, pcc=0.9999):
        assert list(expected_pytorch_result.shape) == list(
            actual_pytorch_result.shape
        ), f"list(expected_pytorch_result.shape)={list(expected_pytorch_result.shape)} vs list(actual_pytorch_result.shape)={list(actual_pytorch_result.shape)}"
        pcc_passed, pcc_message = comp_pcc(expected_pytorch_result, actual_pytorch_result, pcc)
>       assert pcc_passed, construct_pcc_assert_message(pcc_message, expected_pytorch_result, actual_pytorch_result)
E       AssertionError: 0.7203957195745748

To Reproduce

Run the following test:

import torch
import ttnn
from tests.ttnn.utils_for_testing import assert_with_pcc
from models.utility_functions import torch_random
def test_mean_pcc_issue(device):
    torch.manual_seed(0)

    input_shape = (1, 12, 3200)
    reduce_dim = -1

    torch_input_tensor = torch.rand(input_shape, dtype=torch.float32)
    torch_output_tensor = torch.mean(torch_input_tensor, dim=reduce_dim, keepdim=True, dtype=torch.float32)

    input_tensor = ttnn.from_torch(torch_input_tensor, dtype=ttnn.float32, layout=ttnn.TILE_LAYOUT, device=device)

    output_tensor = ttnn.mean(input_tensor, dim=reduce_dim)
    output_tensor = ttnn.to_torch(output_tensor)

    assert_with_pcc(torch_output_tensor, output_tensor)

Expected behavior The data mismatch between PyTorch and TTNN output should be resolved.

sdjordjevicTT commented 1 month ago

@ntarafdar @sjameelTT can you please help me to find owners for this issue?

ntarafdar commented 1 month ago

hey @sdjordjevicTT asking around, its a reduction op who doesn't have an owner , will ask ttnn ppl and get back to you

ntarafdar commented 1 month ago

@sdjordjevicTT asked around and since there is no other owner for this, the TMs team will have to take this. We cannot get to this until end of next week.

sdjordjevicTT commented 1 month ago

Thanks @ntarafdar for picking this up. Great, I believe that should work for us.

jvasilje commented 4 weeks ago

moving to a P1 issue. @sdjordjevicTT pls comment if you believe the P0 is justified.

sdjordjevicTT commented 4 weeks ago

@nvukobratTT can comment more about priority, but I think this issue blocks Llama 3B bring-up on the Forge side.

nvukobratTT commented 4 weeks ago

moving to a P1 issue. @sdjordjevicTT pls comment if you believe the P0 is justified.

Confirming what @sdjordjevicTT mentioned, this one is a blocker for the Open Llama 3B model.

Additional details can be found on the MLIR issue as well:

https://github.com/tenstorrent/tt-mlir/issues/869

ntarafdar commented 3 weeks ago

Spoke to Jasmine, and @bbradelTT is for now taking over reductions. I'm reassigning this to him.

bbradelTT commented 3 weeks ago

I tried to find out if there's any point at which there's a big drop off. Seemed like it might be somewhere between 1200 and 1400, but the PCC value goes up and down a fair amount:

    #input_shape = (1, 12, 3200) # .72
    #input_shape = (1, 12, 1600) # .76
    #input_shape = (1, 12, 1400) # .78
    #input_shape = (1, 12, 1376) # .81
    #input_shape = (1, 12, 1363) # .93
    #input_shape = (1, 12, 1369) # .83
    #input_shape = (1, 12, 1368) # .85
    #input_shape = (1, 12, 1367) # .71
    #input_shape = (1, 12, 1366) # .96
    #input_shape = (1, 12, 1350) # .95
    #input_shape = (1, 12, 1344) # .87
    #input_shape = (1, 12, 1300) # .91
    #input_shape = (1, 12, 1200) # .93
    #input_shape = (1, 12, 800) # .92
    #input_shape = (1, 12, 320) # .99

sdjordjevicTT commented 2 weeks ago

Hi @bbradelTT do we have some updates regarding this missmatch problem?

bbradelTT commented 2 weeks ago

@sdjordjevicTT Unfortunately we need to overhaul reduce. I won't have concrete updates for a while.

nvukobratTT commented 2 weeks ago

@sdjordjevicTT Unfortunately we need to overhaul reduce. I won't have concrete updates for a while.

@bbradelTT thanks for the details. Can you clarify the following:

What are the core issues with reduced mean and related PCC issues?
- Having this info, we might be able to work around it until a fix is in place
Details around lowering the priority on this issue.
- From the current standpoint, this issue should be treated as P0 as it blocks one of the Forge core models, Llama.

To be certain that this issue is properly tracked, I'm re-adding the P0 label once again. Please correct me if I'm missing some context as to why this one should still be a P1 issue.

Thanks!

bbradelTT commented 2 weeks ago

@nvukobratTT

At a high level, depending on the inputs, a lot of different things are done (including transpose, auto format, and reshape) that don't work that well when different dimensions are padded, and padding may not be done properly in all instances (see https://github.com/tenstorrent/tt-metal/issues/12662 and https://github.com/tenstorrent/tt-metal/issues/13647). I'm not sure of the root cause of the reduced mean in this case, although in one of the other issues it seems to be because the entire tile is used and therefore there are extra 0s and the denominator is larger than it should be. Having said that, I still haven't isolated the exact issue for this specific scenario.
I don't have context for the priority.

Update for today:

I'm working on printing out the data in the circular buffers to see where any discrepancies are occurring. Will continue that work tomorrow.

nvukobratTT commented 2 weeks ago

Thanks for pushing this one further @bbradelTT! Much appreciated 🙌

bbradelTT commented 2 weeks ago

Update for today:

I spent today in meetings and on my other currently active P0 that I didn't have a chance to look at yesterday.

nvukobratTT commented 2 weeks ago

Update for today:

I spent today in meetings and on my other currently active P0 that I didn't have a chance to look at yesterday.

Thanks for the update and for letting us know Borys :))

It's valuable for us to know the state of the issues, and when we expect it to be resolved, so that we can plan accordingly on our side as well. Thanks once again!

bbradelTT commented 1 week ago

Update for today:

instrumented compute kernel to print out input CBs and output CB and used all ones as input.
Used ones instead of random values.
There is a scalar factor that is 1/n. When n is not a power of 2, there is error introduced.
The reduce_tile operation is called for each tile of the reduced operation, multiplying the error.

Looking at different shapes:

1,1,12,32 scalar is exact, output is 1
1,1,12,800 scalar is 0.00124359 vs .00125 (.6% off), output is 0.988281
1,1,12,3200 scalar is 0.000310898 vs 0.0003125 (.5% off), output is 0.917969
1,1,12,2048 scalar is exact, output is 1

Next steps: As far as I can tell, the floating point errors from reduce_tile are accumulative. Therefore the next step is to try to remove the scaling factor (leave it as 1) from reduce_tile and apply it afterwards at the end. I will need to analyze how that helps PCC. That will have a performance hit, but hopefully will improve PCC.

cc @rtawfik01 for LLK visibility

bbradelTT commented 1 week ago

Removing the scaling factor and then calling ttnn::multiply resulted in the following.

With all ones, increased the accuracy:

1,1,12,800 output is 0.9961
1,1,12,3200 output is 0.9961
1,1,12,2048 output is still 1

However, with random inputs, lowered PCC:

expected_pytorch_result = tensor([[[0.4979],
         [0.4969],
         [0.5080],
         [0.5029],
         [0.5012],
         [0.5046],
         [0.4993],
         [0.5034],
         [0.5109],
         [0.4984],
         [0.4972],
         [0.4963]]])
actual_pytorch_result = TorchTensor([[[0.4961],
              [0.4980],
              [0.5000],
              [0.4961],
              [0.5039]...            [0.5039],
              [0.5078],
              [0.5000],
              [0.5078],
              [0.4961]]])
pcc = 0.9999

    def assert_with_pcc(expected_pytorch_result, actual_pytorch_result, pcc=0.9999):
        assert list(expected_pytorch_result.shape) == list(
            actual_pytorch_result.shape
        ), f"list(expected_pytorch_result.shape)={list(expected_pytorch_result.shape)} vs list(actual_pytorch_result.shape)={list(actual_pytorch_result.shape)}"
        pcc_passed, pcc_message = comp_pcc(expected_pytorch_result, actual_pytorch_result, pcc)
>       assert pcc_passed, construct_pcc_assert_message(pcc_message, expected_pytorch_result, actual_pytorch_result)
E       AssertionError: 0.3941431853119553

Will look at CBs to see what happened.

bbradelTT commented 1 week ago

Looking at CBs, they look like the final output is integers. That lead me to think there is an incremental factor at play as well.

FAILED test_mean2.py::test_reduce[32] - AssertionError: 0.9996477335026296
FAILED test_mean2.py::test_reduce[50] - AssertionError: 0.9990662794549972
FAILED test_mean2.py::test_reduce[128] - AssertionError: 0.9983768209716934
FAILED test_mean2.py::test_reduce[200] - AssertionError: 0.9971359814011025
FAILED test_mean2.py::test_reduce[512] - AssertionError: 0.9640920022989882
FAILED test_mean2.py::test_reduce[800] - AssertionError: 0.9159602157408702
FAILED test_mean2.py::test_reduce[1600] - AssertionError: 0.8340448198726398
FAILED test_mean2.py::test_reduce[3200] - AssertionError: 0.3941431853119553
FAILED test_mean2.py::test_reduce[4096] - AssertionError: 0.752128086868872

There is a definite drop off, although powers of 2 have different behaviour.

Looking at the code, the mean reduce on the width (last dimension) is a matmul_tiles call, not a reduce_tile call.

I swapped dimensions, and verified that this uses reduce_tile, and that PCC behaviour is similar:

def test_reduce_h(device, width):
    torch.manual_seed(0) 

    input_shape = (width, 12)
    reduce_dim = 0

    torch_input_tensor = torch.rand(input_shape, dtype=torch.float32)
    torch_output_tensor = torch.mean(torch_input_tensor, dim=reduce_dim, keepdim=True, dtype=torch.float32)

    input_tensor = ttnn.from_torch(torch_input_tensor, dtype=ttnn.float32, layout=ttnn.TILE_LAYOUT, device=device)

    output_tensor = ttnn.mean(input_tensor, dim=reduce_dim)
    output_tensor = ttnn.to_torch(output_tensor)

    assert_with_pcc(torch_output_tensor, output_tensor)

resulted in

FAILED test_mean2.py::test_reduce_h[32] - AssertionError: 0.9998316111876924
FAILED test_mean2.py::test_reduce_h[50] - AssertionError: 0.999309887656723
FAILED test_mean2.py::test_reduce_h[128] - AssertionError: 0.9949648376563645
FAILED test_mean2.py::test_reduce_h[200] - AssertionError: 0.9940303932238288
FAILED test_mean2.py::test_reduce_h[512] - AssertionError: 0.9832944977683854
FAILED test_mean2.py::test_reduce_h[800] - AssertionError: 0.9470958968139239
FAILED test_mean2.py::test_reduce_h[1600] - AssertionError: 0.7357217229341707
FAILED test_mean2.py::test_reduce_h[3200] - AssertionError: 0.7696898216846108
FAILED test_mean2.py::test_reduce_h[4096] - AssertionError: 0.6391112138824813

I think what might be happening is that the logic is processing one tile at a time, accumulating the numbers.

On average each value would be around .5. By the time you get to a width of 1600, the sum would be 800. Adding .5 to 800 would result in more precision loss. That would also explain why the PCC went down faster when the multiplication is done at the end. The accumulated values are larger.

Next step: looking at reducing in more of an n*log(n) manner in the pytest to see if that helps PCC.

bbradelTT commented 1 week ago

I had to switch to reduce sum, which is independent of additional 0s being padded in because of fatals in padding.

PCCs follow a similar trend:

FAILED test_mean2.py::test_reduce[32] - AssertionError: 0.9996477335026296
FAILED test_mean2.py::test_reduce[50] - AssertionError: 0.9992098215952638
FAILED test_mean2.py::test_reduce[128] - AssertionError: 0.9983768209716934
FAILED test_mean2.py::test_reduce[200] - AssertionError: 0.9980993887635425
FAILED test_mean2.py::test_reduce[512] - AssertionError: 0.9640920022989882
FAILED test_mean2.py::test_reduce[800] - AssertionError: 0.9242553618875884
FAILED test_mean2.py::test_reduce[1600] - AssertionError: 0.8777743499163675
FAILED test_mean2.py::test_reduce[3200] - AssertionError: 0.3992921039572661
FAILED test_mean2.py::test_reduce[4096] - AssertionError: 0.752128086868872
FAILED test_mean2.py::test_reduce[32768] - AssertionError: 0.03179110579730596

Adding in chunks does help PCC

def test_reduce_part(device):
    torch.manual_seed(0)

    # 32, 1024, 32768
    width = 1024
    input_shape = (12, width)
    reduce_dim = -1

    n = 32
    torch_input_tensors = [torch.rand(input_shape, dtype=torch.float32) for i in range(n)]
    torch_input_tensor = torch.cat(torch_input_tensors, reduce_dim)

    torch_output_tensor = torch.sum(torch_input_tensor, dim=reduce_dim, keepdim=True, dtype=torch.float32)

    input_tensors = [ttnn.from_torch(torch_input_tensors[i], dtype=ttnn.float32, layout=ttnn.TILE_LAYOUT, device=device) for i in range(n)]

    intermediate_tensors = [ttnn.sum(input_tensors[i], dim=reduce_dim) for i in range(n)]
    intermediate_torch_tensors  = [ttnn.to_torch(intermediate_tensors[i]) for i in range(n)]
    intermediate_torch_tensor = torch.concat(intermediate_torch_tensors, reduce_dim)
    intermediate_tensor = ttnn.from_torch(intermediate_torch_tensor, dtype=ttnn.float32, layout=ttnn.TILE_LAYOUT, device=device)
    output_tensor = ttnn.sum(intermediate_tensor, dim=reduce_dim)
    output_tensor = ttnn.to_torch(output_tensor)

    assert_with_pcc(torch_output_tensor, output_tensor)

does improve PCC:

32768 (width = 1024)
FAILED test_mean2.py::test_reduce_part - AssertionError: 0.7728428816004724
8192 (width = 256)
FAILED test_mean2.py::test_reduce_part - AssertionError: 0.9844695593902897

I also looked online.

Loss of precision is a known problem for adding long sequences of floating point numbers, and there are algorithms to help with that: https://en.wikipedia.org/wiki/Kahan_summation_algorithm

Next step: I'll see how complex it would be to try to incorporate such an algorithm here.

bbradelTT commented 1 week ago

Another algorithm is https://en.wikipedia.org/wiki/Pairwise_summation which is what I tried to approximate with the adjusted test.

bbradelTT commented 1 week ago

I don't think there's anything in hardware that would help with accuracy. The pairwise summation algorithm is probably going to be easier to implement.

I think the easiest would be to change output_tensor = tt::tt_metal::reduce(input_tensor, tt::tt_metal::ReduceOpMath::SUM... calls in ttnn/cpp/ttnn/operations/reduction/generic/generic_reductions.cpp

for reduction along a single dimension to make sure that the scaling factor is 1 and then split the tensor via something like the below and then apply reduce across the sub-tensors.

    y = torch.ones((1, 1, 1024, 8192), dtype=torch.float32)
    y_tt = ttnn.from_torch(y, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    a = ttnn.slice(y_tt, [0, 0, 0, 0], [1, 1, 1024, 4096])
    b = ttnn.slice(y_tt, [0, 0, 0, 4096], [1, 1, 1024, 8192])

Note: The split has to occur at tile boundaries. Therefore you can't do the following:


Something like the following will not work:

z = torch.ones((1, 1, 50, 70), dtype=torch.float32)
z_tt = ttnn.from_torch(z, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
c = ttnn.slice(z_tt, [0, 0, 0, 0], [1, 1, 50, 35])
d = ttnn.slice(z_tt, [0, 0, 0, 35], [1, 1, 50, 70])



You would need to split it at 32 or 64 instead of 35.

If the split is across two dimensions, then the PCCs will continue to be as they currently are. However, that would take longer and is probably best addressed in the future. This will take at least several days to implement. I'll continue daily updates, although they will be short.

nvukobratTT commented 1 week ago

@bbradelTT, thanks for pushing this further and keeping us posted! 🙌

bbradelTT commented 1 week ago

Today's update:

I was working on other high priority work
should have some time tomorrow to work on this

bbradelTT commented 1 week ago

Today's update:

I thought #13647 was related and looked into how the two are related. Turns out there were unrelated issues, and I created a PR. Will continue to work on this tomorrow.

bbradelTT commented 6 days ago

Today's update:

started implementation
should be able to get an initial PR out early next week

bbradelTT commented 3 days ago

Today's update:

initial implementation is done. PCC goes up to .978 for size of 3200
there are still some errors that I need to debug, and I don't know how much higher the PCC can go

bbradelTT commented 2 days ago

Still one issue with pad where inputs are not multiples of tiles. However, I got the following PCCs for inputs of size (1, 1, 12, n) where n is passed in:

FAILED test_mean2.py::test_reduce[32] - AssertionError: 0.9996477335026296
FAILED test_mean2.py::test_reduce[50] - AssertionError: 0.9992098215952638
FAILED test_mean2.py::test_reduce[128] - AssertionError: 0.9990462253048942
FAILED test_mean2.py::test_reduce[512] - AssertionError: 0.9965267965250721
FAILED test_mean2.py::test_reduce[1024] - AssertionError: 0.9889842589831356
FAILED test_mean2.py::test_reduce[1600] - AssertionError: 0.9852341708352339
FAILED test_mean2.py::test_reduce[3200] - AssertionError: 0.9786824431459795
FAILED test_mean2.py::test_reduce[4096] - AssertionError: 0.963620489697217
FAILED test_mean2.py::test_reduce[32768] - AssertionError: 0.7968057244607337

Would this be sufficient?

nvukobratTT commented 1 day ago

Thanks for the update @bbradelTT!

@dgolubovicTT can you confirm if we need confirmation for other shapes besides the (1, 1, 12, n) for the reduce mean?

For current shapes, PCC is okayish, but it'll be good to fix reduce ops as PCC issues are common on Forge models (other models are not P0 atm)..

@bbradelTT for these PCC issues to be generally fixed on the TTNN side, do you have a separate issue? Even if the fix for this case is currently OK, I assume we will hit many more of them pretty soon..

Thanks once again for helping us out with this one!

dgolubovicTT commented 1 day ago

@nvukobratTT 12 probably comes from initial nuber of tokens in llama. @chandrasekaranpradeep correct me if I am wrong, but since number of input tokens can be any number less than max input tokens => we need ttnn.mean to work generally (not just 12).

chandrasekaranpradeep commented 1 day ago

@dgolubovicTT Thats Right The reduce op input shape is (1, 12, 3200) Here, 1 - refers to batch_size 12 -> refers to number of token in the sequences (It will vary any number less than max sequence length) 3200 -> embedding dimension (Fixed for open llama) As @dgolubovicTT mentioned, we need ttnn.mean to work generally

dgolubovicTT commented 1 day ago

@chandrasekaranpradeep I would suggest you add test for this on our side that will cover all input shapes from (1,1,3200) to (1,max_num_tokens, 3200). Therefore, we will know for sure what shapes fail. This is overlapping with unique ops testing, and I think this is a good example for how to use it. We have to support all ops in llama that have variable sequence length for all sequence lengths, right? So we should add those tests covering all sequence lengths as soon as possible, for two reasons:

Prevent regressions (catch them early)
See how far we are from generality on each ttnn op with respect to sequence length. What are your thoughts @chandrasekaranpradeep @nvukobratTT. We can resume this topic offline, to avoid overloading this thread...

dgolubovicTT commented 1 day ago

I tried running reduce mean test for shape (1,32,3200) dim = -1, and it also gives data mismatch. Since I generated input using torch.rand mean should be around 0.5 for all. That is the case in torch output:

fw_out is  tensor([[[0.4979],
         [0.4969],
         [0.5080],
         [0.5029],
         [0.5012],
         [0.5046],
         [0.4993],
         [0.5034],
         [0.5109],
         [0.4984],
         [0.4972],
         [0.4963],
         [0.5007],
         [0.5020],
         [0.4910],
         [0.4976],
         [0.5074],
         [0.4978],
         [0.4947],
         [0.5001],
         [0.4986],
         [0.4956],
         [0.5075],
         [0.4962],
         [0.4956],
         [0.5050],
         [0.5009],
         [0.5055],
         [0.5023],
         [0.5025],
         [0.4936],
         [0.4976]]])

However, ttnn output gives consistently less than 0.5 for all 32:

co_out is  tensor([[[0.4648],
         [0.4707],
         [0.4844],
         [0.4727],
         [0.4570],
         [0.4766],
         [0.4727],
         [0.4785],
         [0.4883],
         [0.4707],
         [0.4766],
         [0.4648],
         [0.4668],
         [0.4668],
         [0.4629],
         [0.4648],
         [0.4688],
         [0.4688],
         [0.4590],
         [0.4805],
         [0.4766],
         [0.4668],
         [0.4824],
         [0.4648],
         [0.4668],
         [0.4844],
         [0.4746],
         [0.4766],
         [0.4668],
         [0.4707],
         [0.4668],
         [0.4766]]])

Interestingly I tried this for shapes (1,32,x) where x is going from 10 - 400 (excluded) with step 10.

These shapes are failing with data mismatch:

(1,32,210)
(1,32,220)
(1,32,250)
(1,32,330)
(1,32,340)
(1,32,350)
(1,32,360)
(1,32,370)
(1,32,380)
(1,32,390)

For ones that are less than 210 there is no data mismatch.

dgolubovicTT commented 1 day ago

Since inputs are from torch.rand (which is Uniform distribution [0,1)) it is really odd that mean over 3200 values is 0.46-0.48. It seems that mean doesn't work on bigger dimensions at all...

bbradelTT commented 1 day ago

Today's update:

still blocked by pad. Created https://github.com/tenstorrent/tt-metal/issues/15511 to track fixing pad

dgolubovicTT commented 1 day ago

Hey @bbradelTT, I've talked to @nvukobratTT offline to scope requirements for this issue.

For the first phase we need ttnn.mean to work for shapes (1,12,3200), (1,128,3200) and (1,1,3200). Fixing this is P0.

For the second phase, we will need ttnn.mean to work for all shapes (1,x,3200) for x in [1,2048]. Only then, we can get Llama inference for any number of tokens on input. This second phase can be P1.

So after you finish this first phase, we will move this issue to P1, and go from there.

tenstorrent / tt-metal

[Bug Report] ttnn.mean op - Data Mismatch #13621