tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Llama3.1-8B and Mixtral non-deterministic output in demo_with_prefill.py #11850

Closed mtairum closed 1 week ago

mtairum commented 3 weeks ago

Describe the bug

Running the Llama3.1-8B demo with prefill produces a different output on every run. Every output is reasonable, but no two runs match.

For now we've disabled the output token validation from main to avoid blocking CI.

To Reproduce

Tested on latest main ce56b42712429416485b377d30f88500bf243dfa.

I also tried the reset_seeds fixture, with no luck. I double-checked that we are doing argmax, so I'm not sure where the variability is coming from.
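For illustration only: argmax is deterministic for identical logits, so variability under greedy decoding suggests the logits themselves differ slightly between runs. The sketch below uses made-up logit values (not taken from the demo) to show how a tiny numerical difference can flip the selected token when two candidates are nearly tied. The `argmax` helper is a plain-Python stand-in, not the demo's actual implementation.

```python
def argmax(values):
    """Return the index of the largest value (the first index wins on ties)."""
    best = 0
    for i, v in enumerate(values):
        if v > values[best]:
            best = i
    return best


# Hypothetical logits for one decode step; tokens 1 and 2 are nearly tied.
logits_run_a = [0.10, 2.5000, 2.4999, -1.0]
# The same step in another run, with a tiny perturbation on token 2.
logits_run_b = [0.10, 2.5000, 2.5001, -1.0]

token_a = argmax(logits_run_a)
token_b = argmax(logits_run_b)
print(token_a, token_b)
```

Once a single token differs, every later decode step sees a different context, which would be consistent with outputs that start identically and then drift apart.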

I tried a more robust prompt: "Can you describe and comment the following number sequence? 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163". As can be seen below, it still outputs a slightly different answer every time.

pytest models/demos/wormhole/llama31_8b/demo/demo_with_prefill.py::test_llama_demo[instruct_weights-1_batch]

Bad output

I've run multiple times and got different variations of the output. Examples below.

<|begin_of_text|><|start_header_id|>user<|end_header_id|>Can you describe and comment the following number sequence? 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The given number sequence appears to be a list of prime numbers. A prime number is a positive integer that is divisible only by itself and 1. In other words, it is a number that is not divisible by any other number except for 1 and itself.

Here are some observations and comments about the sequence:\n\n* The sequence starts with 2, which is the smallest prime number.
* The sequence consists of consecutive prime numbers, which is a notable property of prime numbers.
* The sequence appears to be a random collection of prime numbers, but it's actually a list of

---------

<|begin_of_text|><|start_header_id|>user<|end_header_id|>Can you describe and comment the following number sequence? 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The given number sequence appears to be a list of prime numbers. A prime number is a positive integer that is divisible only by itself and 1.

Here are some observations and comments about the sequence:

1. **Consecutive primes**: The sequence consists of consecutive prime numbers, which is a notable property. This means that each number in the sequence is a prime number, and the sequence is ordered in ascending order.

2. **No gaps**: There are no gaps in the sequence, meaning that every prime number up to a certain limit (in this case, 101) is

------

<|begin_of_text|><|start_header_id|>user<|end_header_id|>Can you describe and comment the following number sequence? 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The given number sequence appears to be a list of prime numbers. A prime number is a positive integer that is divisible only by itself and 1. In other words, it is a number that is not divisible by any other number except for 1 and itself.

Here are some observations and comments about the sequence:

1. **Consecutive primes**: The sequence starts with 2, which is the smallest prime number. The subsequent numbers in the sequence are consecutive prime numbers, i.e., 2, 3, 5, 7, 11, etc. are
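To narrow down where two runs diverge, the outputs can be compared position by position. This is a minimal sketch: `first_divergence` is a hypothetical helper, and whitespace-split words stand in for token IDs (in the real demo one would compare the generated token IDs). The two strings are excerpts from the outputs above.

```python
def first_divergence(a, b):
    """Return the first index where sequences a and b differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Excerpts from the first and second outputs above, split on whitespace.
run_1 = ("A prime number is a positive integer that is divisible "
         "only by itself and 1. In other words, it is").split()
run_2 = ("A prime number is a positive integer that is divisible "
         "only by itself and 1. Here are some observations").split()

idx = first_divergence(run_1, run_2)
print(idx, run_1[idx], run_2[idx])
```

Knowing the first divergent position makes it possible to dump and compare the logits for that single decode step across runs.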

mtairum commented 3 weeks ago

This seems to be happening with Mixtral as well. Tested on branch mixtral-32k-demo.

cglagovichTT commented 2 weeks ago

We see non-deterministic (ND) PCC in our t3k Llama attention tests. We isolated it to this commit: https://github.com/tenstorrent/tt-metal/commit/7b8e627c078e8262a22df07627b00b2f1d645abb#diff-dcb2b8d4bed26e70b5f09bd34fcce0e81d7a8770a96ecf79ac753e4e27559af2

yieldthought commented 2 weeks ago

Probably the same: https://github.com/tenstorrent/tt-metal/issues/11438

mtairum commented 2 weeks ago

For llama3.1-8B on branch aho/unpacker-delay I'm not seeing variability anymore.

The test_model (length = 512) is showing PCC = 0.9632 (we had a cutoff of 0.94), so this was an improvement as well. For a length of 4k the PCC is still 0.8667 (the same as https://github.com/tenstorrent/tt-metal/issues/11438), so no change there.
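For readers unfamiliar with the metric, here is a minimal sketch of a PCC check, assuming PCC means the Pearson correlation coefficient between the flattened reference output and the device output. The `pcc` and `passes` functions and the sample values are illustrative, not the repository's helpers; only the 0.94 cutoff comes from this thread.

```python
import math


def pcc(reference, actual):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(reference)
    if n == 0 or n != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_r = sum(reference) / n
    mean_a = sum(actual) / n
    cov = sum((r - mean_r) * (a - mean_a) for r, a in zip(reference, actual))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_r == 0 or var_a == 0:
        raise ValueError("PCC is undefined for constant inputs")
    return cov / math.sqrt(var_r * var_a)


def passes(reference, actual, cutoff=0.94):
    """Return True if the PCC meets the cutoff (0.94 is the value quoted above)."""
    return pcc(reference, actual) >= cutoff


# Made-up values standing in for flattened reference and device outputs.
reference = [1.0, 2.0, 3.0, 4.0]
actual = [1.1, 1.9, 3.2, 3.9]
print(pcc(reference, actual), passes(reference, actual))
```

Because a cutoff check only reports pass or fail, logging the raw PCC value per run is what makes run-to-run variation visible.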

uaydonat commented 2 weeks ago

Correct me if I am wrong, but neither the commit that caused the non-determinism nor the fix is expected to change the math, so we should expect to get the same PCC as before (unless some other change in between changed the math). The fact that the PCC is not the same, even if it is 0.96, might point to a problem.

mtairum commented 2 weeks ago

@uaydonat My bad on the previous comment. I should've added a 'probably' or double checked with older pipelines.

Although our PCC cutoff is 0.94, the PCC was already 0.9632 before this variability issue was introduced (I checked runs from 1 week and 2 weeks ago), so there wasn't any change there, as expected.

uaydonat commented 1 week ago

Is the fix in main? Should we close this?

mtairum commented 1 week ago

Yes this is in main. Closed.