jshin49 closed this issue 1 month ago.
Async and sync use the exact same functions in the backend. However, matrix multiplication kernels in mixed precision are not deterministic and can lead to differences in generations when the batch size increases. do_sample=False does not do anything when top_k is set; sampling will be activated anyway. top_k=1 might be the reason for the weird behaviour.
Can you try to reproduce the error without top_k? Just using greedy decoding? My bet is that the multinomial is doing some weird things.
See: https://github.com/pytorch/pytorch/issues/48841 https://github.com/huggingface/transformers/issues/22979
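To make the nondeterminism claim concrete, here is a hedged NumPy sketch (not the actual TGI/cuBLAS kernels): float16 addition is not associative, so the same reduction computed in a different order (which is exactly what a different batch size can cause inside a matmul kernel) may produce a different result.

```python
import numpy as np

# Illustrative values only: near 2048, float16 spacing is 2, so adding
# 1 at a time can be lost to rounding while adding 2 at once is not.
a, b, c = np.float16(2048), np.float16(1), np.float16(1)

left = np.float16(np.float16(a + b) + c)   # (2048 + 1) + 1 -> 2048.0
right = np.float16(a + np.float16(b + c))  # 2048 + (1 + 1) -> 2050.0

print(left, right, left == right)
```

The same effect, accumulated over thousands of multiply-adds per logit, can flip which token ends up with the highest score.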
Ok, when I tried this with a custom-built kernel, the generation seems stable (even with 128 async requests).
I couldn't reproduce the error. However, since I tried this with a custom-built kernel (following the local installation steps), it's not exactly the same as the above environment. Let me try reproducing it with the original environment to see if it's any different.
Here's one thing I can confirm though: the generations using top_k=1 and top_k=None are definitely different.
@jshin49 yes, top_k=1 is not equivalent to greedy. Top-k will sample from tokens with scores >= the kth highest score. This means it could be choosing from more than k tokens if there is a tie for kth place; in particular, when k=1 it will sample randomly from all tokens tied for the highest score. Greedy uses argmax, which deterministically chooses the token with the highest score and the lowest id.
Intuitively, to me at least, this makes k=1 sampling with a fixed random seed preferable to greedy, since with greedy you can end up with an unintended bias towards tokens with lower ids.
Note that with 16 bits, such score collisions are quite common, especially with the larger vocab sizes.
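To illustrate the tie behaviour described above, here is a minimal NumPy sketch (the logits are made up; this is not TGI's actual sampling code):

```python
import numpy as np

# Made-up logits where token ids 1 and 2 tie for the highest score.
logits = np.array([1.5, 3.0, 3.0, 0.2], dtype=np.float16)

# Greedy decoding: argmax deterministically picks the lowest id among ties.
greedy_id = int(np.argmax(logits))

# top_k=1 filtering: keep every token whose score is >= the 1st-highest
# score (so ties survive the filter), then sample among the survivors.
kth_score = np.sort(logits)[-1]
candidates = np.flatnonzero(logits >= kth_score)

rng = np.random.default_rng()
sampled_id = int(rng.choice(candidates))

print(greedy_id)   # always 1
print(sampled_id)  # 1 or 2, chosen at random
```

With a fixed seed the top_k=1 path is repeatable, but it is still a draw among all tied tokens rather than a fixed argmax.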
Any update on this?
> Note that with 16 bits, such score collisions are quite common, especially with the larger vocab sizes.

For the score collisions with 16 bits, could you please give some examples, or relevant references? @njhill
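In case a concrete example helps, here is a hedged NumPy illustration (not taken from TGI): float16 carries only about 11 bits of precision, so distinct float32 logits that are close together collapse to the same float16 value, creating exactly the kind of tie discussed above. With a 30k+ token vocabulary, such collisions among the top scores become common.

```python
import numpy as np

# Three distinct float32 scores around 4.10. Float16 spacing near 4.0
# is 1/256 (about 0.0039), so differences of 0.0001 are rounded away.
logits32 = np.array([4.1015, 4.1016, 4.1017], dtype=np.float32)
logits16 = logits32.astype(np.float16)

print(np.unique(logits32).size)  # 3 distinct values in float32
print(np.unique(logits16).size)  # 1 value in float16: a tie
```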
System Info
Information
Tasks
Reproduction
top_k=1&do_sample=false
Expected behavior
Async and sync requests should have the same generation results for the same prompt and parameters.
The funny thing is, when I send the same amount of requests synchronously, the generations are stable.
You can also see from the above image that the model even degenerates sometimes. This behavior happened when I overloaded the model with 100 async requests from two different user endpoints.
Basically, the model gets worse when I send more requests simultaneously
I'm guessing this has something to do with the continuous batching feature?