Closed zhaoshitian closed 1 year ago
Hi, we are actually removing the tokens whose cumulative probability exceeds top_p. This means that, with the tokens sorted from most to least probable, we only keep the top tokens whose probabilities sum to about top_p (for example, if we set top_p = 0.9 and the first 10 tokens sum to 0.9, we will discard every token after the 10th and sample only from the first 10). So it is the low-probability tail that gets removed, not the tokens with high logits.
This is standard nucleus sampling. Hope that makes sense!
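To make the idea concrete, here is a minimal sketch of the filtering step in plain Python. It is for illustration only and is not the code from the model's generate method; the function names top_p_filter and sample_top_p are hypothetical, and it assumes the input is already a list of probabilities (after softmax). Implementations differ on whether the token that crosses the threshold is kept; this sketch keeps it.

```python
import random


def top_p_filter(probs, top_p):
    """Return {token_id: renormalized probability} for the kept tokens.

    Hypothetical helper: `probs` is a list of probabilities indexed by token id.
    """
    # Sort token ids from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    kept = []
    cumulative = 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        # Stop once the kept tokens cover at least top_p of the mass.
        # Everything after this point (the low-probability tail) is dropped.
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1 again.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample_top_p(probs, top_p, rng=random):
    """Sample one token id from the filtered distribution."""
    nucleus = top_p_filter(probs, top_p)
    token_ids = list(nucleus)
    weights = [nucleus[i] for i in token_ids]
    return rng.choices(token_ids, weights=weights, k=1)[0]
```

For example, with probs = [0.05, 0.5, 0.1, 0.3, 0.05] and top_p = 0.85, the three most probable tokens (ids 1, 3 and 2) are kept and the two tokens with probability 0.05 are never sampled.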
I understand! Thank you so much!!
In the generate method of the model, I notice that you remove some tokens according to the top_p value. I don't understand why you remove the tokens with high logits. Could you point me to some material about this? I'd appreciate it so much!