Closed: Ubadub closed this issue 7 months ago.
Thanks for raising this issue! When LMs don't use a BOS token, it makes no sense to have probabilities for the first token, since logits are computed given some context. This is the default case.

But for cases where LMs do have a BOS token, the first token ends up being the BOS token itself, and it is now the BOS token that is assigned 0 probability -- in such cases you can enable the `bos_token` option by setting it to `True`.

The `compute_stats` function doesn't need to include this functionality, since all of this is handled by the `prepare_text` and `prime_text` functions; all `compute_stats` does is handle logits given some input.

Does this make sense?
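To make the shift concrete, here is a minimal toy sketch (not minicons code; the vocabulary and "logits" below are invented purely for illustration) of how causal-LM scoring works: the logits at position i are the distribution over the token at position i + 1, so the first token only receives a probability if a BOS token is prepended to serve as its context.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def token_logprobs(token_ids, logits):
    """Score tokens the way a causal LM is scored: logits[i] is the
    distribution over the token at position i + 1, so token_ids[0]
    gets no score -- it is pure context."""
    scores = []
    for i in range(1, len(token_ids)):
        probs = softmax(logits[i - 1])
        scores.append(math.log(probs[token_ids[i]]))
    return scores

# Toy vocabulary: 0 = <bos>, 1 = "an", 2 = "apple";
# one made-up logit row per input position.
ids_no_bos = [1, 2]                                   # "an apple"
logits_no_bos = [[0.1, 0.2, 2.0], [0.3, 0.1, 0.2]]
print(len(token_logprobs(ids_no_bos, logits_no_bos)))  # 1 score for 2 tokens

ids_bos = [0, 1, 2]                                   # "<bos> an apple"
logits_bos = [[0.0, 2.0, 0.1]] + logits_no_bos
print(len(token_logprobs(ids_bos, logits_bos)))        # 2: every real token scored
```

With a BOS token prepended, the original first word becomes the second token and so gets a conditional probability; without one, it is unscored context.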
Yes, and thank you for your quick reply. I understand. Would you happen to have any advice for doing a BLiMP-type experiment with a model that does not use a BOS token? If all sentence pairs in the BLiMP corpus had an identical first word, this wouldn't matter, but for some pairs this is not the case (e.g. `matrix_question_npi_licensor_present` and `left_branch_island_echo_question`). Is the simple LM BLiMP evaluation method intelligible in such cases?
I think you're making an important point here about the difference in first word for LMs without a BOS token! For these cases, I guess the difference in the first token would be more indirectly reflected in the log-probs assigned by the model, in the sense that p(apple | an) is likely much, much greater than p(apple | a). I am unsure about other solutions...
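As a worked toy example of this indirect effect (the conditional probabilities below are invented for illustration, not drawn from any real model): even when neither first token is scored, the conditional probability of the shared continuation can still separate the minimal pair.

```python
import math

# Invented bigram conditional probabilities, for illustration only.
p = {
    ("an", "apple"): 0.20,    # p(apple | an): plausible continuation
    ("a", "apple"): 0.001,    # p(apple | a): phonotactically odd
}

def score(sentence):
    """Sum log-probs of all tokens after the first; the first token is
    unscored context when the model has no BOS token."""
    toks = sentence.split()
    return sum(math.log(p[(toks[i - 1], toks[i])]) for i in range(1, len(toks)))

good = score("an apple")
bad = score("a apple")
print(good > bad)  # True: the pair is still separated via the conditionals
```

So even though "an" and "a" themselves receive no probability, the model's preference is recoverable from how each first word conditions what follows.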
Thanks for your very helpful responses (and the very helpful library). Yes I did some more thinking about this and I understand your point. I'll close the issue now :)
No worries -- I still think your point holds, fwiw!
Consider this section of code from `IncrementalLMScorer`:

If I'm understanding this correctly, the class is discarding the probability the model assigns to the first token in every element of a batch. I understand why such logic would make sense in the context of a model that uses a BOS token, but does this mean that this class is unusable for models that do not use a BOS token? It is not at all clear from the docs that this class is only meant to be used with BOS-token models.

A BOS token is mentioned in other places in the code -- for example, as a Boolean argument to `prepare_text` -- but in such places it's clearly marked as optional, with the default being `False`. So I'm a little confused by the lack of such optionality in the function above.

Am I understanding this correctly? If so, is there a workaround (besides reduplication of the code)?
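Since the quoted code block did not survive this copy of the thread, here is a generic sketch of the pattern the question describes (this is standard HuggingFace-style causal-LM alignment, not the actual `IncrementalLMScorer` source):

```python
def shift_for_scoring(input_ids, logits):
    """Standard causal-LM alignment: logits at position i are the
    distribution over the token at position i + 1. Dropping the last
    logit row and the first label means the first token of every
    sequence in the batch is never assigned a probability."""
    shift_logits = [row[:-1] for row in logits]    # per sequence: drop last position
    shift_labels = [ids[1:] for ids in input_ids]  # per sequence: drop first token
    return shift_logits, shift_labels

# Toy batch of two sequences of 3 token ids each; the "logits" here are
# just position markers to show which rows survive the shift.
ids = [[10, 11, 12], [20, 21, 22]]
pos = [["p0", "p1", "p2"], ["p0", "p1", "p2"]]
sl, sb = shift_for_scoring(ids, pos)
print(sl)  # [['p0', 'p1'], ['p0', 'p1']]
print(sb)  # [[11, 12], [21, 22]]
```

The shift itself is unavoidable for a causal LM; what varies is whether a BOS token is prepended first, so that the unscored "first token" is the BOS token rather than a real word.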