Closed: mgerstgrasser closed this issue 10 months ago
@younesbelkada While we're talking about this anyway: would it potentially make sense to add a more flexible system as an alternative to matching strings here? I have in mind something like passing a function to the data collator that maps an example to a list of `(first_token_idx, last_token_idx)` pairs that mark each part of the example that is assistant-generated, i.e. should be unmasked.
I'm asking for two reasons:
I will probably implement something like this for a project I'm working on - I'd be happy to open a PR once it's done, if you think that would be useful to have in the library.
(Asking this here, as it's broadly related to the issue, but not to the fix in the PR.)
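For concreteness, here is a rough sketch of the kind of interface I have in mind (the names, e.g. `assistant_spans_fn`, are hypothetical and not part of the current trl API):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SpanBasedCompletionOnlyCollator:
    # Maps one tokenized example to a list of (first_token_idx, last_token_idx)
    # pairs (inclusive) covering the assistant-generated parts.
    assistant_spans_fn: Callable[[List[int]], List[Tuple[int, int]]]
    ignore_index: int = -100

    def mask_labels(self, input_ids: List[int]) -> List[int]:
        # Everything is ignored by the loss except the spans returned by the function.
        labels = [self.ignore_index] * len(input_ids)
        for first, last in self.assistant_spans_fn(input_ids):
            labels[first : last + 1] = input_ids[first : last + 1]
        return labels
```

Since such a function works directly on token indices rather than on matched template strings, it would also sidestep tokenizer-dependent matching issues like the one described below.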
The current code in `DataCollatorForCompletionOnlyLM` assumes that the first detected occurrence of `instruction_template` comes before the first detected occurrence of `response_template`. This is reasonable, since in current applications conversations are initiated by the user, not the assistant. However, it can fail if the first instruction is tokenized differently from all the other instructions, which can happen when a context-sensitive tokenizer such as Llama-2's tokenizes the `instruction_template` differently at the start of a string than in the middle. In particular, this happens in practice with TinyLlama:
`<|user|>` gets tokenized as `529, 29989, 1792, 29989, 29958` at the start of a conversation, but as `29966, 29989, 1792, 29989, 29958` in later messages.

Reproduction snippet:
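A minimal sketch of such a reproduction (the checkpoint name and chat formatting below are assumptions for illustration, not necessarily the exact original snippet):

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

# Assumed TinyLlama chat checkpoint; any tokenizer with the same
# context-sensitive behaviour will show the problem.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# A two-turn conversation in TinyLlama's chat format.
text = (
    "<|user|>\nHello!</s>\n"
    "<|assistant|>\nHi, how can I help?</s>\n"
    "<|user|>\nWhat is 1 + 1?</s>\n"
    "<|assistant|>\nIt is 2.</s>\n"
)

# The first "<|user|>" tokenizes as [529, 29989, 1792, 29989, 29958],
# while later occurrences start with 29966 instead of 529.
input_ids = tokenizer(text, add_special_tokens=False)["input_ids"]

collator = DataCollatorForCompletionOnlyLM(
    response_template="<|assistant|>",
    instruction_template="<|user|>",
    tokenizer=tokenizer,
)
batch = collator([input_ids])

# Before the fix, the template string only matches one of the two tokenizations
# of "<|user|>", so not all user messages end up masked (ignore_index) in the labels.
print(batch["labels"])
```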
PR #1185 fixes this, and makes the above snippet mask out all the user messages correctly.