DataCollatorForCompletionOnlyLM does not recognize a given response template within a prompt

superleesa commented 10 months ago

I noticed that DataCollatorForCompletionOnlyLM sometimes fails to recognize a response template that is present in the prompt. This happens whenever the tokenizer tokenizes the symbols at the end of the response key differently in isolation than it does when the key appears inside the prompt.

I'm currently using the T5 tokenizer with GPT-NeoX (which might be unusual, but the pre-trained model was trained with it).

Example (note: I'm fine-tuning a Japanese LLM, but this problem should not be limited to Japanese models): with the response key ### 回答: ("### Answer:") and the input shown here (response key in bold), I get the following warning and the example is excluded from the loss calculation.

### 指示:<NL>素敵なメッセージのある古典的な英語の詩のリストを教えてください。<NL><NL>**### 回答:**1.ラドヤード・キプリングの "If":この詩は、誠実な人生を送ることの重要性を強調し、人生についてより明確な見通しを得ることを目的としている。<NL>2."Requiescat"(オスカー・ワイルド作):悲嘆、喪失感、喪に服すことが多くあります。不幸にも不慮の死を遂げた妹に捧げた詩。<NL>3."And Still I Rise" by Maya Angelou:この詩は、人生や社会が投げかけるあらゆる試練に打ち勝ち、人生を前進させるための希望、勇気、不屈の姿勢についてのすべてです。<NL>4.ディラン・トーマスの "Do not go gentle into that good night":ウェールズの詩人、ディラン・トーマスの代表作。つまり、この詩は、命の尊さを優しく教えてくれるものなのです。</s>[PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD]

UserWarning: Could not find response key `### 回答:` in the following instance: ### 指示:<NL>素敵なメッセージのある古典的な英語の詩のリストを教えてください。<NL><NL>### 回答:1.ラドヤード・キプリングの "If":この詩は、誠実な人生を送ることの重要性を強調し、人生についてより明確な見通しを得ることを目的としている。<NL>2."Requiescat"(オスカー・ワイルド作):悲嘆、喪失感、喪に服すことが多くあります。不幸にも不慮の死を遂げた妹に捧げた詩。<NL>3."And Still I Rise" by Maya Angelou:この詩は、人生や社会が投げかけるあらゆる試練に打ち勝ち、人生を前進させるための希望、勇気、不屈の姿勢についてのすべてです。<NL>4.ディラン・トーマスの "Do not go gentle into that good night":ウェールズの詩人、ディラン・トーマスの代表作。つまり、この詩は、命の尊さを優しく教えてくれるものなのです。</s>[PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD] This instance will be ignored in loss calculation. Note, if this happens often, consider increasing the `max_seq_length`.

However, we can see that the response key "### 回答:" is definitely within the prompt.

Upon debugging, I found that ":1" is merged into a single subword with token id 1852, whereas ":" on its own has token id 276. The same problem occurs in many other examples, but not in all of them: it only appears when the character that follows the response key forms such a merged subword with the colon.
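To make the failure mode concrete, here is a minimal sketch of an id-based sublist search, which is roughly how a collator that matches on token ids would look for the response key. The helper `find_sublist` is hypothetical and is not taken from the library; only the ids 276 and 1852 come from the debugging session above, and every other id is a made-up placeholder.

```python
from typing import List, Optional


def find_sublist(haystack: List[int], needle: List[int]) -> Optional[int]:
    """Return the index where `needle` starts inside `haystack`, or None."""
    n = len(needle)
    for start in range(len(haystack) - n + 1):
        if haystack[start:start + n] == needle:
            return start
    return None


# 276 (":") and 1852 (":1") are the ids reported in this issue.
# HASHES and ANSWER, and the surrounding ids, are placeholders.
HASHES, ANSWER, COLON, COLON_ONE = 900, 901, 276, 1852

# The response key tokenized on its own ends with the plain colon.
response_key_ids = [HASHES, ANSWER, COLON]

# Inside the prompt the colon is merged with the following "1".
prompt_ids = [10, 11, 12, HASHES, ANSWER, COLON_ONE, 13, 14]

print(find_sublist(prompt_ids, response_key_ids))  # None
```

The text of the key is present, but its id sequence is not, so an id-based search reports no match.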

I worked around this problem by adding a space after the response key (i.e. changing "### 回答:" to "### 回答: "). However, this would not have worked if the vocabulary contained a token for ": 1".

I thought there should be a way to handle this problem.

superleesa commented 10 months ago

perhaps we can compare the original texts, not the input ids? or add a special token for this purpose?
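A rough sketch of the text-comparison idea, under stated assumptions: the response template is located in the raw string, and per-token character spans are used to decide which labels to keep. The function name `mask_prompt_by_text` and the toy spans and ids are hypothetical. With Hugging Face fast tokenizers the spans could come from `return_offsets_mapping=True`, but that wiring is not shown here, and tokens with an empty span (which some tokenizers report for added special tokens such as the end-of-sequence token) would need separate handling.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # label value conventionally skipped by PyTorch cross-entropy


def mask_prompt_by_text(
    text: str,
    offsets: List[Tuple[int, int]],
    input_ids: List[int],
    response_template: str,
) -> List[int]:
    """Build labels that mask everything up to the end of the response template.

    `offsets[i]` is the (start, end) character span of token i in `text`.
    A token is kept as a label if it covers at least one character after
    the template, so a merged token such as ":1" is kept.
    """
    char_pos = text.find(response_template)
    if char_pos == -1:
        # Template is not in the text at all: ignore the whole instance.
        return [IGNORE_INDEX] * len(input_ids)
    response_start = char_pos + len(response_template)
    return [
        token_id if end > response_start else IGNORE_INDEX
        for token_id, (_start, end) in zip(input_ids, offsets)
    ]


# Toy instance; spans and ids are made up except 1852 for ":1".
text = "Q ### A:1. foo"
offsets = [(0, 1), (1, 2), (2, 5), (5, 6), (6, 7), (7, 9), (9, 10), (10, 14)]
input_ids = [10, 11, 12, 11, 13, 1852, 14, 15]

labels = mask_prompt_by_text(text, offsets, input_ids, "### A:")
print(labels)  # [-100, -100, -100, -100, -100, 1852, 14, 15]
```

Keeping the token that straddles the boundary is a design choice; masking it instead would drop the first character of the answer from the loss.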

superleesa commented 10 months ago

Adding a special token seems to work and looks like the simplest option:

Add new token: tokenizer.add_special_tokens({"additional_special_tokens": ["[Answer]"]})

Update model: model.resize_token_embeddings(len(tokenizer))

Update prompt: ### 指示:<NL>素敵なメッセージのある古典的な英語の詩のリストを教えてください。<NL><NL>**[Answer]**1.ラドヤード・キプリングの "If":この詩は、誠実な人生を送ることの重要性を強調し、人生についてより明確な見通しを得ることを目的としている。<NL>2."Requiescat"(オスカー・ワイルド作):悲嘆、喪失感、喪に服すことが多くあります。不幸にも不慮の死を遂げた妹に捧げた詩。<NL>3."And Still I Rise" by Maya Angelou:この詩は、人生や社会が投げかけるあらゆる試練に打ち勝ち、人生を前進させるための希望、勇気、不屈の姿勢についてのすべてです。<NL>4.ディラン・トーマスの "Do not go gentle into that good night":ウェールズの詩人、ディラン・トーマスの代表作。つまり、この詩は、命の尊さを優しく教えてくれるものなのです。</s>[PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD]

Update response template: response_template="[Answer]"
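The steps above can be collected into two small helpers. This is only a sketch: `register_response_marker` and `build_prompt` are hypothetical names, and the code assumes `tokenizer` and `model` are already-loaded Hugging Face objects exposing `add_special_tokens`, `convert_tokens_to_ids` and `resize_token_embeddings`.

```python
def register_response_marker(tokenizer, model, marker: str = "[Answer]") -> int:
    """Add `marker` as a special token and resize the model embeddings.

    Returns the token id assigned to the marker.
    """
    # add_special_tokens returns how many tokens were actually added.
    num_added = tokenizer.add_special_tokens(
        {"additional_special_tokens": [marker]}
    )
    if num_added > 0:
        # New vocabulary entries need matching rows in the embedding matrix.
        model.resize_token_embeddings(len(tokenizer))
    return tokenizer.convert_tokens_to_ids(marker)


def build_prompt(instruction: str, answer: str, marker: str = "[Answer]") -> str:
    """Format one training example with the marker in place of the old key."""
    return f"### 指示:<NL>{instruction}<NL><NL>{marker}{answer}"


# Usage sketch (assumes `trl` is installed and tokenizer/model are loaded):
# marker_id = register_response_marker(tokenizer, model)
# collator = DataCollatorForCompletionOnlyLM(
#     response_template="[Answer]", tokenizer=tokenizer
# )
```

Because a special token is not merged with neighbouring characters, its id should be the same regardless of what follows it in the prompt. The embedding for the new token starts untrained, so it is learned during fine-tuning.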

younesbelkada commented 10 months ago

Nice, thanks for investigating @superleesa! Note that for this case you can directly pass the expected input_ids of the response template: https://huggingface.co/docs/trl/sft_trainer#using-tokenids-directly-for-responsetemplate

superleesa commented 10 months ago

@younesbelkada Hi, thanks for the suggestion! I read the article but I don't think it will solve my problem. To clarify my points I will give you an example:

A response template "### Answer: " might be encoded as [1, 1, 1, 2, 3, 4] (these token ids are made up: 1 for "#", 2 for a space, 3 for "Answer", 4 for ": "). Let's say I'm dealing with a batch of sentences. Two of them are:

"What is your name? ### Answer: As a Language Model I cannot answer that question."
"What is today's weather? ### Answer: I cannot access weather data."

Let's also assume that the tokenizer has two tokens ": As" (id=5) and ": I" (id=6). Then the span covering the response template will be encoded as [1, 1, 1, 2, 3, 5] in the first sentence and [1, 1, 1, 2, 3, 6] in the second.

Now you can see that, although both sentences contain the response template, the span covering it is encoded differently in each sentence. The important implication is that the encoding can differ for ALL sentences, depending on the tokenizer and on the characters that follow the response template in each sentence.

Therefore, I believe that passing a single response key (even if it is encoded beforehand, as in the example from the article you shared) will not solve this problem.
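The made-up example can be reproduced with a toy tokenizer. This is a sketch under loud assumptions: the vocabulary and ids are the invented ones from this comment, greedy longest-match is used only for illustration (real subword tokenizers such as BPE or unigram models work differently), and characters outside the toy vocabulary fall back to a hypothetical `1000 + code point` id.

```python
from typing import Dict, List

# Invented vocabulary from the comment above; not a real tokenizer.
VOCAB: Dict[str, int] = {
    "#": 1, " ": 2, "Answer": 3, ": ": 4, ": As": 5, ": I": 6,
}
FALLBACK_BASE = 1000  # hypothetical id range for out-of-vocabulary characters


def encode(text: str) -> List[int]:
    """Greedy longest-match tokenization over VOCAB."""
    pieces = sorted(VOCAB, key=len, reverse=True)  # try longest pieces first
    ids: List[int] = []
    pos = 0
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                ids.append(VOCAB[piece])
                pos += len(piece)
                break
        else:
            # No vocabulary piece matched: emit a per-character fallback id.
            ids.append(FALLBACK_BASE + ord(text[pos]))
            pos += 1
    return ids


def contains(haystack: List[int], needle: List[int]) -> bool:
    """True if `needle` occurs as a contiguous run inside `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


template_ids = encode("### Answer: ")
print(template_ids)  # [1, 1, 1, 2, 3, 4]

sentences = [
    "What is your name? ### Answer: As a Language Model I cannot answer that question.",
    "What is today's weather? ### Answer: I cannot access weather data.",
]
for sentence in sentences:
    print(contains(encode(sentence), template_ids))  # False for both
```

Both sentences contain the template as text, yet neither contains its id sequence, because the trailing ": " is absorbed into ": As" or ": I".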

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

stvhuang commented 9 months ago

I ran into the same problem.

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

huggingface/trl #1183