LostRuins opened this issue 1 week ago
@LostRuins Apologies, I think I might have solved the issue! It seems the edge case of a singular token `[INST]` / `[/INST]` (ids 3 and 4) did not get tokenized correctly, so it was tokenized as simply `[0]`. I've accounted for this edge case now!! Apologies for the slowness!
```python
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "[INST]",
    response_part = "[/INST]",
)
```
For local machines, please update Unsloth-Zoo to the nightly branch via:

```
pip uninstall unsloth-zoo -y && pip install --upgrade --no-cache-dir git+https://github.com/unslothai/unsloth-zoo.git@nightly
```
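A quick way to verify the fix locally is to check that each marker round-trips to a single token id (a minimal sketch; the exact checkpoint name is an assumption, substitute your own Mistral model):

```python
from transformers import AutoTokenizer

# Checkpoint name is an assumption for illustration
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-Instruct-2409")

# [INST] / [/INST] are single special tokens (ids 3 and 4 in Mistral's vocab),
# so each should tokenize to exactly one id rather than [0]
print(tokenizer("[INST]", add_special_tokens = False).input_ids)   # expected: [3]
print(tokenizer("[/INST]", add_special_tokens = False).input_ids)  # expected: [4]
```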
@Erland366 You can close the other issue since I just fixed it!
Hi @danielhanchen, thanks for helping. After trying the nightly branch, the Mistral format seems to be fine; however, I think there are still boundary masking issues, especially visible when using the Alpaca format.
Here's an example that can hopefully help reproduce it. Also tested on Mistral Small.
```python
print(dataset['text'][0])
```

```
This is my fun little AI chat demo
### Instruction:
What color is apple</s>
### Response:
Apple is red</s>
### Instruction:
What about pear?</s>
### Response:
Pear is green</s>
```
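For reference, the text above can be produced with a simple formatting function along these lines (a sketch; the actual dataset-building code is an assumption, only the resulting format matters):

```python
# Sketch of the Alpaca-style multi-turn format used above;
# each turn is closed with the </s> EOS marker
def format_alpaca(system, turns):
    text = system
    for instruction, response in turns:
        text += f"\n### Instruction:\n{instruction}</s>\n### Response:\n{response}</s>"
    return text

print(format_alpaca(
    "This is my fun little AI chat demo",
    [("What color is apple", "Apple is red"),
     ("What about pear?", "Pear is green")],
))
```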
Now testing the masking:

```python
# Use the id of a space token to make masked (-100) positions visible when decoding
space = tokenizer(" ", add_special_tokens = False).input_ids[0]

from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "### Instruction:\n",
    response_part = "### Response:\n",
)

# Substitute the space token for masked labels, then decode to see what is trained on
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[0]["labels"]])
```

which gives:

```
\nApple is red</s>\n\n### \nPear is green</s>
```
As you can see, the most glaring issue is the presence of the leaked `###` token, as well as the unmasked newline before the response. Please try it and see if you're able to reproduce this; I'd be happy to assist in testing.
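If it helps, here is a minimal sketch for inspecting the raw label ids around the start of each unmasked span, which makes the boundary tokens easier to see than the decoded string (assumes the `trainer` and `tokenizer` from above):

```python
ids = trainer.train_dataset[0]["input_ids"]
labels = trainer.train_dataset[0]["labels"]

# Print the tokens just before and at the start of each unmasked (trained) span
for i, label in enumerate(labels):
    if label != -100 and (i == 0 or labels[i - 1] == -100):
        window = range(max(0, i - 3), min(len(ids), i + 2))
        print([(ids[j], tokenizer.decode([ids[j]]), labels[j] != -100) for j in window])
```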
I have a suspicion that might help debug this:
There are multiple tokenizations for the sequence `###` in Mistral Small. In particular, a standalone `###` tokenizes to token ID 28100; however, adding a newline (`\n###`) tokenizes to token ID 1542, a completely different token. This means that `### Instruction` will use the ID 28100 version, while within the actual prompt you are likely to encounter the ID 1542 version after a newline, unless it's adjacent to another character (e.g. `</s>###`), in which case it will still be ID 28100.
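This is easy to check directly (a minimal sketch; the ids in the comments are taken from the observation above and depend on the tokenizer):

```python
# Context changes how "###" is tokenized
print(tokenizer("###", add_special_tokens = False).input_ids)      # standalone: contains 28100
print(tokenizer("\n###", add_special_tokens = False).input_ids)    # after a newline: contains 1542 instead
print(tokenizer("</s>###", add_special_tokens = False).input_ids)  # adjacent to another token: 28100 again
```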
Hopefully this can be resolved automatically. If not, may I suggest you allow us to specify multiple `instruction_part` values, so that tuners can add all the "variants" necessary for proper masking after token merges.
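For example, something like this hypothetical extension (the list-valued arguments are not the current API, just an illustration of the idea):

```python
# Hypothetical, not supported today: one entry per tokenization variant
trainer = train_on_responses_only(
    trainer,
    instruction_part = ["### Instruction:\n", "\n### Instruction:\n"],
    response_part = ["### Response:\n", "\n### Response:\n"],
)
```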
Btw, if you want to chat to clarify, I am @concedo on the Unsloth Discord; that might be faster.
Trying to finetune Mistral Small; however, I see this warning:

[warning screenshot]

And the sanity check fails: it returns an empty string.