LostRuins opened this issue 1 week ago
@LostRuins Apologies, I think I might have solved the issue! It seems the edge case of a singular token `[INST]` / `[/INST]` (ids 3 and 4) did not get tokenized correctly, so it was tokenized as simply `[0]`. I've accounted for this edge case now!! Apologies for the slowness!
```python
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "[INST]",
    response_part = "[/INST]",
)
```
For local machines, please update Unsloth-Zoo to the nightly branch via:

```
pip uninstall unsloth-zoo -y && pip install --upgrade --no-cache-dir git+https://github.com/unslothai/unsloth-zoo.git@nightly
```
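A quick way to verify the fix locally is to check that each marker round-trips to a single token id (a minimal sketch; the exact checkpoint name is an assumption, substitute your own Mistral model):

```python
from transformers import AutoTokenizer

# Checkpoint name is an assumption for illustration
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-Instruct-2409")

# [INST] / [/INST] are single special tokens (ids 3 and 4 in Mistral's vocab),
# so each should tokenize to exactly one id rather than [0]
print(tokenizer("[INST]", add_special_tokens = False).input_ids)   # expected: [3]
print(tokenizer("[/INST]", add_special_tokens = False).input_ids)  # expected: [4]
```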
@Erland366 You can close the other issue since I just fixed it!
Hi @danielhanchen, thanks for helping. After trying the nightly branch, the Mistral format seems to be fine; however, I think there are still boundary masking issues, especially visible when using the Alpaca format.
Here's an example that can hopefully help reproduce it. Also tested on Mistral Small.
```python
print(dataset['text'][0])
```

```
This is my fun little AI chat demo
### Instruction:
What color is apple</s>
### Response:
Apple is red</s>
### Instruction:
What about pear?</s>
### Response:
Pear is green</s>
```
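For reference, the text above can be produced with a simple formatting function along these lines (a sketch; the actual dataset-building code is an assumption, only the resulting format matters):

```python
# Sketch of the Alpaca-style multi-turn format used above;
# each turn is closed with the </s> EOS marker
def format_alpaca(system, turns):
    text = system
    for instruction, response in turns:
        text += f"\n### Instruction:\n{instruction}</s>\n### Response:\n{response}</s>"
    return text

print(format_alpaca(
    "This is my fun little AI chat demo",
    [("What color is apple", "Apple is red"),
     ("What about pear?", "Pear is green")],
))
```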
Now testing the masking:

```python
# Use the id of a space token to make masked (-100) positions visible when decoding
space = tokenizer(" ", add_special_tokens = False).input_ids[0]

from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "### Instruction:\n",
    response_part = "### Response:\n",
)

# Substitute the space token for masked labels, then decode to see what is trained on
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[0]["labels"]])
```

which gives:

```
\nApple is red</s>\n\n### \nPear is green</s>
```
As you can see, the most glaring issue is the presence of the leaked `###` token, as well as the unmasked newline before the response. Please try it and see if you're able to reproduce this; I'd be happy to assist in testing.
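If it helps, here is a minimal sketch for inspecting the raw label ids around the start of each unmasked span, which makes the boundary tokens easier to see than the decoded string (assumes the `trainer` and `tokenizer` from above):

```python
ids = trainer.train_dataset[0]["input_ids"]
labels = trainer.train_dataset[0]["labels"]

# Print the tokens just before and at the start of each unmasked (trained) span
for i, label in enumerate(labels):
    if label != -100 and (i == 0 or labels[i - 1] == -100):
        window = range(max(0, i - 3), min(len(ids), i + 2))
        print([(ids[j], tokenizer.decode([ids[j]]), labels[j] != -100) for j in window])
```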
I have a suspicion that might help debug this:
There are multiple tokenizations for the sequence `###` in Mistral Small. In particular, a standalone `###` tokenizes to token ID 28100; however, adding a newline (`\n###`) tokenizes to token ID 1542, a completely different token. This means that `### Instruction` will use the ID 28100 version, while within the actual prompt you are likely to encounter the ID 1542 version after a newline, unless it's adjacent to another character (e.g. `</s>###`), in which case it will still be ID 28100.
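This is easy to check directly (a minimal sketch; the ids in the comments are taken from the observation above and depend on the tokenizer):

```python
# Context changes how "###" is tokenized
print(tokenizer("###", add_special_tokens = False).input_ids)      # standalone: contains 28100
print(tokenizer("\n###", add_special_tokens = False).input_ids)    # after a newline: contains 1542 instead
print(tokenizer("</s>###", add_special_tokens = False).input_ids)  # adjacent to another token: 28100 again
```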
Hopefully this can be resolved automatically. If not, may I suggest you allow us to specify multiple `instruction_part` values, so that tuners can add all the "variants" necessary for proper masking after token merges.
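For example, something like this hypothetical extension (the list-valued arguments are not the current API, just an illustration of the idea):

```python
# Hypothetical, not supported today: one entry per tokenization variant
trainer = train_on_responses_only(
    trainer,
    instruction_part = ["### Instruction:\n", "\n### Instruction:\n"],
    response_part = ["### Response:\n", "\n### Response:\n"],
)
```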
Btw, if you want to chat to clarify, I am @concedo on the Unsloth Discord; that might be faster.
Trying to finetune Mistral Small; however, I see this warning:

[warning screenshot]

And the sanity check fails: it returns an empty string.