ml-explore / mlx-examples

Examples in the MLX framework
MIT License

EOS Token Warning #1002

mike-schiller opened 1 day ago

mike-schiller commented 1 day ago

I’m having an issue fine-tuning Meta-Llama-3.1-8B-Instruct. Starting with that model downloaded from Hugging Face, I created some dummy training data just to verify that I can successfully fine-tune the model. I have a train.jsonl file and a valid.jsonl file, both with lines in the following format. I believe this is the format I need to provide for fine-tuning a Llama Instruct model, but please correct me if I’m wrong.

(I am not certain whether the \n\n characters are actually supposed to be in the text, but I get similar results either way.)

{"text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a helpful chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>Tell me about Player_1.<|eot_id|><|start_header_id|>assistant<|end_header_id|>Player_1 is a Shooting Guard for the Crimson Falcons. Known for their unique playing style, they have become a key figure in the Crimson Falcons, excelling in both offense and defense. Their contributions have led to multiple wins this season.<|eot_id|>"}

I run training with the command: % python3 -m mlx_lm.lora --train --model ./Meta-Llama-3.1-8B-Instruct --data fictional-basketball-data/llama/ --batch-size 4 --lora-layers 16 --iters 1000

And get the output below (yeah, I need to install a newer version of Python):

/Users/mike/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
Loading pretrained model
Loading datasets
Training
Trainable parameters: 0.021% (1.704M/8030.261M)
Starting training..., iters: 1000
[WARNING] Example already has an EOS token appended
[WARNING] Example already has an EOS token appended
[WARNING] Example already has an EOS token appended

Those warnings are quite verbose. Based on my reading of the code and this issue, https://github.com/ml-explore/mlx-examples/issues/900, I don't think anything is actually going wrong, but I've wasted a good bit of time confirming that. If this is just a warning, it seems like it would make sense to implement the suggestion in that issue. It's discouraging to kick off a training run you know will take 20 minutes, see hundreds of warnings, and not know whether you're wasting your time. At a minimum, the text should be [INFO]..., [DEBUG]..., or [TRACE]..., since each of those levels better matches the intent of that print() than a warning does; something like the sketch below is what I have in mind.
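
(A hypothetical sketch of that suggestion, not the actual mlx_lm code; the function and flag names are made up. It logs the notice once at debug level instead of printing it for every example.)

import logging

logger = logging.getLogger("mlx_lm")
_eos_notice_shown = False  # module-level flag so the notice is emitted only once

def append_eos(tokens, eos_token_id):
    # Hypothetical version of the EOS-appending step in the data loader.
    global _eos_notice_shown
    if tokens and tokens[-1] == eos_token_id:
        if not _eos_notice_shown:
            logger.debug("Example already has an EOS token appended")
            _eos_notice_shown = True
        return tokens  # already terminated; nothing to append
    return tokens + [eos_token_id]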

Finally, is there an example somewhere of fine-tuning Llama 3.1 8B Instruct with mlx using JSONL data? I'm having trouble determining whether the format shown in the JSONL above is correct.

Thanks!

awni commented 21 hours ago

Your data looks fine, except it's better not to add the final EOS token yourself, as MLX LM will do it automatically. E.g. your example ends with:

... multiple wins this season.<|eot_id|>

You could remove that final <|eot_id|> to get rid of the warning.
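
E.g. a quick one-off rewrite like this (a sketch; filenames are from your example, adjust as needed):

import json

EOT = "<|eot_id|>"

with open("train.jsonl") as src, open("train-stripped.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        text = record["text"]
        # Strip only the final <|eot_id|>; the ones between turns stay.
        if text.endswith(EOT):
            record["text"] = text[: -len(EOT)]
        dst.write(json.dumps(record) + "\n")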

The behavior is correct even if you include the <|eot_id|> (we won't append a second one in that case; we warn just to make sure it was intentional). Still, I think it's probably safe to remove that warning. I'm not sure it's adding much value at this point and it's mostly just causing confusion... sorry for that.

I'll send a PR with it tomorrow.