hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Feature suggestion: cutoff_len could optionally drop too-long examples from the dataset. #3995

Open s4s0l opened 5 months ago

s4s0l commented 5 months ago

Sorry if this was discussed somewhere, but it's hard to search the issues for `cutoff_len` since it appears everywhere in the logs :/

Currently, setting a `cutoff_len` (at least for SFT) trims over-long training examples using the `infer_max_len` function. It's a nice trick that takes into account that an example is a prompt/answer pair and tries to cut the example in a way that does not cut off the whole answer. But it would be nice to have an option to not trim anything and simply exclude too-long examples. Trimming can damage an example in a way that corrupts the whole fine-tuning process, especially for reasoning or math tasks, where a trimmed example may lose its original intent or even become simply invalid.
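As a workaround I currently pre-filter the dataset myself before handing it to LLaMA-Factory. A minimal sketch of that step, assuming an alpaca-style JSON file and a `cutoff_len` of 1024 (the file names, model name, and length heuristic are from my setup, not anything in the repo):

```python
# Pre-filtering sketch: drop examples whose tokenized length exceeds
# cutoff_len instead of letting the trainer trim them.
# Assumes an alpaca-style dataset (instruction/input/output fields) and
# the same tokenizer that will be used for training -- adjust as needed.
import json
from transformers import AutoTokenizer

CUTOFF_LEN = 1024  # should match the cutoff_len passed to LLaMA-Factory
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def total_length(example: dict) -> int:
    # Rough estimate: prompt plus response, ignoring template tokens,
    # so keep a safety margin below the real cutoff_len.
    prompt = example["instruction"] + "\n" + example.get("input", "")
    response = example["output"]
    return len(tokenizer(prompt)["input_ids"]) + len(tokenizer(response)["input_ids"])

with open("data.json") as f:
    data = json.load(f)

kept = [ex for ex in data if total_length(ex) <= CUTOFF_LEN]
print(f"kept {len(kept)} / {len(data)} examples")

with open("data_filtered.json", "w") as f:
    json.dump(kept, f, ensure_ascii=False, indent=2)
```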

Am I missing some setting or tool?

While hacking around I noticed that `Template` cannot control this behaviour: although `encode_multiturn` returns a list, it cannot return an empty or truncated one at that point, so the change is not local enough for my Python skills / knowledge of the codebase to prepare a proper PR. My apologies :/
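For reference, here is the behaviour I am asking for, sketched as a standalone function over the `(source_ids, target_ids)` pairs that `encode_multiturn` produces. The function name and signature are hypothetical, not the actual LLaMA-Factory API:

```python
from typing import List, Optional, Tuple

def drop_if_too_long(
    pairs: List[Tuple[List[int], List[int]]], cutoff_len: int
) -> Optional[List[Tuple[List[int], List[int]]]]:
    """Return the encoded turns unchanged if they fit within cutoff_len,
    otherwise None, signalling the caller to skip the example entirely
    (instead of trimming it with something like infer_max_len)."""
    total = sum(len(src) + len(tgt) for src, tgt in pairs)
    return pairs if total <= cutoff_len else None
```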

johnmai-dev commented 5 months ago

+1

zhangch-ss commented 2 weeks ago

+1