Open ujjawalmadan opened 8 months ago
Hi @ujjawalmadan thanks for the detailed error report! I think this is related to a bug in the logging of TRL when packing=True
which displays the (incorrect) total number of steps with reference to the original dataset, not the chunked one.
There's a PR here to fix that: https://github.com/huggingface/trl/pull/979
In general, I think your training runs are fine, it's just a logging issue
Thank you! I just realized this morning that may have been the issue and saw your post. Thanks!
Has anyone else experienced cases where the training finishes early as max length increases?
Ran this script on a custom dataset with the following config. No CUDA errors. It just moved to evaluation before it should have. Also running on a 8XA100 cluster (40GB).
Got this result:
Even after lowering it to 4096 tokens, it still ended early, but this time after 20% When running on the default dataset, same thing occurred but this time at 33%.
Thoughts?