tdrussell / qlora-pipe

A pipeline parallel training script for LLMs.

DistributedBatchSampler crashes on small eval sample sizes #2

Closed accupham closed 5 months ago

accupham commented 7 months ago

If you set the eval multiplier to something small enough that there are too few eval samples, global_batches can be empty before the "sort by largest" swap here: https://github.com/tdrussell/qlora-pipe/blob/main/dataloader.py#L120, causing an IndexError.
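
For illustration, a minimal sketch of the failure mode; the names below are hypothetical and not the actual dataloader.py code:

```python
# Hypothetical sketch of the failure mode (illustrative names, not the
# actual qlora-pipe code in dataloader.py).
def swap_largest_batch_to_front(global_batches):
    # Find the index of the largest global batch.
    largest_idx = 0
    for i, batch in enumerate(global_batches):
        if len(batch) > len(global_batches[largest_idx]):
            largest_idx = i
    # If global_batches is empty (e.g. a tiny eval split), indexing [0]
    # here raises IndexError.
    global_batches[0], global_batches[largest_idx] = (
        global_batches[largest_idx],
        global_batches[0],
    )
    return global_batches

swap_largest_batch_to_front([[1, 2], [3, 4, 5]])  # fine
swap_largest_batch_to_front([])                   # IndexError
```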

tdrussell commented 5 months ago

As of commit 20aeadd this should be fixed in all cases, for all combinations of hyperparameters. The only remaining caveat is that there may now be one batch that is much smaller than all the others. This doesn't matter for eval, but for training it means the gradient for that one step is based on far fewer tokens than the other steps. In theory that might cause instability, but for a single step I don't think it matters.
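
A defensive version of that step might simply skip the swap when there is nothing to swap and tolerate one final batch that is smaller than the rest. This is only a sketch of the behavior described, not the actual code from commit 20aeadd:

```python
# Sketch of the guarded behavior described above (not the actual code
# from commit 20aeadd).
def swap_largest_batch_to_front_safe(global_batches):
    if not global_batches:
        return global_batches  # nothing to sort; avoids the IndexError
    largest_idx = max(
        range(len(global_batches)),
        key=lambda i: len(global_batches[i]),
    )
    global_batches[0], global_batches[largest_idx] = (
        global_batches[largest_idx],
        global_batches[0],
    )
    # One batch may still contain fewer samples than the others; for eval
    # this is harmless, and for training it only affects a single step.
    return global_batches
```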

I reworked the code quite a bit, so let me know if anything breaks or doesn't work as expected.