As of commit 20aeadd this should be fixed in all cases, for all combinations of hyperparameters. The only remaining caveat is that there might now be one batch that is much smaller than all the others. This doesn't matter for eval, but for training it means the gradient for that single step is based on far fewer tokens than the other steps. In theory that could cause instability, but for just one step I don't think it matters.
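For illustration only, here is a minimal sketch (hypothetical, not the actual qlora-pipe dataloader) of why chunking a fixed number of samples into fixed-size batches can leave one remainder batch that is much smaller than the rest:

```python
# Hypothetical sketch: splitting N samples into fixed-size batches leaves a
# final remainder batch that can be much smaller than the others.

def chunk_into_batches(samples, batch_size):
    """Split samples into consecutive batches of batch_size; the last
    batch holds whatever is left over and may be much smaller."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

batches = chunk_into_batches(list(range(1000)), 64)
print([len(b) for b in batches[-3:]])  # [64, 64, 40] -- the final batch is smaller
```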
I reworked the code quite a bit, so let me know if anything breaks or doesn't work as expected.
If you set the eval multiplier to something small, such that there are too few eval samples, sometimes `global_batches` is empty before the "sort by largest" swap here: https://github.com/tdrussell/qlora-pipe/blob/main/dataloader.py#L120, causing an index error.
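For reference, a guard along these lines would avoid the IndexError when the eval split produces no batches. This is a hypothetical sketch, not a patch against the real dataloader.py; only the name `global_batches` and the "sort by largest" swap come from the issue, everything else is illustrative:

```python
# Hypothetical sketch of the failure mode and a possible guard.

def swap_largest_batch_first(global_batches):
    """Move the largest global batch to index 0 (the 'sort by largest' swap).
    If global_batches is empty, indexing [0] raises the IndexError described
    above, so bail out early instead."""
    if not global_batches:  # too few eval samples -> nothing to swap
        return global_batches
    largest_idx = max(range(len(global_batches)),
                      key=lambda i: len(global_batches[i]))
    global_batches[0], global_batches[largest_idx] = (
        global_batches[largest_idx], global_batches[0])
    return global_batches
```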