Open bbrowning opened 4 weeks ago
I added a test and a potential fix for this on my fork of this repo at https://github.com/bbrowning/instructlab-training/commit/76127179df5babe6804d0fc11975f43681fe4176 . However, given the incoming refactoring I see in the ap/fix_multipack_plus_dolomite_saving branch, I'm not opening that fix as a PR yet. If that branch is not expected to merge any time soon, we might want a more tactical PR to fix this issue, since others may start to hit it as they test the 0.18.x releases.
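A tactical fix could be as simple as clamping the effective batch size to the dataset size before the sampler runs, so at least one batch is always fillable. This is an illustrative sketch, not code from the repo; the function name and parameters are hypothetical:

```python
def effective_batch_size_for(dataset_len: int, requested: int) -> int:
    """Hypothetical guard: clamp the requested effective batch size so the
    multipack sampler can always fill at least one batch, even when the
    dataset is smaller than the configured effective batch size."""
    return min(requested, max(dataset_len, 1))

# A 10-sample dataset with a requested effective batch size of 3840
# would be clamped down to 10 instead of starving the sampler.
print(effective_batch_size_for(10, 3840))
```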
PR #169 should be merged as soon as it has been tested and the outstanding linting/sign-off checks are corrected.
When attempting to run training on a small dataset (smaller than the default effective batch size), the training code gets into an infinite loop at https://github.com/instructlab/training/blob/9e2ac746a877e2b2ed6ff2ba54a04c1a22dadb84/src/instructlab/training/multipack_sampler.py#L137 until it overflows integer values and crashes.
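The runaway-loop failure mode can be illustrated with a minimal sketch. This is not the library's actual code; the function names, the feasibility check, and the numbers are hypothetical stand-ins for a search whose success condition can never be met when the dataset is smaller than the effective batch size:

```python
def can_fill_batches(num_samples: int, effective_batch_size: int) -> bool:
    # Simplified stand-in for the real packing feasibility check: with fewer
    # samples than one effective batch, no candidate length is ever "enough".
    return num_samples >= effective_batch_size

def find_packing_length(num_samples: int, effective_batch_size: int,
                        max_iters: int = 64) -> int:
    """Doubling search for a feasible packing length. When the check can
    never pass, the bound grows without limit; in a fixed-width integer
    implementation this eventually overflows and crashes."""
    bound = 1
    for _ in range(max_iters):
        if can_fill_batches(num_samples, effective_batch_size):
            return bound
        bound *= 2  # doubles forever when the dataset is too small
    return bound

# 100 samples vs. a (hypothetical) effective batch size of 4000: the check
# never succeeds, so after 64 doublings the bound is already 2**64.
print(find_packing_length(100, 4000))
```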
I initially triggered this by doing an
ilab data generate
(accepting all defaults), followed by
ilab model train --strategy lab-multiphase --phased-phase1-data ... --phased-phase2-data ...
with the CPU-only training profile (on a GPU-accelerated NVIDIA host, but not a massive GPU), using ilab 0.18.0.rc7.