instructlab / training

InstructLab Training Library

Infinite loop until int overflow when calculating batch sizes for small datasets and using padding #170

Open · bbrowning opened 4 weeks ago

bbrowning commented 4 weeks ago

When attempting to run training on a small dataset (smaller than the default effective batch size), the training code gets into an infinite loop at https://github.com/instructlab/training/blob/9e2ac746a877e2b2ed6ff2ba54a04c1a22dadb84/src/instructlab/training/multipack_sampler.py#L137 until it overflows integer values and crashes.
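
To make the failure mode concrete, here is a minimal, self-contained sketch of how a batch-size search like this can fail to terminate. The names (`estimate_padded_batch_len`, `find_grad_accum`) and the padding estimate are hypothetical illustrations, not the library's actual code; the real loop lives at the line linked above. The key assumption in the sketch is that with padding, the per-GPU batch length estimate is driven by the longest sample, so once the dataset is smaller than the effective batch size the estimate can bottom out above the cap and the search never converges:

```python
# Hypothetical reconstruction of the failure mode; not the library's code.

def estimate_padded_batch_len(lengths: list[int], samples_per_gpu: float) -> int:
    # With padding, every sample in a micro-batch is padded to the longest
    # sample, so the estimate is driven by max(lengths). For a tiny dataset
    # at least one whole sample always lands on each GPU, so the estimate
    # bottoms out at max(lengths) and stops shrinking.
    return max(lengths) * max(1, round(samples_per_gpu))

def find_grad_accum(lengths, effective_batch_size, num_gpus,
                    max_batch_len_per_gpu, max_iters=10_000):
    grad_accum = 0
    packing_max_batch_len = max_batch_len_per_gpu + 1
    while packing_max_batch_len > max_batch_len_per_gpu:
        grad_accum += 1
        # Guard added for this sketch only; the report above describes the
        # real code spinning unguarded until integer values overflow.
        if grad_accum > max_iters:
            raise RuntimeError("batch-size search did not converge: "
                               "dataset smaller than effective batch size?")
        samples_per_gpu = effective_batch_size / grad_accum / num_gpus
        packing_max_batch_len = estimate_padded_batch_len(lengths, samples_per_gpu)
    return grad_accum, packing_max_batch_len

# A dataset far smaller than the effective batch size, with one long sample:
try:
    find_grad_accum(lengths=[6000, 30, 40], effective_batch_size=3840,
                    num_gpus=1, max_batch_len_per_gpu=5000)
except RuntimeError as e:
    print(e)
```

In this sketch the estimate can never drop below the cap, so `grad_accum` grows without bound; the guard turns that into a fast, explicit error rather than an eventual overflow.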

I initially triggered this by doing an ilab data generate (accepting all defaults), followed by ilab model train --strategy lab-multiphase --phased-phase1-data ... --phased-phase2-data ... with the CPU-only training profile (on a GPU-accelerated NVIDIA host, but not a massive GPU) using ilab 0.18.0.rc7. The exact commands are shown below.
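
For reference, those reproduction steps as shell commands, with the data paths elided exactly as above:

```bash
ilab data generate
ilab model train --strategy lab-multiphase \
    --phased-phase1-data ... \
    --phased-phase2-data ...
```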

bbrowning commented 4 weeks ago

I added a test and a potential fix for this on my fork of this repo at https://github.com/bbrowning/instructlab-training/commit/76127179df5babe6804d0fc11975f43681fe4176 . However, given the incoming refactoring I see on the ap/fix_multipack_plus_dolomite_saving branch, I'm not actually opening that fix as a PR yet. If that branch is not expected to merge any time soon, then we might want a more tactical PR to fix this issue if others start to hit it as they test out 0.18.x releases.
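
For anyone who hits this before a fix merges, one tactical shape for a guard (an assumption about what a fix could look like, not what the commit linked above actually does) is to clamp the requested effective batch size to the dataset size before the batch-size search runs, so the loop has a feasible answer:

```python
def clamp_effective_batch_size(effective_batch_size: int, dataset_len: int) -> int:
    """Hypothetical helper, not the change in the linked commit.

    If the dataset holds fewer samples than the requested effective batch
    size, no gradient-accumulation value can satisfy the search, so clamp
    the target down to what the dataset can actually supply.
    """
    return min(effective_batch_size, dataset_len)
```

An alternative with the same effect is an explicit iteration cap inside the loop that raises a clear error instead of spinning until overflow.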

Maxusmusti commented 3 weeks ago

PR #169 should be merged as soon as it has been tested and the outstanding linting and sign-off checks are resolved