instructlab / training

InstructLab Training Library

Divide by zero error when recalculating save_samples #145

Open RobotSail opened 1 month ago

RobotSail commented 1 month ago

In the current train function, we recompute save_samples by rounding it down to the nearest multiple of batch_size. For instance, we have the following block of code in main_ds.py:

    batch_size = args.effective_batch_size // grad_accum
    args.save_samples = (args.save_samples // batch_size) * batch_size
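
As a concrete illustration, the floor division drops save_samples to zero whenever the requested value is smaller than the derived batch size. The numbers below are hypothetical, not taken from a real run:

    # Hypothetical values for illustration only; the real ones come from CLI args.
    effective_batch_size = 128
    grad_accum = 2
    save_samples = 50  # requested save interval, smaller than the derived batch size

    batch_size = effective_batch_size // grad_accum            # 64
    save_samples = (save_samples // batch_size) * batch_size   # 50 // 64 == 0, so save_samples becomes 0

    global_step = 1
    try:
        if global_step * batch_size % save_samples == 0:  # integer modulo by zero
            print("would save a checkpoint here")
    except ZeroDivisionError as err:
        print(f"reproduced: {err}")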

But when save_samples is small enough that save_samples < batch_size, it is reassigned to 0, which leads to a ZeroDivisionError during the checkpoint-saving check, e.g.:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ec2-user/olegs-super-secret-directory/instructlab/venv/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 796, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/ec2-user/olegs-super-secret-directory/instructlab/venv/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 572, in main
[rank0]:     train(args, model, tokenizer, train_loader, grad_accum, metric_logger)
[rank0]:   File "/home/ec2-user/olegs-super-secret-directory/instructlab/venv/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 451, in train
[rank0]:     if global_step * batch_size % args.save_samples == 0:
[rank0]:        ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
[rank0]: ZeroDivisionError: integer modulo by zero

Potential solutions:

To ensure that save_samples never becomes zero, we can put a lower bound on the recalculated value:

    args.save_samples = max((args.save_samples // batch_size) * batch_size, 1)
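
A minimal sketch of how the clamped value behaves in the save check; variable names mirror main_ds.py, but the values are hypothetical:

    # Hypothetical values for illustration only.
    effective_batch_size = 128
    grad_accum = 2
    requested_save_samples = 50

    batch_size = effective_batch_size // grad_accum  # 64
    save_samples = max((requested_save_samples // batch_size) * batch_size, 1)  # floor of 1 instead of 0

    for global_step in range(1, 4):
        # No ZeroDivisionError; note that with save_samples == 1 the modulo
        # condition is true at every step, so a checkpoint would be written
        # on each iteration when the requested value is below batch_size.
        if global_step * batch_size % save_samples == 0:
            print(f"step {global_step}: would save a checkpoint")
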
Maxusmusti commented 1 month ago

Copying from other thread: