In the current `train` function, we recompute `save_samples` by normalizing it to the batch size. For instance, we have the following block of code in `main_ds.py`:
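The normalization in question looks roughly like this (a paraphrased sketch based on the traceback; the exact expression in `main_ds.py` may differ):

```python
# Round save_samples down to the nearest multiple of the effective batch size
# so checkpoints land on batch boundaries (sketch; variable names assumed from the traceback).
args.save_samples = (args.save_samples // batch_size) * batch_size
```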
But when `save_samples` is small enough that `save_samples < batch_size`, the normalization reassigns it to 0 (for example, `save_samples=32` with `batch_size=128` floors to 0), which leads to a `ZeroDivisionError` during the checkpoint-saving check, e.g.:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/ec2-user/olegs-super-secret-directory/instructlab/venv/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 796, in <module>
[rank0]: main(args)
[rank0]: File "/home/ec2-user/olegs-super-secret-directory/instructlab/venv/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 572, in main
[rank0]: train(args, model, tokenizer, train_loader, grad_accum, metric_logger)
[rank0]: File "/home/ec2-user/olegs-super-secret-directory/instructlab/venv/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 451, in train
[rank0]: if global_step * batch_size % args.save_samples == 0:
[rank0]: ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
[rank0]: ZeroDivisionError: integer modulo by zero
Potential solutions:
To ensure that `save_samples` never ends up as zero, we can enforce a lower bound on it when normalizing, for example:
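A minimal sketch, assuming the same `args.save_samples` and `batch_size` names as above (not an actual patch):

```python
# Keep the batch-boundary normalization, but never let save_samples drop below
# one batch, so the modulo check in train() can't divide by zero.
args.save_samples = max(batch_size, (args.save_samples // batch_size) * batch_size)
```

With this guard, the check `global_step * batch_size % args.save_samples == 0` always has a positive divisor; when the requested `save_samples` is smaller than one batch, a checkpoint is simply saved at every step.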