My `knowledge_train_msgs` jsonl file has about 560 samples in it, and my `skills_train_msgs` jsonl file has about 1700 samples in it. I'm using smaller hardware, hence lower batch sizes (also to work around #170), and I'm using LoRA. So this isn't using the padding-free/granite model code paths, since those aren't compatible with LoRA.
Phase 1 of training and eval works fine, but when I get into phase 2 of training I'm seeing a ZeroDivisionError:
```
Generating train split: 1714 examples [00:00, 2711.04 examples/s]
Data length calculation: 100%|██████████| 1714/1714 [00:01<00:00, 926.85it/s]
Data length calculation: 100%|██████████| 1714/1714 [00:01<00:00, 931.10it/s]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/main_ds.py", line 834, in <module>
[rank1]:     main(args)
[rank1]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/main_ds.py", line 522, in main
[rank1]:     packing_max_batch_len, grad_accum = find_packing_max_batch_len_and_grad_accum(
[rank1]:                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/multipack_sampler.py", line 213, in find_packing_max_batch_len_and_grad_accum
[rank1]:     addition = find_padding_max_batch_len_addition(
[rank1]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/multipack_sampler.py", line 138, in find_padding_max_batch_len_addition
[rank1]:     avg_ebs = simulate_buckets(
[rank1]:               ^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/multipack_sampler.py", line 110, in simulate_buckets
[rank1]:     avg_ebs = len(dataset) / len(simulation_loader)
[rank1]:               ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
[rank1]: ZeroDivisionError: division by zero
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/main_ds.py", line 834, in <module>
[rank0]:     main(args)
[rank0]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/main_ds.py", line 522, in main
[rank0]:     packing_max_batch_len, grad_accum = find_packing_max_batch_len_and_grad_accum(
[rank0]:                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/multipack_sampler.py", line 213, in find_packing_max_batch_len_and_grad_accum
[rank0]:     addition = find_padding_max_batch_len_addition(
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/multipack_sampler.py", line 138, in find_padding_max_batch_len_addition
[rank0]:     avg_ebs = simulate_buckets(
[rank0]:               ^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/multipack_sampler.py", line 110, in simulate_buckets
[rank0]:     avg_ebs = len(dataset) / len(simulation_loader)
[rank0]:               ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
[rank0]: ZeroDivisionError: division by zero
W0821 21:23:04.443000 140021192968000 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 6462 closing signal SIGTERM
E0821 21:23:04.720000 140021192968000 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 6463) of binary: /usr/bin/python3.11
```
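For anyone hitting this: the division fails because the simulated packing loader produced zero batches, so `len(simulation_loader)` is 0. The sketch below is my guess at the failure mode, not the actual `multipack_sampler` code — `simulate_buckets` here is a simplified first-fit stand-in, and the lengths/limits are made-up numbers. The point is just that with a small enough `max_batch_len` (which low batch sizes on small hardware can produce), every sample can fail to fit and the simulation yields an empty loader:

```python
def simulate_buckets(lengths, max_batch_len):
    """Toy first-fit packing simulation (NOT the real multipack_sampler logic).

    Packs sample token lengths into batches whose total length stays
    under max_batch_len, dropping samples that can never fit.
    """
    batches = []
    current, current_len = [], 0
    for n in lengths:
        if n > max_batch_len:
            # Sample is longer than the whole batch budget: silently dropped.
            continue
        if current_len + n > max_batch_len:
            batches.append(current)
            current, current_len = [], 0
        current.append(n)
        current_len += n
    if current:
        batches.append(current)
    return batches


# Hypothetical per-sample token counts; all exceed the simulated budget,
# so the "loader" ends up with zero batches.
lengths = [4096, 5000, 6000]
batches = simulate_buckets(lengths, max_batch_len=2048)
print(len(batches))  # 0 -> len(dataset) / len(batches) would raise ZeroDivisionError
```

A guard like `if len(simulation_loader) == 0: raise ValueError(...)` (or falling back to a non-packing path) would at least turn this into an actionable error instead of a bare `ZeroDivisionError`.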
This may be fixed by #169 as that changes lots of the code here, but I wanted to document the issue here in case it's not or if others run into the same issue.
My training command: