instructlab / training

InstructLab Training Library

ZeroDivisionError from multipack_sampler in phase 2 of training #177

Open bbrowning opened 3 weeks ago

bbrowning commented 3 weeks ago

My training command:

ilab model train \
  --strategy lab-multiphase \
  --enable-serving-output \
  --phased-phase1-data /instructlab/data/instructlab/datasets/knowledge_train_msgs_2024-08-21T19_24_42.jsonl \
  --phased-phase1-num-epochs 1 \
  --phased-phase2-data /instructlab/data/instructlab/datasets/skills_train_msgs_2024-08-21T19_24_42.jsonl \
  --phased-phase2-num-epochs 1 \
  --effective-batch-size 500 \
  --max-batch-len 4096 \
  --lora-rank 8 \
  --gpus 2

My knowledge_train_msgs jsonl file has about 560 samples, and my skills_train_msgs jsonl file has about 1700 samples. I'm running on smaller hardware, hence the lower batch sizes (also to work around #170) and the use of LoRA. So this isn't exercising the padding-free/granite model code paths, since those can't be used with LoRA.
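For reference, a rough back-of-the-envelope of what this configuration asks of the batch packing step. This is not the formula the library uses, and the average sample length below is an assumption of mine, not something measured from these datasets:

# Back-of-the-envelope only; not the library's actual grad_accum calculation.
effective_batch_size = 500   # --effective-batch-size
num_gpus = 2                 # --gpus
max_batch_len = 4096         # --max-batch-len (tokens per micro-batch per rank)
avg_sample_len = 1000        # assumed average tokens per sample (not measured)

samples_per_micro_batch = max_batch_len // avg_sample_len                 # ~4 samples
samples_per_rank_per_step = effective_batch_size / num_gpus               # 250 samples
grad_accum_needed = samples_per_rank_per_step / samples_per_micro_batch   # ~62.5 steps
print(samples_per_micro_batch, samples_per_rank_per_step, grad_accum_needed)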

Phase 1 training and eval work fine, but when I get into phase 2 of training I hit a ZeroDivisionError:

Generating train split: 1714 examples [00:00, 2711.04 examples/s]
Data length calculation: 100%|██████████| 1714/1714 [00:01<00:00, 926.85it/s] 
Data length calculation: 100%|██████████| 1714/1714 [00:01<00:00, 931.10it/s]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/main_ds.py", line 834, in <module>
[rank1]:     main(args)
[rank1]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/main_ds.py", line 522, in main
[rank1]:     packing_max_batch_len, grad_accum = find_packing_max_batch_len_and_grad_accum(
[rank1]:                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/multipack_sampler.py", line 213, in find_packing_max_batch_len_and_grad_accum
[rank1]:     addition = find_padding_max_batch_len_addition(
[rank1]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/multipack_sampler.py", line 138, in find_padding_max_batch_len_addition
[rank1]:     avg_ebs = simulate_buckets(
[rank1]:               ^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/multipack_sampler.py", line 110, in simulate_buckets
[rank1]:     avg_ebs = len(dataset) / len(simulation_loader)
[rank1]:               ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
[rank1]: ZeroDivisionError: division by zero
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/main_ds.py", line 834, in <module>
[rank0]:     main(args)
[rank0]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/main_ds.py", line 522, in main
[rank0]:     packing_max_batch_len, grad_accum = find_packing_max_batch_len_and_grad_accum(
[rank0]:                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/multipack_sampler.py", line 213, in find_packing_max_batch_len_and_grad_accum
[rank0]:     addition = find_padding_max_batch_len_addition(
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/multipack_sampler.py", line 138, in find_padding_max_batch_len_addition
[rank0]:     avg_ebs = simulate_buckets(
[rank0]:               ^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/site-packages/instructlab/training/multipack_sampler.py", line 110, in simulate_buckets
[rank0]:     avg_ebs = len(dataset) / len(simulation_loader)
[rank0]:               ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
[rank0]: ZeroDivisionError: division by zero
W0821 21:23:04.443000 140021192968000 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 6462 closing signal SIGTERM
E0821 21:23:04.720000 140021192968000 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 6463) of binary: /usr/bin/python3.11
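For anyone digging into this: the division that fails is avg_ebs = len(dataset) / len(simulation_loader), so the simulated loader must be coming back with zero batches. Below is a minimal, simplified sketch of that failure mode; the greedy first-fit packer and the token-budget numbers are my own illustration, not the actual multipack_sampler implementation:

# Simplified sketch only: a first-fit packer standing in for the bucket
# simulation in multipack_sampler.py. Lengths and budget are made up.
def pack_into_buckets(lengths, token_budget):
    """Greedily pack sample token counts into buckets of at most token_budget."""
    buckets = []
    for n in lengths:
        if n > token_budget:
            continue  # a sample longer than the budget never fits any bucket
        for bucket in buckets:
            if sum(bucket) + n <= token_budget:
                bucket.append(n)
                break
        else:
            buckets.append([n])
    return buckets

sample_lengths = [3000, 3500, 4200]  # hypothetical per-sample token counts
per_rank_budget = 2048               # hypothetical per-rank token budget

batches = pack_into_buckets(sample_lengths, per_rank_budget)
print(len(batches))                  # 0 -- no sample fits, so the "loader" is empty

# Mirrors multipack_sampler.py line 110 and raises ZeroDivisionError:
avg_ebs = len(sample_lengths) / len(batches)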

This may be fixed by #169, since that PR reworks a lot of this code, but I wanted to document the issue in case it isn't, or in case others run into the same error.

Maxusmusti commented 3 weeks ago

Yes, that should also be resolved by #169