Hi! From your log I could not see the root cause of the NCCL timeout. Is this error reproducible? A worker may have been killed, or the NCCL connection may have been disrupted somehow. We can first check your NCCL config; here are some ways to verify that it is correct:

1. Run the official NCCL `all_reduce_perf` benchmark.
2. Try the Hugging Face multi-GPU debug script.
3. If both tests above pass, export `NCCL_DEBUG=INFO` and rerun the distributed training using our official example, and see whether the NCCL communication info reports any error or warning; you can paste the NCCL info back here for me to double-check.
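Along the same lines, a minimal sanity check you can run is a bare `all_reduce` over NCCL (this is a sketch of my own, not an official script; it assumes a `torchrun` launcher and the 4 GPUs on a g5.12xlarge):

```python
# Minimal NCCL sanity check. Run with:
#   NCCL_DEBUG=INFO torchrun --nproc_per_node=4 nccl_check.py
# Each rank contributes a tensor of ones; after all_reduce every rank
# should print the world size (4 on a g5.12xlarge).
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    t = torch.ones(1, device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: all_reduce sum = {t.item()}")

    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If this hangs or crashes, the problem is in the NCCL/driver setup rather than in the training script.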
System Info
AWS ml.g5.12xlarge instance with 4x A10G GPUs, PyTorch 2.3.1, CUDA 12.1
The dataset is modified because I pre-tokenized everything ahead of time, to avoid spending paid GPU-instance time on tokenization; it is available at https://huggingface.co/datasets/BaiqingL/pokemon-rag-llama-3-tokenized
The tokenizer has been modified in the following way:
The rest of the training script contains the corresponding embedding resize.
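As a hedged illustration only (the model name and pad token below are assumptions, not taken from this issue), a setup matching that description adds a token to the tokenizer and then resizes the model's embeddings:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical sketch, not the issue's actual code: register a dedicated
# pad token on the Llama-3 tokenizer ...
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.add_special_tokens({"pad_token": "<|pad|>"})

# ... and grow the embedding matrix so it matches the enlarged vocabulary.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model.resize_token_embeddings(len(tokenizer))
```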
The dataset has been modified as follows:
And the validation dataset:
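A minimal sketch of loading that pre-tokenized data from the linked repository, so that no tokenization runs on the GPU instance (the split names "train" and "validation" are assumptions):

```python
from datasets import load_dataset

# Hedged sketch: pull the already-tokenized splits from the dataset repo;
# split names are assumptions, not confirmed by the issue.
data = load_dataset("BaiqingL/pokemon-rag-llama-3-tokenized")
train_dataset = data["train"]
val_dataset = data["validation"]
```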
Information
🐛 Describe the bug
After the final step of training, presumably during model saving, the process crashes and all of that training time is wasted. Command executed:
Error logs
Expected behavior
Save the model
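One possible mitigation worth trying (an assumption on my side, not confirmed for this issue): if the final checkpoint save on one rank takes longer than the default collective timeout, the other ranks can hit an NCCL timeout while waiting on it, so raising the timeout when the process group is initialized may give the save enough headroom:

```python
import datetime
import torch.distributed as dist

# Hedged sketch: give collectives more headroom so ranks waiting on a long
# final model save do not trip the NCCL watchdog; 60 minutes is arbitrary.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(minutes=60))
```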