princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License

Training hangs after "Building trainer" #46

Open coderchem opened 8 months ago

coderchem commented 8 months ago

Hi, I'm using the sample test set and trying to run through the README, but training hangs and then times out:

```
[batch=23/3200]:
Train time/batch: 22
Train time/sample: 198
Train time/batch_in_epoch: 6
Train time/sample_in_epoch: 54
Train time/token: 811008
Train time/token_in_epoch: 221184
Train metrics/train/cc_weight: 0.6700
Train metrics/train/github_weight: 0.0450
Train metrics/train/book_weight: 0.0450
Train metrics/train/stackexchange_weight: 0.0200
Train metrics/train/wiki_weight: 0.0450
Train metrics/train/arxiv_weight: 0.0250
Train metrics/train/c4-rp_weight: 0.1500
Train memory/current_allocated_mem: 36.8820
Train memory/current_active_mem: 36.8820
Train memory/current_inactive_mem: 0.1744
Train memory/current_reserved_mem: 55.9060
Train memory/peak_allocated_mem: 42.9380
Train memory/peak_active_mem: 42.9380
Train memory/peak_inactive_mem: 7.8742
Train memory/peak_reserved_mem: 55.9060
Train memory/alloc_retries: 0
Train metrics/train/expected_head_sparsity: 0.0039
Train metrics/train/target_head_sparsity: 0.0129
Train metrics/train/expected_intermediate_sparsity: 0.0039
Train metrics/train/target_intermediate_sparsity: 0.0128
Train metrics/train/expected_layer_sparsity: 0.0039
Train metrics/train/target_layer_sparsity: 0.0000
Train metrics/train/expected_hidden_sparsity: 0.0039
Train metrics/train/target_hidden_sparsity: 0.0129
Train metrics/train/expected_sparsity: 0.0117
Train metrics/train/target_sparsity: 0.0209
Train trainer/device_train_microbatch_size: 3
Train loss/train/total: 1.4801
Train loss/train/ce_loss: 1.4716
Train loss/train/lag_loss: 0.0085
Train metrics/train/LanguageCrossEntropy: 1.4716
Train metrics/train/Perplexity: 4.3561
Train metrics/train/cc_LanguageCrossEntropy: 1.1558
Train metrics/train/cc_count: 65
Train metrics/train/github_LanguageCrossEntropy: nan
Train metrics/train/github_count: 7
Train metrics/train/book_LanguageCrossEntropy: nan
Train metrics/train/book_count: 7
Train metrics/train/stackexchange_LanguageCrossEntropy: 2.1491
Train metrics/train/stackexchange_count: 3
Train metrics/train/wiki_LanguageCrossEntropy: 1.5306
Train metrics/train/wiki_count: 8
Train metrics/train/arxiv_LanguageCrossEntropy: nan
Train metrics/train/arxiv_count: 6
Train metrics/train/c4-rp_LanguageCrossEntropy: 1.6471
Train metrics/train/c4-rp_count: 111
Train throughput/batches_per_sec: 0.0914
Train throughput/samples_per_sec: 0.8223
Train throughput/device/batches_per_sec: 0.0305
Train throughput/device/samples_per_sec: 0.2741
Train throughput/tokens_per_sec: 3368.2385
Train throughput/device/tokens_per_sec: 1122.7462
Train throughput/flops_per_sec: 157886485043818.8125
Train throughput/device/flops_per_sec: 52628828347939.6016
Train throughput/device/mfu: 0.1687
Train time/train: 0.0709
Train time/val: 0.0000
Train time/total: 0.0709
Train lr-DecoupledAdamW/group0: 0.0000
Train lr-DecoupledAdamW/group1: 0.0688
Train lr-DecoupledAdamW/group2: -0.0688

[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3777, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1802129 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3777, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1802129 milliseconds before timing out.
```

Forival commented 8 months ago

> (quotes the original report and NCCL timeout log above)

This happens because the sample test set contains very little data: after 23 batches one of the data domains runs out, so training stalls on one GPU while the other ranks wait in the all-gather until the NCCL watchdog times out. You need to process the original RedPajama data to meet the data requirements.
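As a rough sanity check (just a sketch, not part of the repo), you can estimate whether each domain in your processed data has enough sequences for the planned run. The numbers below are taken from the log above where possible (about 9 samples per batch from `Train time/sample: 198` at batch 22, a 3200-batch duration, and the reported domain weights); the per-domain sequence counts are hypothetical placeholders you should replace with the counts from your own processed data, and the estimate assumes the weights stay roughly at the logged values.

```python
# Back-of-the-envelope check: does each domain have enough sequences for the run?
global_batch_size = 9      # ~samples per batch, inferred from the log (198 samples / 22 batches)
total_batches = 3200       # planned duration, from "batch=23/3200" in the log

# Domain sampling weights reported in the log at batch 23.
domain_weights = {
    "cc": 0.67, "github": 0.045, "book": 0.045, "stackexchange": 0.02,
    "wiki": 0.045, "arxiv": 0.025, "c4-rp": 0.15,
}

# HYPOTHETICAL sequence counts for the sample test set; replace with the actual
# number of sequences you produced for each domain.
available_sequences = {
    "cc": 500, "github": 50, "book": 50, "stackexchange": 20,
    "wiki": 60, "arxiv": 40, "c4-rp": 300,
}

for domain, weight in domain_weights.items():
    needed = int(weight * global_batch_size * total_batches)
    have = available_sequences[domain]
    if have >= needed:
        status = "OK"
    else:
        status = f"runs out after ~{have / (weight * global_batch_size):.0f} batches"
    print(f"{domain:>14}: need ~{needed:>6} sequences, have {have:>6} -> {status}")
```

Note that the NCCL timeout in the log is only the symptom: rank 2 is stuck in an all-gather waiting for data that another rank no longer has, so raising the watchdog timeout would not help; the fix is to provide enough data per domain.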