princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License

Train metrics/train/github_LanguageCrossEntropy: nan #31

Closed lippman1125 closed 9 months ago

lippman1125 commented 9 months ago

During the pruning stage of training, Train metrics/train/github_LanguageCrossEntropy is nan. Is that normal?

[batch=189/3200]:
Train time/batch: 188
Train time/sample: 6016
Train time/batch_in_epoch: 188
Train time/sample_in_epoch: 6016
Train time/token: 24641536
Train time/token_in_epoch: 24641536
Train metrics/train/cc_weight: 0.6176
Train metrics/train/github_weight: 0.0408
Train metrics/train/book_weight: 0.0441
Train metrics/train/stackexchange_weight: 0.0168
Train metrics/train/wiki_weight: 0.0861
Train metrics/train/arxiv_weight: 0.0189
Train metrics/train/c4-rp_weight: 0.1757
Train memory/current_allocated_mem: 14.6140
Train memory/current_active_mem: 14.6140
Train memory/current_inactive_mem: 1.9258
Train memory/current_reserved_mem: 43.4220
Train memory/peak_allocated_mem: 28.0710
Train memory/peak_active_mem: 28.0710
Train memory/peak_inactive_mem: 11.7290
Train memory/peak_reserved_mem: 43.4220
Train memory/alloc_retries: 0
Train metrics/train/expected_head_sparsity: 0.0132
Train metrics/train/target_head_sparsity: 0.1102
Train metrics/train/expected_intermediate_sparsity: 0.0057
Train metrics/train/target_intermediate_sparsity: 0.1093
Train metrics/train/expected_layer_sparsity: 0.0039
Train metrics/train/target_layer_sparsity: 0.0000
Train metrics/train/expected_hidden_sparsity: 0.1882
Train metrics/train/target_hidden_sparsity: 0.1102
Train metrics/train/expected_sparsity: 0.1981
Train metrics/train/target_sparsity: 0.1786
Train trainer/device_train_microbatch_size: 4
Train loss/train/total: 3.9353
Train loss/train/ce_loss: 2.3241
Train loss/train/lag_loss: 1.6112
Train metrics/train/LanguageCrossEntropy: 2.3241
Train metrics/train/Perplexity: 10.2176
Train metrics/train/cc_LanguageCrossEntropy: 2.2752
Train metrics/train/cc_count: 3991
Train metrics/train/github_LanguageCrossEntropy: nan
Train metrics/train/github_count: 276
Train metrics/train/book_LanguageCrossEntropy: nan
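For reference, a back-of-envelope reading of the numbers above (this is an assumption on my part: that the *_weight values act as per-batch sampling proportions and that time/sample divided by time/batch gives the global batch size):

```python
# Rough arithmetic from the log above (assumptions noted, not repository code):
samples, batches = 6016, 188          # Train time/sample and Train time/batch
batch_size = samples / batches        # = 32 sequences per global batch
github_weight = 0.0408                # Train metrics/train/github_weight
expected_github_per_batch = batch_size * github_weight
print(batch_size, expected_github_per_batch)  # 32.0  ~1.3
# With only ~1.3 github sequences expected per batch, some batches can
# contain none at all, leaving that domain's per-batch metric undefined.
```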

xiamengzhou commented 9 months ago

Hi! When a domain's weight is very low, it is possible that no data from that domain is sampled in the current batch, so that domain's cross-entropy metric comes out as nan. This is normal.
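To illustrate the mechanism (a minimal sketch, not the repository's actual metric code): a per-domain cross-entropy averages the loss over the tokens belonging to that domain, so an empty domain in the batch yields a 0/0 average, i.e. nan.

```python
# Minimal sketch: per-domain cross-entropy is nan when the batch contains
# no samples from that domain. Names and shapes here are illustrative only.
import torch
import torch.nn.functional as F

def per_domain_ce(logits, labels, domain_ids, domain):
    """Cross-entropy averaged over tokens of one domain; nan if the domain is absent."""
    mask = domain_ids == domain
    if mask.sum() == 0:
        return torch.tensor(float("nan"))  # no tokens from this domain in the batch
    losses = F.cross_entropy(logits[mask], labels[mask], reduction="none")
    return losses.mean()

# Example: a batch with no "github" (id=1) tokens yields nan for that domain.
logits = torch.randn(8, 32000)                  # 8 tokens, vocab size 32000
labels = torch.randint(0, 32000, (8,))
domain_ids = torch.zeros(8, dtype=torch.long)   # all tokens come from domain 0 ("cc")
print(per_domain_ce(logits, labels, domain_ids, domain=1))  # tensor(nan)
```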

lippman1125 commented 9 months ago

Got it, thanks!