princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License
533 stars, 39 forks

Path no use in continue_pretrain.sh #24

Closed: Longyichen closed this issue 9 months ago

Longyichen commented 9 months ago

https://github.com/princeton-nlp/LLM-Shearing/blob/8bb2f7c6b494edba50e52ee70ac334ec315cc43a/llmshearing/scripts/continue_pretrain.sh#L14C1-L14C5

The path variable is defined but never used. Where should the pruned model's checkpoint path be passed in?
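
For illustration, a minimal sketch of the kind of wiring one would expect here, assuming the training entry point accepts OmegaConf-style dotted overrides (the usual pattern for composer-launched scripts). The file names and override keys below are placeholders, not necessarily the repository's actual ones; the fixed scripts in the repo are the authoritative reference.

    # Hypothetical sketch -- paths, config names, and override keys are placeholders.
    path=/path/to/pruned_model/latest-rank0.pt   # checkpoint produced by the pruning stage
    config=yamls/continue_pretrain.yaml          # placeholder config file

    # Forward ${path} explicitly so the trainer actually consumes it:
    composer train.py ${config} \
        model.path=${path} \
        save_folder=outputs/continued_pretrain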

xiamengzhou commented 9 months ago

Thanks for catching this! The scripts have been updated.

Longyichen commented 9 months ago

Hi Mengzhou, I changed the code and it now prints that the weights are loaded from my path, so that part is OK.

However, it raises a new problem: the loss stays exactly the same across batches.

It looks as if the gradients are not being computed and the model is not training normally. Also, the warm-start loss right after loading (about 10) is much higher than the loss I reached during pruning (about 2) on the same dataset. Is this normal?

[batch=366/48000]:
         Train time/batch: 365
         Train time/sample: 93440
         Train time/batch_in_epoch: 365
         Train time/sample_in_epoch: 93440
         Train time/token: 382730240
         Train time/token_in_epoch: 382730240
         Train metrics/train/cc_weight: 0.2192
         Train metrics/train/github_weight: 0.0002
         Train metrics/train/book_weight: 0.0791
         Train metrics/train/stackexchange_weight: 0.0064
         Train metrics/train/wiki_weight: 0.0096
         Train metrics/train/arxiv_weight: 0.0010
         Train metrics/train/c4-rp_weight: 0.6845
         Train memory/current_allocated_mem: 9.7173
         Train memory/current_active_mem: 9.7173
         Train memory/current_inactive_mem: 0.6447
         Train memory/current_reserved_mem: 51.3280
         Train memory/peak_allocated_mem: 44.6420
         Train memory/peak_active_mem: 44.8020
         Train memory/peak_inactive_mem: 17.7940
         Train memory/peak_reserved_mem: 51.3280
         Train memory/alloc_retries: 0
         Train trainer/device_train_microbatch_size: 16
         Train loss/train/total: 10.3750
         Train loss/train/ce_loss: 10.3750
         Train metrics/train/LanguageCrossEntropy: 10.3750
         Train metrics/train/Perplexity: 32048.3164
         Train metrics/train/cc_LanguageCrossEntropy: 10.3750
         Train metrics/train/cc_count: 746
         Train metrics/train/github_LanguageCrossEntropy: nan
         Train metrics/train/github_count: 0
         Train metrics/train/book_LanguageCrossEntropy: 10.3750
         Train metrics/train/book_count: 250
         Train metrics/train/stackexchange_LanguageCrossEntropy: 10.3750
         Train metrics/train/stackexchange_count: 18
         Train metrics/train/wiki_LanguageCrossEntropy: 10.3750
         Train metrics/train/wiki_count: 33
         Train metrics/train/arxiv_LanguageCrossEntropy: nan
         Train metrics/train/arxiv_count: 2
         Train metrics/train/c4-rp_LanguageCrossEntropy: 10.3750
         Train metrics/train/c4-rp_count: 2279
         Train throughput/batches_per_sec: 0.0414
         Train throughput/samples_per_sec: 10.5889
         Train throughput/device/batches_per_sec: 0.0052
         Train throughput/device/samples_per_sec: 1.3236
         Train throughput/tokens_per_sec: 43372.1727
         Train throughput/device/tokens_per_sec: 5421.5216
         Train throughput/flops_per_sec: 877674723199506.0000
         Train throughput/device/flops_per_sec: 109709340399938.2500
         Train throughput/device/mfu: 0.3516
         Train time/train: 2.4628
         Train time/val: 0.0000
         Train time/total: 2.4628
[batch=367/48000]:
         Train time/batch: 366
         Train time/sample: 93696
         Train time/batch_in_epoch: 366
         Train time/sample_in_epoch: 93696
         Train time/token: 383778816
         Train time/token_in_epoch: 383778816
         Train metrics/train/cc_weight: 0.2192
         Train metrics/train/github_weight: 0.0002
         Train metrics/train/book_weight: 0.0791
         Train metrics/train/stackexchange_weight: 0.0064
         Train metrics/train/wiki_weight: 0.0096
         Train metrics/train/arxiv_weight: 0.0010
         Train metrics/train/c4-rp_weight: 0.6845
         Train memory/current_allocated_mem: 9.7173
         Train memory/current_active_mem: 9.7173
         Train memory/current_inactive_mem: 0.6447
         Train memory/current_reserved_mem: 51.3280
         Train memory/peak_allocated_mem: 44.6420
         Train memory/peak_active_mem: 44.8020
         Train memory/peak_inactive_mem: 17.7940
         Train memory/peak_reserved_mem: 51.3280
         Train memory/alloc_retries: 0
         Train trainer/device_train_microbatch_size: 16
         Train loss/train/total: 10.3750
         Train loss/train/ce_loss: 10.3750
         Train metrics/train/LanguageCrossEntropy: 10.3750
         Train metrics/train/Perplexity: 32048.3164
         Train metrics/train/cc_LanguageCrossEntropy: 10.3750
         Train metrics/train/cc_count: 803
         Train metrics/train/github_LanguageCrossEntropy: nan
         Train metrics/train/github_count: 0
         Train metrics/train/book_LanguageCrossEntropy: 10.3750
         Train metrics/train/book_count: 273
         Train metrics/train/stackexchange_LanguageCrossEntropy: 10.3750
         Train metrics/train/stackexchange_count: 20
         Train metrics/train/wiki_LanguageCrossEntropy: 10.3750
         Train metrics/train/wiki_count: 35
         Train metrics/train/arxiv_LanguageCrossEntropy: 10.3750
         Train metrics/train/arxiv_count: 3
         Train metrics/train/c4-rp_LanguageCrossEntropy: 10.3750
         Train metrics/train/c4-rp_count: 2450
         Train throughput/batches_per_sec: 0.0413
         Train throughput/samples_per_sec: 10.5839
         Train throughput/device/batches_per_sec: 0.0052
         Train throughput/device/samples_per_sec: 1.3230
         Train throughput/tokens_per_sec: 43351.5384
         Train throughput/device/tokens_per_sec: 5418.9423
         Train throughput/flops_per_sec: 877257170181446.8750
         Train throughput/device/flops_per_sec: 109657146272680.8594
         Train throughput/device/mfu: 0.3515
         Train time/train: 2.4695
         Train time/val: 0.0000
         Train time/total: 2.4695
[batch=368/48000]:
         Train time/batch: 367
         Train time/sample: 93952
         Train time/batch_in_epoch: 367
         Train time/sample_in_epoch: 93952
         Train time/token: 384827392
         Train time/token_in_epoch: 384827392
         Train metrics/train/cc_weight: 0.2192
         Train metrics/train/github_weight: 0.0002
         Train metrics/train/book_weight: 0.0791
         Train metrics/train/stackexchange_weight: 0.0064
         Train metrics/train/wiki_weight: 0.0096
         Train metrics/train/arxiv_weight: 0.0010
         Train metrics/train/c4-rp_weight: 0.6845
         Train memory/current_allocated_mem: 9.7173
         Train memory/current_active_mem: 9.7173
         Train memory/current_inactive_mem: 0.6447
         Train memory/current_reserved_mem: 51.3280
         Train memory/peak_allocated_mem: 44.6430
         Train memory/peak_active_mem: 44.8020
         Train memory/peak_inactive_mem: 17.7940
         Train memory/peak_reserved_mem: 51.3280
         Train memory/alloc_retries: 0
         Train trainer/device_train_microbatch_size: 16
         Train loss/train/total: 10.3750
         Train loss/train/ce_loss: 10.3750
         Train metrics/train/LanguageCrossEntropy: 10.3750
         Train metrics/train/Perplexity: 32048.3164
         Train metrics/train/cc_LanguageCrossEntropy: 10.3750
         Train metrics/train/cc_count: 861
         Train metrics/train/github_LanguageCrossEntropy: nan
         Train metrics/train/github_count: 0
         Train metrics/train/book_LanguageCrossEntropy: 10.3750
         Train metrics/train/book_count: 293
         Train metrics/train/stackexchange_LanguageCrossEntropy: 10.3750
         Train metrics/train/stackexchange_count: 22
         Train metrics/train/wiki_LanguageCrossEntropy: 10.3750
         Train metrics/train/wiki_count: 40
         Train metrics/train/arxiv_LanguageCrossEntropy: 10.3750
         Train metrics/train/arxiv_count: 4
         Train metrics/train/c4-rp_LanguageCrossEntropy: 10.3750
         Train metrics/train/c4-rp_count: 2620
         Train throughput/batches_per_sec: 0.0412
         Train throughput/samples_per_sec: 10.5394
         Train throughput/device/batches_per_sec: 0.0051
         Train throughput/device/samples_per_sec: 1.3174
         Train throughput/tokens_per_sec: 43169.4561
         Train throughput/device/tokens_per_sec: 5396.1820
         Train throughput/flops_per_sec: 873572571549509.1250
         Train throughput/device/flops_per_sec: 109196571443688.6406
         Train throughput/device/mfu: 0.3500
         Train time/train: 2.4766
         Train time/val: 0.0000
         Train time/total: 2.4766
Longyichen commented 9 months ago

I trained on a single card and it runs normally there, so there still seems to be a compatibility issue between this code and composer on multiple cards, possibly in distributed model loading or data sharding. Would it be possible to run this code with the DeepSpeed framework instead?

xiamengzhou commented 9 months ago

Hey, I think there is a bug here, though I'm not sure yet what it is. I'm working on it now.

xiamengzhou commented 9 months ago

Hi, the issue is resolved! It stemmed from the init_device setting in the YAML files. It was originally set to meta, which caused unexpected issues when loading the model; it is now switched to cpu. Ideally we would like to support meta loading, since it is faster, but I am not sure yet how to integrate it with the current codebase. Thanks for spotting this!
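
For readers hitting the same symptom, the change amounts to initializing the model on CPU rather than on the meta device. A minimal sketch, assuming an llm-foundry-style config layout; the file name and override syntax are illustrative, not taken from the repository:

    # Illustrative only -- file name and override mechanism are assumptions.
    # Option A: edit the model section of the YAML config:
    #   model:
    #     init_device: cpu   # was: meta, which caused the model-loading issue described above
    #
    # Option B: override at launch time, assuming OmegaConf-style dotted overrides:
    composer train.py yamls/continue_pretrain.yaml \
        model.init_device=cpu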

PS. The codebase has changed a lot since my runs for the paper (mostly to keep it compatible with the up-to-date composer package), so there could be issues here and there, as it is not fully tested. Thanks for your work on this!

Longyichen commented 9 months ago

Hi Mengzhou, thank you very much for the quick fix. I will test the new code tomorrow. Problems in code are normal, and fortunately I have now basically run through the whole framework; the process was a bit difficult, but your work is very interesting and meaningful, so I am happy to take part in reproducing it.

One more piece of good news: I found a solution to the problem that bothered me earlier, where trainer.fit would hang on multiple cards without reporting any error. I raised an issue in the composer library and found the relevant fix with their help. When a run in this repository is interrupted, shared memory can be left behind by a bug in the streaming code, and these zombie segments need to be cleaned up promptly to avoid hangs. For details, see the following two issues:

https://github.com/mosaicml/llm-foundry/issues/436#issuecomment-1627712085
https://github.com/mosaicml/composer/issues/2733

I wrote a script for the cleanup. If you need it, I will create a new branch and merge it in.
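
For context, the kind of cleanup involved looks roughly like the sketch below. This is illustrative only (not the script mentioned above) and assumes the leaked segments appear as /dev/shm files that no live process still holds open after a killed run; narrow the glob to your job's segments before running anything like this for real.

    # Illustrative cleanup sketch -- NOT the script referenced in this comment.
    # Assumption: orphaned shared-memory segments from a killed run show up as
    # /dev/shm files that no running process keeps open.
    for shm in /dev/shm/*; do
        [ -f "$shm" ] || continue
        # fuser -s exits non-zero when no process has the file open
        if ! fuser -s "$shm" 2>/dev/null; then
            echo "removing orphaned shared-memory segment: $shm"
            rm -f "$shm"
        fi
    done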

xiamengzhou commented 9 months ago

Hi! Awesome :) Feel free to start a PR on it!

argitrage commented 9 months ago

Hey @Longyichen, I am also facing an issue where pruning gets stuck after 'Starting Training'.

Could you walk me through the changes you made to solve this?

Longyichen commented 9 months ago

@argitrage see https://github.com/princeton-nlp/LLM-Shearing/pull/30