jiaweizzhao / GaLore

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Apache License 2.0

pad_token_id #63

Closed xay2001 closed 1 month ago

xay2001 commented 1 month ago

How to solve it?

```
Update steps:   1%|▏  | 87/10000 [04:05<7:45:49, 2.82s/it]
2024-10-15 11:29:38.957 | INFO | __main__:main:519 - Saving model and optimizer to checkpoints/llama_60m-2024-10-15-11-25-20/model_87, update step 87
[2024-10-15 11:29:39,164] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/transformers/generation/configuration_utils.py", line 771, in save_pretrained
[rank0]:     raise ValueError(str([w.message for w in caught_warnings]))
[rank0]: ValueError: [UserWarning('`pad_token_id` should be positive but got -1. This will cause errors when batch generating, if there is padding. Please set `pad_token_id` explicitly by `model.generation_config.pad_token_id=PAD_TOKEN_ID` to avoid errors in generation, and ensure your `input_ids` input does not have negative values.')]

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/xay/GaLore/torchrun_main.py", line 571, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/xay/GaLore/torchrun_main.py", line 521, in main
[rank0]:     model.module.save_pretrained(current_model_directory)
[rank0]:   File "/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2571, in save_pretrained
[rank0]:     model_to_save.generation_config.save_pretrained(save_directory)
[rank0]:   File "/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/transformers/generation/configuration_utils.py", line 773, in save_pretrained
[rank0]:     raise ValueError(
[rank0]: ValueError: The generation config instance is invalid -- `.validate()` throws warnings and/or exceptions. Fix these issues to save the configuration.

[rank0]: Thrown during validation:
[rank0]: [UserWarning('`pad_token_id` should be positive but got -1. This will cause errors when batch generating, if there is padding. Please set `pad_token_id` explicitly by `model.generation_config.pad_token_id=PAD_TOKEN_ID` to avoid errors in generation, and ensure your `input_ids` input does not have negative values.')]

wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/xay/GaLore/wandb/offline-run-20241015_112518-ftuz697y
wandb: Find logs at: wandb/offline-run-20241015_112518-ftuz697y/logs
E1015 11:29:45.294541 140031983350976 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 2990010) of binary: /home/xay/.conda/envs/owlore/bin/python
Traceback (most recent call last):
  File "/home/xay/.conda/envs/owlore/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

torchrun_main.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-15_11:29:45
  host      : manager.example.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2990010)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
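The warning embedded in the traceback already points at a workaround: assign a non-negative `pad_token_id` on the model's `generation_config` before `save_pretrained` is called. A minimal sketch of that fallback logic (plain Python; the helper name `resolve_pad_token_id` and the eos-token fallback are illustrative assumptions, not GaLore code):

```python
def resolve_pad_token_id(pad_token_id, eos_token_id):
    """Return a usable pad token id.

    transformers' generation-config validation rejects a negative
    pad_token_id at save time. A common convention when the tokenizer
    defines no pad token is to fall back to the eos token id.
    """
    if pad_token_id is not None and pad_token_id >= 0:
        return pad_token_id  # already valid, keep it
    return eos_token_id      # replace -1 / None with the eos id


# In torchrun_main.py this would go just before the failing
# model.module.save_pretrained(current_model_directory) call, e.g.
# (assuming `model` and `tokenizer` as in the traceback above):
#   model.module.generation_config.pad_token_id = resolve_pad_token_id(
#       model.module.generation_config.pad_token_id,
#       tokenizer.eos_token_id,
#   )

print(resolve_pad_token_id(-1, 2))  # the failing -1 is replaced by the eos id 2
print(resolve_pad_token_id(0, 2))   # a valid id (0) is kept as-is
```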
itongggg commented 1 week ago

(Quotes the same `pad_token_id` traceback as the original post above.)

I want to know how you solved it.
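For anyone hitting the same error: the failure happens inside transformers' generation-config validation at save time, so any fix has to land before `save_pretrained` runs. A stand-in sketch of that check (a stub class for illustration, not the real transformers code; the id `2` stands in for an eos token id):

```python
class GenerationConfigStub:
    """Illustrative stand-in for the pad_token_id check in the traceback."""

    def __init__(self, pad_token_id=None):
        self.pad_token_id = pad_token_id

    def validate(self):
        # Mirrors the warning above: a negative pad_token_id makes the
        # config invalid, and save_pretrained refuses to write it.
        if self.pad_token_id is not None and self.pad_token_id < 0:
            raise ValueError(
                f"pad_token_id should be positive but got {self.pad_token_id}"
            )


cfg = GenerationConfigStub(pad_token_id=-1)
try:
    cfg.validate()  # raises, matching the save-time failure above
    save_would_fail = False
except ValueError:
    save_would_fail = True

cfg.pad_token_id = 2  # assign a valid (non-negative) id, e.g. the eos id
cfg.validate()        # passes now, so saving would proceed
print(save_would_fail)  # True
```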