How can I solve this?
```
Update steps:   1%|▏         | 87/10000 [04:05<7:45:49, 2.82s/it]
2024-10-15 11:29:38.957 | INFO | __main__:main:519 - Saving model and optimizer to checkpoints/llama_60m-2024-10-15-11-25-20/model_87, update step 87
[2024-10-15 11:29:39,164] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/transformers/generation/configuration_utils.py", line 771, in save_pretrained
[rank0]:     raise ValueError(str([w.message for w in caught_warnings]))
[rank0]: ValueError: [UserWarning('`pad_token_id` should be positive but got -1. This will cause errors when batch generating, if there is padding. Please set `pad_token_id` explicitly by `model.generation_config.pad_token_id=PAD_TOKEN_ID` to avoid errors in generation, and ensure your `input_ids` input does not have negative values.')]

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/xay/GaLore/torchrun_main.py", line 571, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/xay/GaLore/torchrun_main.py", line 521, in main
[rank0]:     model.module.save_pretrained(current_model_directory)
[rank0]:   File "/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2571, in save_pretrained
[rank0]:     model_to_save.generation_config.save_pretrained(save_directory)
[rank0]:   File "/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/transformers/generation/configuration_utils.py", line 773, in save_pretrained
[rank0]:     raise ValueError(
[rank0]: ValueError: The generation config instance is invalid -- `.validate()` throws warnings and/or exceptions. Fix these issues to save the configuration.
[rank0]: Thrown during validation:
[rank0]: [UserWarning('`pad_token_id` should be positive but got -1. This will cause errors when batch generating, if there is padding. Please set `pad_token_id` explicitly by `model.generation_config.pad_token_id=PAD_TOKEN_ID` to avoid errors in generation, and ensure your `input_ids` input does not have negative values.')]
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/xay/GaLore/wandb/offline-run-20241015_112518-ftuz697y
wandb: Find logs at: wandb/offline-run-20241015_112518-ftuz697y/logs
E1015 11:29:45.294541 140031983350976 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 2990010) of binary: /home/xay/.conda/envs/owlore/bin/python
Traceback (most recent call last):
  File "/home/xay/.conda/envs/owlore/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/xay/.conda/envs/owlore/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
torchrun_main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-15_11:29:45
  host      : manager.example.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2990010)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
I want to know how you solved it.
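For reference, the `UserWarning` in the traceback already points at a workaround: give the generation config a valid, non-negative `pad_token_id` before the checkpoint is written. Below is a minimal sketch of that idea, not the repository's own fix: it assumes `model` is the Hugging Face model being saved and `tokenizer` is its tokenizer, and reusing the EOS token as the padding token is my assumption.

```python
def ensure_valid_pad_token_id(model, tokenizer):
    """Hypothetical helper: make sure the generation config has a
    non-negative pad_token_id so that save_pretrained() passes
    validation (see the ValueError in the traceback above)."""
    pad_id = tokenizer.pad_token_id
    if pad_id is None or pad_id < 0:
        pad_id = tokenizer.eos_token_id  # assumption: reuse EOS as padding
    model.config.pad_token_id = pad_id
    model.generation_config.pad_token_id = pad_id  # what the warning asks for
    return pad_id
```

In `torchrun_main.py` this would be called once after the model and tokenizer are created, before the checkpoint-saving call at line 521 (`model.module.save_pretrained(current_model_directory)`), so that `generation_config.save_pretrained()` no longer fails validation.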