shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Train a medical LLM, implementing incremental pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0

Error running pretraining (PT) with baichuan2: torch.distributed.elastic.multiprocessing.errors.ChildFailedError #216

Closed: sunshineyg2018 closed this issue 7 months ago

sunshineyg2018 commented 1 year ago

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 torchrun --nproc_per_node 6 pretraining.py \
    --model_type baichuan \
    --model_name_or_path /root/autodl-tmp/baichuan2_inc_13 \
    --train_file_dir /root/autodl-tmp/corpus \
    --validation_file_dir /root/autodl-tmp/corpus \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --do_eval \
    --use_peft True \
    --seed 42 \
    --fp16 \
    --load_in_8bit True \
    --max_train_samples -1 \
    --max_eval_samples -1 \
    --num_train_epochs 0.5 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --block_size 1024 \
    --output_dir /root/autodl-tmp/outputs_pt_baichuan2_13_v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache

The following error is reported:

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 25827) of binary: /root/miniconda3/envs/train/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/train/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/train/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/train/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/train/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
pretraining.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-20_10:59:04
  host      : autodl-container-c1cd47b9b8-7da8de10
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 25827)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Could this be caused by using a baichuan2 model?
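The last line of the traceback also points at how to get more detail: torchrun only surfaces ChildFailedError, and the worker's real exception is lost unless it is recorded. A minimal sketch of capturing it with the @record decorator, assuming pretraining.py exposes a main() entry point (not shown here):

from torch.distributed.elastic.multiprocessing.errors import record

@record  # on failure, writes the worker's actual traceback to an error file
def main():
    # ... existing training logic of pretraining.py (assumption) ...
    pass

if __name__ == "__main__":
    main()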

shibing624 commented 1 year ago

Can it run on a single GPU? Maybe you are running out of GPU memory.

sunshineyg2018 commented 1 year ago

> Can it run on a single GPU? Maybe you are running out of GPU memory.

Do you mean insufficient GPU memory? Each GPU has 24 GB, and I'm using LoRA with 8-bit loading (load_in_8bit).
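Before launching, it can also help to confirm how much memory is actually free on each visible card; this is a generic PyTorch check, not something specific to MedicalGPT:

import torch

# Print total and currently free memory for every visible GPU.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i}: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")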

shibing624 commented 1 year ago

In my tests, baichuan2-13b needs more than 25 GB of GPU memory to run.
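Rough arithmetic supports that estimate (approximate figures, not measurements): with load_in_8bit, the 13B base weights alone take roughly 12 GiB, before counting activations, LoRA adapters, and optimizer state.

# Back-of-the-envelope memory estimate for baichuan2-13b with load_in_8bit
# (assumed figures; real usage also depends on block_size, batch size and overhead).
n_params = 13e9                       # ~13B parameters
weights_gib = n_params * 1 / 1024**3  # 1 byte per weight in int8
print(f"int8 base weights: ~{weights_gib:.1f} GiB")
# Activations at block_size=1024, LoRA adapter weights plus their fp16
# gradients and optimizer states, and CUDA/bitsandbytes overhead come on top,
# so a 24 GB card is borderline, consistent with the ~25 GB figure above.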

stale[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. (Closed automatically by the bot due to prolonged inactivity; feel free to ask again if needed.)

LanShanPi commented 9 months ago

Any update? Did you solve it? I'm on an 80 GB GPU and I get the same error.

LanShanPi commented 9 months ago

I'm using chatglm2, the 6B one.