lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

multiprocessing train error #555

Open landerson85 opened 1 year ago

landerson85 commented 1 year ago

```
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3765 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3766 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 2 (pid: 3767) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
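A note not from the original report: when a `torch.distributed` worker dies with a negative `exitcode`, the absolute value is the signal number that killed it. Here `-7` maps to `SIGBUS` on Linux, which in containerized training is commonly caused by an exhausted shared-memory segment (`/dev/shm`). A minimal sketch to decode the code:

```python
import signal

# torch.distributed reports a negative exitcode when a worker is
# killed by a signal; the signal number is its absolute value.
exitcode = -7
sig = signal.Signals(-exitcode)

# Signal numbering is platform-dependent; on Linux, 7 is SIGBUS.
print(sig.name)
```

If `SIGBUS` is confirmed and the job runs inside Docker, raising the container's shared memory (e.g. `--shm-size`) is a common first thing to try; that suggestion is an assumption, not something verified in this thread.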

Train run args:

```
torchrun --nnodes=1 --nproc_per_node=3 --master_port=20001 train/train.py \
    --model_name_or_path /data/model/vicuna/vicuna-7b \
    --data_path playground/data/dummy.json \
    --bf16 True \
    --output_dir /data/app/output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --logging_dir "/data/app/output" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess False
```
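An aside not in the original report, but useful when sanity-checking these flags: the effective global batch size implied by the command is `nproc_per_node × per_device_train_batch_size × gradient_accumulation_steps`. A quick check with a hypothetical helper:

```python
# Hypothetical helper (not part of FastChat): compute the effective
# global batch size implied by the torchrun flags above.
def effective_batch_size(nproc_per_node: int,
                         per_device_batch: int,
                         grad_accum_steps: int) -> int:
    return nproc_per_node * per_device_batch * grad_accum_steps

# 3 GPUs x 4 per-device x 8 accumulation steps
print(effective_batch_size(3, 4, 8))  # 96
```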

landerson85 commented 1 year ago

```
PyTorch version: 2.0.0+cu118
CUDA version: 11.8
cuDNN version: 8700
CUDA HOME: /usr/local/cuda
Available GPUs: 3
```

nouf01 commented 6 months ago

Same error here. Did you manage to solve it?