[2023-11-20 19:18:31,206] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2061498) of binary

dshwei commented 10 months ago

Describe the bug


CUDA_VISIBLE_DEVICES=1 torchrun --nproc_per_node 1 supervised_finetuning.py \
    --model_type chatglm \
    --model_name_or_path /home/chatglm3-6b/ \
    --train_file_dir ./data/finetune \
    --validation_file_dir ./data/finetune \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft True \
    --fp16 \
    --max_train_samples 1000 \
    --max_eval_samples 10 \
    --model_max_length 1024 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.05 \
    --weight_decay 0.05 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 13 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 4 \
    --output_dir outputs-sft-chatglm-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache

shibing624 commented 10 months ago

CUDA_VISIBLE_DEVICES=1 python supervised_finetuning.py

dshwei commented 10 months ago

Is there something wrong with CUDA_VISIBLE_DEVICES=1 torchrun --nproc_per_node 1 supervised_finetuning.py? It runs fine in run_pt.sh.

dshwei commented 10 months ago

Running CUDA_VISIBLE_DEVICES=1 python supervised_finetuning.py produces this error: TypeError: _set_gradient_checkpointing() got an unexpected keyword argument 'enable'

shibing624 commented 10 months ago

Comment out the _set_gradient_checkpointing line.
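
For anyone hitting the same TypeError: one plausible cause, not confirmed in this thread, is that the model's own code defines _set_gradient_checkpointing with an older signature, while the caller passes an enable keyword. The sketch below is only an illustration of that mismatch in plain Python. OldStyleModel and accepts_enable_kwarg are hypothetical names made up for this example; they are not part of supervised_finetuning.py or any library.

```python
import inspect


class OldStyleModel:
    """Hypothetical stand-in for a model whose hook uses an older signature."""

    def _set_gradient_checkpointing(self, module, value=False):
        # Older-style hook: toggles a flag on a single submodule.
        module.gradient_checkpointing = value


def accepts_enable_kwarg(model) -> bool:
    """Return True if the model's hook can be called with enable=..."""
    hook = getattr(model, "_set_gradient_checkpointing", None)
    if hook is None:
        return False
    params = inspect.signature(hook).parameters.values()
    return any(
        p.name == "enable" or p.kind is inspect.Parameter.VAR_KEYWORD
        for p in params
    )


model = OldStyleModel()

# Calling the old-style hook with the newer keyword raises the same kind
# of TypeError as the one reported in this issue.
try:
    model._set_gradient_checkpointing(enable=True)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

print(accepts_enable_kwarg(model))
print(error_message)
```

A check like this can tell you whether the hook is safe to call with enable before deciding to comment the call out or to skip gradient checkpointing for that model.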

shibing624 / MedicalGPT

[2023-11-20 19:18:31,206] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2061498) of binary #265
