hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Distributed training error when LoRA fine-tuning the ChatGLM2-6B model #4296

Closed LiXibat-ai closed 4 months ago

LiXibat-ai commented 4 months ago

Reminder

System Info

```
bin C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll
CUDA SETUP: CUDA runtime path found: C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\bin\cudart64_12.dll
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: Loading binary C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll...
[2024-06-14 23:05:36,809] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-14 23:05:37,213] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
Traceback (most recent call last):
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\Scripts\llamafactory-cli.exe\__main__.py", line 7, in <module>
    sys.exit(main())
  File "C:\Users\luoxiaojie\LLaMA-Factory\src\llmtuner\cli.py", line 59, in main
    raise NotImplementedError("Unknown command: {}".format(command))
NotImplementedError: Unknown command: env
```
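(The `Unknown command: env` traceback suggests the `llamafactory-cli` on the path comes from an older install that predates the `env` subcommand. As a hedged check of what is actually installed in the active environment, either of the following can be run; the distribution name may be `llamafactory` or the older `llmtuner`, depending on the version:)

```
pip show llamafactory
pip show llmtuner
```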

Reproduction

The following error appears when running the fine-tuning:

```
[2024-06-14 22:53:57,516] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).

===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll CUDA SETUP: CUDA runtime path found: C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\bin\cudart64_12.dll CUDA SETUP: Highest compute capability among GPUs detected: 8.6 CUDA SETUP: Detected CUDA version 121 CUDA SETUP: Loading binary C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll...

===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll
CUDA SETUP: CUDA runtime path found: C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\bin\cudart64_12.dll
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: Loading binary C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll...
[2024-06-14 22:54:07,123] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-14 22:54:07,272] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-14 22:54:07,672] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-06-14 22:54:07,795] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W socket.cpp:663] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\datasets\load.py:2547: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
06/14/2024 22:54:09 - WARNING - glmtuner.tuner.core.parser - ddp_find_unused_parameters needs to be set as False in DDP training.
06/14/2024 22:54:09 - INFO - glmtuner.tuner.core.parser - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: True
06/14/2024 22:54:09 - INFO - glmtuner.tuner.core.parser - Training/evaluation parameters Seq2SeqTrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=False, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=False, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=True, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generation_config=None, generation_max_length=None, generation_num_beams=None, gradient_accumulation_steps=4, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=5e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True,
logging_dir=checkpoint-500\runs\Jun14_22-54-09_WIN-M0DSVMPGKA9, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=10, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=500.0, optim=adamw_torch, optim_args=None, output_dir=checkpoint-500, overwrite_output_dir=False, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=16, predict_with_generate=False, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=[], resume_from_checkpoint=None, run_name=checkpoint-500, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=1000, save_strategy=steps, save_total_limit=None, seed=42, skip_memory_metrics=True, sortish_sampler=False, split_batches=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, ) 06/14/2024 22:54:09 - INFO - glmtuner.dsets.loader - Loading dataset shuffled_output.json... 06/14/2024 22:54:09 - WARNING - glmtuner.dsets.loader - Checksum failed for data\shuffled_output.json. It may vary depending on the platform. Using custom data configuration default-7f47027f643f514e Loading Dataset Infos from C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\datasets\packaged_modules\json Overwrite dataset info from restored data version if exists. Loading Dataset info from C:\Users\luoxiaojie.cache\huggingface\datasets/json/default-7f47027f643f514e/0.0.0/c8d2d9508a2a2067ab02cd118834ecef34c3700d143b31835ec4235bf10109f7 Found cached dataset json (C:/Users/luoxiaojie/.cache/huggingface/datasets/json/default-7f47027f643f514e/0.0.0/c8d2d9508a2a2067ab02cd118834ecef34c3700d143b31835ec4235bf10109f7) Loading Dataset info from C:/Users/luoxiaojie/.cache/huggingface/datasets/json/default-7f47027f643f514e/0.0.0/c8d2d9508a2a2067ab02cd118834ecef34c3700d143b31835ec4235bf10109f7 [INFO|tokenization_utils_base.py:2025] 2024-06-14 22:54:10,425 >> loading file tokenizer.model [INFO|tokenization_utils_base.py:2025] 2024-06-14 22:54:10,425 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2025] 2024-06-14 22:54:10,425 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2025] 2024-06-14 22:54:10,425 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2025] 2024-06-14 22:54:10,425 >> loading file tokenizer.json 06/14/2024 22:54:10 - WARNING - glmtuner.tuner.core.parser - ddp_find_unused_parameters needs to be set as False in DDP training. 
06/14/2024 22:54:10 - INFO - glmtuner.tuner.core.parser - Process rank: 1, device: cuda:1, n_gpu: 1 distributed training: True, 16-bits training: True 06/14/2024 22:54:10 - INFO - glmtuner.tuner.core.parser - Training/evaluation parameters Seq2SeqTrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=False, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=False, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=True, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generation_config=None, generation_max_length=None, generation_num_beams=None, gradient_accumulation_steps=4, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=5e-05, length_column_name=length, load_best_model_at_end=False, local_rank=1, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=checkpoint-500\runs\Jun14_22-54-09_WIN-M0DSVMPGKA9, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=10, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=500.0, optim=adamw_torch, optim_args=None, output_dir=checkpoint-500, overwrite_output_dir=False, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=16, predict_with_generate=False, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=[], resume_from_checkpoint=None, run_name=checkpoint-500, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=1000, save_strategy=steps, save_total_limit=None, seed=42, skip_memory_metrics=True, sortish_sampler=False, split_batches=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, ) 06/14/2024 22:54:10 - INFO - glmtuner.dsets.loader - Loading dataset shuffled_output.json... 06/14/2024 22:54:10 - WARNING - glmtuner.dsets.loader - Checksum failed for data\shuffled_output.json. It may vary depending on the platform. C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\datasets\load.py:2547: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. 
You can remove this warning by passing 'token=' instead. warnings.warn( [INFO|configuration_utils.py:727] 2024-06-14 22:54:10,606 >> loading configuration file chatglm2-6b\config.json [INFO|configuration_utils.py:727] 2024-06-14 22:54:10,608 >> loading configuration file chatglm2-6b\config.json [INFO|configuration_utils.py:792] 2024-06-14 22:54:10,609 >> Model config ChatGLMConfig { "_name_or_path": "chatglm2-6b", "add_bias_linear": false, "add_qkv_bias": true, "apply_query_key_layer_scaling": true, "apply_residual_connection_post_layernorm": false, "architectures": [ "ChatGLMModel" ], "attention_dropout": 0.0, "attention_softmax_in_fp32": true, "auto_map": { "AutoConfig": "configuration_chatglm.ChatGLMConfig", "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration", "AutoModelForCausalLM": "modeling_chatglm.ChatGLMForConditionalGeneration", "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration", "AutoModelForSequenceClassification": "modeling_chatglm.ChatGLMForSequenceClassification" }, "bias_dropout_fusion": true, "classifier_dropout": null, "eos_token_id": 2, "ffn_hidden_size": 13696, "fp32_residual_connection": false, "hidden_dropout": 0.0, "hidden_size": 4096, "kv_channels": 128, "layernorm_epsilon": 1e-05, "model_type": "chatglm", "multi_query_attention": true, "multi_query_group_num": 2, "num_attention_heads": 32, "num_layers": 28, "original_rope": true, "pad_token_id": 0, "padded_vocab_size": 65024, "post_layer_norm": true, "pre_seq_len": null, "prefix_projection": false, "quantization_bit": 0, "rmsnorm": true, "seq_length": 32768, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.37.2", "use_cache": true, "vocab_size": 65024 }

[INFO|modeling_utils.py:3473] 2024-06-14 22:54:10,860 >> loading weights file chatglm2-6b\pytorch_model.bin.index.json [INFO|configuration_utils.py:826] 2024-06-14 22:54:10,861 >> Generate config GenerationConfig { "eos_token_id": 2, "pad_token_id": 0 }

Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\torch\_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\torch\_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|██████████| 7/7 [00:41<00:00, 5.91s/it]
[INFO|modeling_utils.py:4350] 2024-06-14 22:54:52,300 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.

[INFO|modeling_utils.py:4358] 2024-06-14 22:54:52,300 >> All the weights of ChatGLMForConditionalGeneration were initialized from the model checkpoint at chatglm2-6b. If your task is similar to the task the model of the checkpoint was trained on, you can already use ChatGLMForConditionalGeneration for predictions without further training.
[INFO|modeling_utils.py:3895] 2024-06-14 22:54:52,303 >> Generation config file not found, using a generation config created from the model config.
[WARNING|modeling_utils.py:2132] 2024-06-14 22:54:52,305 >> You are using an old version of the checkpointing format that is deprecated (We will also silently ignore gradient_checkpointing_kwargs in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model.
Traceback (most recent call last):
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\train_bash.py", line 25, in <module>
    main()
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\train_bash.py", line 10, in main
    run_sft(model_args, data_args, training_args, finetuning_args)
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\glmtuner\tuner\sft\workflow.py", line 24, in run_sft
    model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, training_args.do_train, stage="sft")
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\glmtuner\tuner\core\loader.py", line 139, in load_model_and_tokenizer
    model = init_adapter(model, model_args, finetuning_args, is_trainable)
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\glmtuner\tuner\core\adapter.py", line 65, in init_adapter
    assert os.path.exists(os.path.join(model_args.checkpoint_dir[0], WEIGHTS_NAME)), \
AssertionError: Provided path (checkpoint) does not contain a LoRA weight.
06/14/2024 22:54:52 - INFO - glmtuner.tuner.core.adapter - Fine-tuning method: LoRA
Loading checkpoint shards: 100%|██████████| 7/7 [00:41<00:00, 5.95s/it]
[WARNING|modeling_utils.py:2132] 2024-06-14 22:54:53,754 >> You are using an old version of the checkpointing format that is deprecated (We will also silently ignore gradient_checkpointing_kwargs in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model.
Traceback (most recent call last):
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\train_bash.py", line 25, in <module>
    main()
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\train_bash.py", line 10, in main
    run_sft(model_args, data_args, training_args, finetuning_args)
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\glmtuner\tuner\sft\workflow.py", line 24, in run_sft
    model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, training_args.do_train, stage="sft")
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\glmtuner\tuner\core\loader.py", line 139, in load_model_and_tokenizer
    model = init_adapter(model, model_args, finetuning_args, is_trainable)
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\glmtuner\tuner\core\adapter.py", line 65, in init_adapter
    assert os.path.exists(os.path.join(model_args.checkpoint_dir[0], WEIGHTS_NAME)), \
AssertionError: Provided path (checkpoint) does not contain a LoRA weight.
06/14/2024 22:54:53 - INFO - glmtuner.tuner.core.adapter - Fine-tuning method: LoRA
[2024-06-14 22:54:54,586] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 26488 closing signal CTRL_C_EVENT
[2024-06-14 22:54:55,973] torch.distributed.elastic.agent.server.api: [WARNING] Received Signals.SIGINT death signal, shutting down workers
Traceback (most recent call last):
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\accelerate\commands\launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\accelerate\commands\launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\run.py", line 797, in run
    elastic_launch(
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\launcher\api.py", line 255, in launch_agent
    result = agent.run()
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 736, in run
    result = self._invoke_run(role)
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 878, in _invoke_run
    run_result = self._monitor_workers(self._worker_group)
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\agent\server\local_elastic_agent.py", line 307, in _monitor_workers
    result = self._pcontext.wait(0)
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 288, in wait
    return self._poll()
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 668, in _poll
    self.close()  # terminate all running procs
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 331, in close
    self._close(death_sig=death_sig, timeout=timeout)
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 713, in _close
    handler.proc.wait(time_to_wait)
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\subprocess.py", line 1189, in wait
    return self._wait(timeout=timeout)
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\subprocess.py", line 1486, in _wait
    result = _winapi.WaitForSingleObject(self._handle,
  File "C:\Users\luoxiaojie.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 24700 got signal: 2
Traceback (most recent call last):
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\run.py", line 18, in <module>
    os.system(command)
KeyboardInterrupt
```

Expected behavior

```
accelerate launch src/train_bash.py --stage sft --model_name_or_path chatglm2-6b --do_train --dataset shuffled_output --finetuning_type lora --output_dir checkpoint-500 --per_device_train_batch_size 16 --gradient_accumulation_steps 4 --lr_scheduler_type cosine --logging_steps 10 --save_steps 1000 --learning_rate 5e-5 --num_train_epochs 500.0 --plot_loss --fp16 --checkpoint_dir checkpoint
```

The above are my training parameters.
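The AssertionError in the log ("Provided path (checkpoint) does not contain a LoRA weight") comes from the `--checkpoint_dir checkpoint` argument: the assert in `glmtuner/tuner/core/adapter.py` requires that directory to already contain saved LoRA adapter weights. If the intention is to start a fresh LoRA fine-tune rather than resume from an existing adapter (an assumption, not a confirmed diagnosis), a minimal sketch of the same command with that flag dropped would be:

```
accelerate launch src/train_bash.py --stage sft --model_name_or_path chatglm2-6b --do_train --dataset shuffled_output --finetuning_type lora --output_dir checkpoint-500 --per_device_train_batch_size 16 --gradient_accumulation_steps 4 --lr_scheduler_type cosine --logging_steps 10 --save_steps 1000 --learning_rate 5e-5 --num_train_epochs 500.0 --plot_loss --fp16
```

`--checkpoint_dir` would only be passed when it points at a directory that actually holds previously saved LoRA weights.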

Others

No response

hiyouga commented 4 months ago

Update and install: https://github.com/hiyouga/LLaMA-Factory
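(For reference, a minimal sketch of what updating typically looks like, following the repository README; the optional extras may differ between versions, so this is an assumption rather than the maintainer's exact instructions:)

```
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
llamafactory-cli version
```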