Reminder
[X] I have read the README and searched the existing issues.
System Info
bin C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll
CUDA SETUP: CUDA runtime path found: C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\bin\cudart64_12.dll
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: Loading binary C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll...
[2024-06-14 23:05:36,809] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-14 23:05:37,213] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
Traceback (most recent call last):
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\Scripts\llamafactory-cli.exe\__main__.py", line 7, in <module>
    sys.exit(main())
  File "C:\Users\luoxiaojie\LLaMA-Factory\src\llmtuner\cli.py", line 59, in main
    raise NotImplementedError("Unknown command: {}".format(command))
NotImplementedError: Unknown command: env
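Note: the `llamafactory-cli env` subcommand is not implemented in the installed version, which is what raises the NotImplementedError above. As a stopgap, here is a minimal sketch for collecting the same environment details by hand, using only standard platform/torch/transformers APIs (this is not the actual `env` command implementation):

```python
# Manual environment report, since `llamafactory-cli env` is unavailable here.
# Every call below is a standard public API.
import platform

import torch
import transformers

print(f"OS: {platform.platform()}")
print(f"Python: {platform.python_version()}")
print(f"PyTorch: {torch.__version__} (built for CUDA {torch.version.cuda})")
print(f"Transformers: {transformers.__version__}")
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
```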
Reproduction
The following error occurred during fine-tuning:
[2024-06-14 22:53:57,516] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll
CUDA SETUP: CUDA runtime path found: C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\bin\cudart64_12.dll
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: Loading binary C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll...
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll
CUDA SETUP: CUDA runtime path found: C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\bin\cudart64_12.dll
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: Loading binary C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll...
[2024-06-14 22:54:07,123] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-14 22:54:07,272] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-14 22:54:07,672] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-06-14 22:54:07,795] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W socket.cpp:663] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\datasets\load.py:2547: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=<use_auth_token>' instead.
warnings.warn(
06/14/2024 22:54:09 - WARNING - glmtuner.tuner.core.parser - `ddp_find_unused_parameters` needs to be set as False in DDP training.
06/14/2024 22:54:09 - INFO - glmtuner.tuner.core.parser - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: True
06/14/2024 22:54:09 - INFO - glmtuner.tuner.core.parser - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=4,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=checkpoint-500\runs\Jun14_22-54-09_WIN-M0DSVMPGKA9,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=500.0,
optim=adamw_torch,
optim_args=None,
output_dir=checkpoint-500,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=16,
predict_with_generate=False,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=checkpoint-500,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=1000,
save_strategy=steps,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
sortish_sampler=False,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
06/14/2024 22:54:09 - INFO - glmtuner.dsets.loader - Loading dataset shuffled_output.json...
06/14/2024 22:54:09 - WARNING - glmtuner.dsets.loader - Checksum failed for data\shuffled_output.json. It may vary depending on the platform.
Using custom data configuration default-7f47027f643f514e
Loading Dataset Infos from C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\datasets\packaged_modules\json
Overwrite dataset info from restored data version if exists.
Loading Dataset info from C:\Users\luoxiaojie\.cache\huggingface\datasets/json/default-7f47027f643f514e/0.0.0/c8d2d9508a2a2067ab02cd118834ecef34c3700d143b31835ec4235bf10109f7
Found cached dataset json (C:/Users/luoxiaojie/.cache/huggingface/datasets/json/default-7f47027f643f514e/0.0.0/c8d2d9508a2a2067ab02cd118834ecef34c3700d143b31835ec4235bf10109f7)
Loading Dataset info from C:/Users/luoxiaojie/.cache/huggingface/datasets/json/default-7f47027f643f514e/0.0.0/c8d2d9508a2a2067ab02cd118834ecef34c3700d143b31835ec4235bf10109f7
[INFO|tokenization_utils_base.py:2025] 2024-06-14 22:54:10,425 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2025] 2024-06-14 22:54:10,425 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2024-06-14 22:54:10,425 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2024-06-14 22:54:10,425 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2024-06-14 22:54:10,425 >> loading file tokenizer.json
06/14/2024 22:54:10 - WARNING - glmtuner.tuner.core.parser - `ddp_find_unused_parameters` needs to be set as False in DDP training.
06/14/2024 22:54:10 - INFO - glmtuner.tuner.core.parser - Process rank: 1, device: cuda:1, n_gpu: 1 distributed training: True, 16-bits training: True
06/14/2024 22:54:10 - INFO - glmtuner.tuner.core.parser - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=4,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=1,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=checkpoint-500\runs\Jun14_22-54-09_WIN-M0DSVMPGKA9,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=500.0,
optim=adamw_torch,
optim_args=None,
output_dir=checkpoint-500,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=16,
predict_with_generate=False,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=checkpoint-500,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=1000,
save_strategy=steps,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
sortish_sampler=False,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
06/14/2024 22:54:10 - INFO - glmtuner.dsets.loader - Loading dataset shuffled_output.json...
06/14/2024 22:54:10 - WARNING - glmtuner.dsets.loader - Checksum failed for data\shuffled_output.json. It may vary depending on the platform.
C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\datasets\load.py:2547: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
warnings.warn(
[INFO|configuration_utils.py:727] 2024-06-14 22:54:10,606 >> loading configuration file chatglm2-6b\config.json
[INFO|configuration_utils.py:727] 2024-06-14 22:54:10,608 >> loading configuration file chatglm2-6b\config.json
[INFO|configuration_utils.py:792] 2024-06-14 22:54:10,609 >> Model config ChatGLMConfig {
  "_name_or_path": "chatglm2-6b",
  "add_bias_linear": false,
  "add_qkv_bias": true,
  "apply_query_key_layer_scaling": true,
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "ChatGLMModel"
  ],
  "attention_dropout": 0.0,
  "attention_softmax_in_fp32": true,
  "auto_map": {
    "AutoConfig": "configuration_chatglm.ChatGLMConfig",
    "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForCausalLM": "modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForSequenceClassification": "modeling_chatglm.ChatGLMForSequenceClassification"
  },
  "bias_dropout_fusion": true,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "ffn_hidden_size": 13696,
  "fp32_residual_connection": false,
  "hidden_dropout": 0.0,
  "hidden_size": 4096,
  "kv_channels": 128,
  "layernorm_epsilon": 1e-05,
  "model_type": "chatglm",
  "multi_query_attention": true,
  "multi_query_group_num": 2,
  "num_attention_heads": 32,
  "num_layers": 28,
  "original_rope": true,
  "pad_token_id": 0,
  "padded_vocab_size": 65024,
  "post_layer_norm": true,
  "pre_seq_len": null,
  "prefix_projection": false,
  "quantization_bit": 0,
  "rmsnorm": true,
  "seq_length": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.37.2",
  "use_cache": true,
  "vocab_size": 65024
}
[INFO|modeling_utils.py:3473] 2024-06-14 22:54:10,860 >> loading weights file chatglm2-6b\pytorch_model.bin.index.json
[INFO|configuration_utils.py:826] 2024-06-14 22:54:10,861 >> Generate config GenerationConfig {
  "eos_token_id": 2,
  "pad_token_id": 0
}
Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\torch\_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\torch\_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|██████████| 7/7 [00:41<00:00, 5.91s/it]
[INFO|modeling_utils.py:4350] 2024-06-14 22:54:52,300 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.
[INFO|modeling_utils.py:4358] 2024-06-14 22:54:52,300 >> All the weights of ChatGLMForConditionalGeneration were initialized from the model checkpoint at chatglm2-6b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use ChatGLMForConditionalGeneration for predictions without further training.
[INFO|modeling_utils.py:3895] 2024-06-14 22:54:52,303 >> Generation config file not found, using a generation config created from the model config.
[WARNING|modeling_utils.py:2132] 2024-06-14 22:54:52,305 >> You are using an old version of the checkpointing format that is deprecated (We will also silently ignore gradient_checkpointing_kwargs in case you passed it). Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model.
Traceback (most recent call last):
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\train_bash.py", line 25, in <module>
    main()
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\train_bash.py", line 10, in main
    run_sft(model_args, data_args, training_args, finetuning_args)
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\glmtuner\tuner\sft\workflow.py", line 24, in run_sft
    model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, training_args.do_train, stage="sft")
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\glmtuner\tuner\core\loader.py", line 139, in load_model_and_tokenizer
    model = init_adapter(model, model_args, finetuning_args, is_trainable)
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\glmtuner\tuner\core\adapter.py", line 65, in init_adapter
    assert os.path.exists(os.path.join(model_args.checkpoint_dir[0], WEIGHTS_NAME)), \
AssertionError: Provided path (checkpoint) does not contain a LoRA weight.
06/14/2024 22:54:52 - INFO - glmtuner.tuner.core.adapter - Fine-tuning method: LoRA
Loading checkpoint shards: 100%|██████████| 7/7 [00:41<00:00, 5.95s/it]
[WARNING|modeling_utils.py:2132] 2024-06-14 22:54:53,754 >> You are using an old version of the checkpointing format that is deprecated (We will also silently ignore gradient_checkpointing_kwargs in case you passed it). Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model.
Traceback (most recent call last):
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\train_bash.py", line 25, in <module>
    main()
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\train_bash.py", line 10, in main
    run_sft(model_args, data_args, training_args, finetuning_args)
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\glmtuner\tuner\sft\workflow.py", line 24, in run_sft
    model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, training_args.do_train, stage="sft")
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\glmtuner\tuner\core\loader.py", line 139, in load_model_and_tokenizer
    model = init_adapter(model, model_args, finetuning_args, is_trainable)
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\src\glmtuner\tuner\core\adapter.py", line 65, in init_adapter
    assert os.path.exists(os.path.join(model_args.checkpoint_dir[0], WEIGHTS_NAME)), \
AssertionError: Provided path (checkpoint) does not contain a LoRA weight.
06/14/2024 22:54:53 - INFO - glmtuner.tuner.core.adapter - Fine-tuning method: LoRA
[2024-06-14 22:54:54,586] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 26488 closing signal CTRL_C_EVENT
[2024-06-14 22:54:55,973] torch.distributed.elastic.agent.server.api: [WARNING] Received Signals.SIGINT death signal, shutting down workers
Traceback (most recent call last):
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\accelerate\commands\launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\accelerate\commands\launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\run.py", line 797, in run
    elastic_launch(
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\launcher\api.py", line 255, in launch_agent
    result = agent.run()
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 736, in run
    result = self._invoke_run(role)
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 878, in _invoke_run
    run_result = self._monitor_workers(self._worker_group)
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\agent\server\local_elastic_agent.py", line 307, in _monitor_workers
    result = self._pcontext.wait(0)
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 288, in wait
    return self._poll()
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 668, in _poll
    self.close()  # terminate all running procs
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 331, in close
    self._close(death_sig=death_sig, timeout=timeout)
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 713, in _close
    handler.proc.wait(time_to_wait)
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\subprocess.py", line 1189, in wait
    return self._wait(timeout=timeout)
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\subprocess.py", line 1486, in _wait
    result = _winapi.WaitForSingleObject(self._handle,
  File "C:\Users\luoxiaojie\.conda\envs\pytorch212-lxj\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 24700 got signal: 2
Traceback (most recent call last):
  File "C:\Users\luoxiaojie\ChatGLM-Efficient-Tuning\run.py", line 18, in <module>
    os.system(command)
KeyboardInterrupt
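The run actually aborts at the assertion in src\glmtuner\tuner\core\adapter.py, line 65: the directory passed via --checkpoint_dir contains no saved LoRA adapter. Below is a standalone sketch of the same pre-flight check, assuming WEIGHTS_NAME resolves to PEFT's default "adapter_model.bin" (an assumption; the repo may define a different file name):

```python
# Mirrors the failing assertion in adapter.py as a standalone pre-flight check.
# Assumption: the LoRA weight file is "adapter_model.bin" (PEFT's default name);
# adjust WEIGHTS_NAME if ChatGLM-Efficient-Tuning uses a different constant.
import os

WEIGHTS_NAME = "adapter_model.bin"

def has_lora_weight(checkpoint_dir: str) -> bool:
    """Return True if checkpoint_dir contains a saved LoRA adapter weight."""
    return os.path.exists(os.path.join(checkpoint_dir, WEIGHTS_NAME))

if not has_lora_weight("checkpoint"):
    print("'checkpoint' has no LoRA weight: point --checkpoint_dir at a "
          "finished LoRA output directory, or drop the flag to train from scratch.")
```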
Expected behavior
accelerate launch src/train_bash.py --stage sft --model_name_or_path chatglm2-6b --do_train --dataset shuffled_output --finetuning_type lora --output_dir checkpoint-500 --per_device_train_batch_size 16 --gradient_accumulation_steps 4 --lr_scheduler_type cosine --logging_steps 10 --save_steps 1000 --learning_rate 5e-5 --num_train_epochs 500.0 --plot_loss --fp16 --checkpoint_dir checkpoint
Above are my training arguments.
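As a sanity check on these arguments: with the two GPUs visible in the log (cuda:0 and cuda:1), the effective batch size works out as in the sketch below (back-of-the-envelope, assuming plain DDP):

```python
# Effective batch size implied by the arguments above.
# Assumption: 2 GPUs, as the log shows ranks on cuda:0 and cuda:1.
per_device_train_batch_size = 16
gradient_accumulation_steps = 4
num_gpus = 2

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 128 samples per optimizer step
```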
Others
No response
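One more note on the repeated c10d warning in the Reproduction log: the client socket tries [kubernetes.docker.internal]:29500, a Docker Desktop host alias that can resolve to an unusable address on Windows. A hypothetical variant of run.py that pins the rendezvous endpoint to localhost instead, assuming accelerate's standard --main_process_ip/--main_process_port launch flags apply to this single-node setup:

```python
# Hypothetical run.py tweak: pin the rendezvous endpoint to 127.0.0.1 so c10d
# does not attempt kubernetes.docker.internal.
import subprocess

command = (
    "accelerate launch --main_process_ip 127.0.0.1 --main_process_port 29500 "
    "src/train_bash.py --stage sft --model_name_or_path chatglm2-6b --do_train "
    "--dataset shuffled_output --finetuning_type lora --output_dir checkpoint-500 "
    "--per_device_train_batch_size 16 --gradient_accumulation_steps 4 --fp16"
)
subprocess.run(command, shell=True, check=True)
```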