modelscope / ms-swift

Use PEFT or full-parameter training to finetune 350+ LLMs and 90+ MLLMs. (Qwen2.5, GLM4v, InternLM2.5, Yi, Llama3.1, LLaVA-Video, InternVL2, MiniCPM-V-2.6, DeepSeek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

RuntimeError: self and mat2 must have the same dtype #30

Closed · HowardMak closed this issue 1 year ago

HowardMak commented 1 year ago

```
(gpt) root@autodl-container-9e2911833c-bcf1743a:~/autodl-tmp/swift-main/examples/pytorch/llm# CUDA_VISIBLE_DEVICES=0 python src/llm_sft.py \
    --model_type qwen-7b --sft_type lora --dtype bf16 --output_dir runs \
    --dataset alpaca-en,alpaca-zh --dataset_sample -1 --num_train_epochs 1 \
    --max_length 1024 --quantization_bit 4 --lora_rank 64 --lora_alpha 32 \
    --lora_dropout_p 0.05 --lora_target_modules ALL --batch_size 1 \
    --weight_decay 0. --learning_rate 1e-4 --gradient_accumulation_steps 16 \
    --max_grad_norm 0.5 --warmup_ratio 0.03 --eval_steps 50 --save_steps 50 \
    --save_total_limit 2 --logging_steps 10 --use_flash_attn false \
    --push_to_hub false --hub_model_id qwen-7b-qlora --hub_private_repo true \
    --hub_token 'your-sdk-token'
2023-08-24 15:54:28,792 - modelscope - INFO - PyTorch version 2.0.1+cu118 Found.
2023-08-24 15:54:28,793 - modelscope - INFO - Loading ast index from /root/autodl-tmp/.cache/modelscope/hub/ast_indexer
2023-08-24 15:54:28,829 - modelscope - INFO - Loading done! Current index file version is 1.8.1, with md5 1f897f6541cc699224f7379a0c996b2e and a total number of 893 components indexed
```

```
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to:
https://github.com/TimDettmers/bitsandbytes/issues

bin /root/miniconda3/envs/gpt/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so
/root/miniconda3/envs/gpt/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/root/miniconda3/envs/gpt/lib/libcudart.so'), PosixPath('/root/miniconda3/envs/gpt/lib/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward. Either way, this might cause trouble in the future: If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: CUDA runtime path found: /root/miniconda3/envs/gpt/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /root/miniconda3/envs/gpt/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
2023-08-24 15:54:31,955 - swift - INFO - Setting template_type: chatml
2023-08-24 15:54:31,955 - swift - INFO - args: SftArguments(model_type='qwen-7b', sft_type='lora', template_type='chatml', output_dir='runs/qwen-7b', ddp_backend=None, seed=42, resume_from_ckpt=None, dtype='bf16', ignore_args_error=False, dataset='alpaca-en,alpaca-zh', dataset_seed=42, dataset_sample=-1, dataset_test_size=0.01, system='you are a helpful assistant!', max_length=1024, quantization_bit=4, bnb_4bit_comp_dtype='bf16', bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, lora_target_modules=['ALL'], lora_rank=64, lora_alpha=32, lora_dropout_p=0.05, gradient_checkpoint=True, batch_size=1, num_train_epochs=1, optim='adamw_torch', learning_rate=0.0001, weight_decay=0.0, gradient_accumulation_steps=16, max_grad_norm=0.5, lr_scheduler_type='cosine', warmup_ratio=0.03, eval_steps=50, save_steps=50, save_total_limit=2, logging_steps=10, push_to_hub=False, hub_model_id='qwen-7b-qlora', hub_private_repo=True, hub_strategy='every_save', hub_token='your-sdk-token', use_flash_attn=False)
device_count: 1
rank: -1, local_rank: -1, world_size: 1, local_world_size: 1
2023-08-24 15:54:31,955 - swift - INFO - Global seed set to 42
2023-08-24 15:54:31,956 - swift - INFO - quantization_config: {'load_in_8bit': False, 'load_in_4bit': True, 'llm_int8_threshold': 6.0, 'llm_int8_skip_modules': None, 'llm_int8_enable_fp32_cpu_offload': False, 'llm_int8_has_fp16_weight': False, 'bnb_4bit_quant_type': 'nf4', 'bnb_4bit_use_double_quant': True, 'bnb_4bit_compute_dtype': torch.bfloat16}
2023-08-24 15:54:32,165 - modelscope - INFO - Use user-specified model revision: v.1.0.4
2023-08-24 15:54:32,460 - swift - INFO - model_config: QWenConfig {
  "_name_or_path": "/root/autodl-tmp/.cache/modelscope/hub/qwen/Qwen-7B",
  "activation": "swiglu",
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "QWenLMHeadModel"
  ],
  "attn_pdrop": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_qwen.QWenConfig",
    "AutoModelForCausalLM": "modeling_qwen.QWenLMHeadModel"
  },
  "bf16": true,
  "bias_dropout_fusion": true,
  "bos_token_id": 151643,
  "embd_pdrop": 0.0,
  "eos_token_id": 151643,
  "ffn_hidden_size": 22016,
  "fp16": false,
  "fp32": false,
  "initializer_range": 0.02,
  "kv_channels": 128,
  "layer_norm_epsilon": 1e-06,
  "model_dir": "/root/autodl-tmp/.cache/modelscope/hub/qwen/Qwen-7B",
  "model_type": "qwen",
  "n_embd": 4096,
  "n_head": 32,
  "n_inner": null,
  "n_layer": 32,
  "n_positions": 6144,
  "no_bias": true,
  "onnx_safe": null,
  "padded_vocab_size": 151936,
  "params_dtype": "torch.bfloat16",
  "pos_emb": "rotary",
  "resid_pdrop": 0.1,
  "rotary_emb_base": 10000,
  "rotary_pct": 1.0,
  "scale_attn_weights": true,
  "seq_length": 2048,
  "tie_word_embeddings": false,
  "tokenizer_type": "QWenTokenizer",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.30.2",
  "use_cache": true,
  "use_dynamic_ntk": true,
  "use_flash_attn": false,
  "use_logn_attn": true,
  "vocab_size": 151936
}
```

```
Loading checkpoint shards: 100%|████████████████████████████████████████| 8/8 [00:15<00:00, 1.97s/it]
Using pad_token, but it is not set yet.
2023-08-24 15:54:53,136 - swift - INFO - Setting lora_target_modules: ['c_attn', 'w1', 'c_proj', 'w2']
2023-08-24 15:54:53,136 - swift - INFO - lora_config: get_wrapped_class.<locals>.PeftWrapper(peft_type=<PeftType.LORA: 'LORA'>, base_model_name_or_path=None, task_type='CAUSAL_LM', inference_mode=False, r=64, target_modules=['c_attn', 'w1', 'c_proj', 'w2'], lora_alpha=32, lora_dropout=0.05, fan_in_fan_out=False, bias='none', modules_to_save=None, init_lora_weights=True)
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.wte.weight]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.ln_1.weight]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.attn.c_attn.weight]: requires_grad=False, dtype=torch.uint8, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.attn.c_attn.bias]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.attn.c_attn.lora_A.default.weight]: requires_grad=True, dtype=torch.float32, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.attn.c_attn.lora_B.default.weight]: requires_grad=True, dtype=torch.float32, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.attn.c_proj.weight]: requires_grad=False, dtype=torch.uint8, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.attn.c_proj.lora_A.default.weight]: requires_grad=True, dtype=torch.float32, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.attn.c_proj.lora_B.default.weight]: requires_grad=True, dtype=torch.float32, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.ln_2.weight]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.mlp.w1.weight]: requires_grad=False, dtype=torch.uint8, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.mlp.w1.lora_A.default.weight]: requires_grad=True, dtype=torch.float32, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.mlp.w1.lora_B.default.weight]: requires_grad=True, dtype=torch.float32, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.mlp.w2.weight]: requires_grad=False, dtype=torch.uint8, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.mlp.w2.lora_A.default.weight]: requires_grad=True, dtype=torch.float32, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.mlp.w2.lora_B.default.weight]: requires_grad=True, dtype=torch.float32, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.mlp.c_proj.weight]: requires_grad=False, dtype=torch.uint8, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.mlp.c_proj.lora_A.default.weight]: requires_grad=True, dtype=torch.float32, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.0.mlp.c_proj.lora_B.default.weight]: requires_grad=True, dtype=torch.float32, device=cuda:0
2023-08-24 15:56:29,482 - swift - INFO - [base_model.model.transformer.h.1.ln_1.weight]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
2023-08-24 15:56:29,483 - swift - INFO - ...
2023-08-24 15:56:29,492 - swift - INFO - PeftModelForCausalLM: 4626.4525M Params (143.1306M Trainable), 1207.9596M Buffers.
2023-08-24 15:56:29,493 - modelscope - INFO - No subset_name specified, defaulting to the default
2023-08-24 15:56:30,026 - modelscope - WARNING - Reusing dataset alpaca-gpt4-data-en (/root/.cache/modelscope/hub/datasets/AI-ModelScope/alpaca-gpt4-data-en/master/data_files)
2023-08-24 15:56:30,026 - modelscope - INFO - Generating dataset alpaca-gpt4-data-en (/root/.cache/modelscope/hub/datasets/AI-ModelScope/alpaca-gpt4-data-en/master/data_files)
2023-08-24 15:56:30,026 - modelscope - INFO - Reusing cached meta-data file: /root/.cache/modelscope/hub/datasets/AI-ModelScope/alpaca-gpt4-data-en/master/data_files/66247e987561e76d71cc064cb302eb31
Downloading data files: 0it [00:00, ?it/s]
Extracting data files: 0it [00:00, ?it/s]
2023-08-24 15:56:31,834 - modelscope - INFO - No subset_name specified, defaulting to the default
2023-08-24 15:56:32,312 - modelscope - WARNING - Reusing dataset alpaca-gpt4-data-zh (/root/.cache/modelscope/hub/datasets/AI-ModelScope/alpaca-gpt4-data-zh/master/data_files)
2023-08-24 15:56:32,312 - modelscope - INFO - Generating dataset alpaca-gpt4-data-zh (/root/.cache/modelscope/hub/datasets/AI-ModelScope/alpaca-gpt4-data-zh/master/data_files)
2023-08-24 15:56:32,312 - modelscope - INFO - Reusing cached meta-data file: /root/.cache/modelscope/hub/datasets/AI-ModelScope/alpaca-gpt4-data-zh/master/data_files/d17e7f3c34d5d65c37d14ef32c78bfc3
Downloading data files: 0it [00:00, ?it/s]
Extracting data files: 0it [00:00, ?it/s]
2023-08-24 15:58:03,046 - swift - INFO - Dataset Token Length: 170.389767±111.748190, min=27.000000, max=857.000000, size=99811
```
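
Note the dtype pattern in the listing above: the quantized base weights (c_attn, c_proj, w1, w2) are stored as torch.uint8 (bitsandbytes' packed 4-bit format), the untouched LayerNorm and embedding weights are torch.bfloat16, and the injected lora_A/lora_B matrices are torch.float32. The logged lora_config is equivalent to a plain peft LoraConfig along these lines (a sketch; swift wraps it in its own PeftWrapper):

```python
from peft import LoraConfig

# Mirrors the lora_config printed in the log above.
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn", "w1", "c_proj", "w2"],
    bias="none",
)
```
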
```
2023-08-24 15:58:03,226 - swift - INFO - Dataset Token Length: 174.365709±110.360343, min=31.000000, max=557.000000, size=1009
2023-08-24 15:58:03,227 - swift - INFO - [INPUT_IDS] [151644, 8948, 198, 9330, 525, 264, 10950, 17847, 0, 151645, 198, 151644, 872, 198, 58465, 1247, 279, 2701, 11652, 311, 1281, 432, 16245, 1447, 785, 4143, 525, 12035, 911, 862, 14487, 16319, 624, 151645, 198, 151644, 77091, 198, 785, 4143, 525, 1411, 40033, 448, 27262, 323, 49819, 369, 862, 14487, 16319, 13, 151645, 151643]
2023-08-24 15:58:03,227 - swift - INFO - [INPUT] <|im_start|>system
you are a helpful assistant!<|im_end|>
<|im_start|>user
Rewrite the following sentence to make it stronger:

The students are excited about their upcoming assignment.
<|im_end|>
<|im_start|>assistant
The students are brimming with excitement and anticipation for their upcoming assignment.<|im_end|><|endoftext|>
2023-08-24 15:58:03,227 - swift - INFO - [LABLES_IDS] [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 785, 4143, 525, 1411, 40033, 448, 27262, 323, 49819, 369, 862, 14487, 16319, 13, 151645, 151643]
2023-08-24 15:58:03,227 - swift - INFO - [LABLES] [-100 * 38]The students are brimming with excitement and anticipation for their upcoming assignment.<|im_end|><|endoftext|>
2023-08-24 15:58:03,228 - swift - INFO - work_dir: /root/autodl-tmp/swift-main/examples/pytorch/llm/runs/qwen-7b/v0-20230824-155803
2023-08-24 15:58:03,231 - swift - INFO - trainer_args: Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=1,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=50,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=16,
gradient_checkpointing=True,
greater_is_better=False,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=qwen-7b-qlora,
hub_private_repo=True,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0001,
length_column_name=length,
load_best_model_at_end=True,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/root/autodl-tmp/swift-main/examples/pytorch/llm/runs/qwen-7b/v0-20230824-155803/runs/Aug24_15-58-03_autodl-container-9e2911833c-bcf1743a,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_type=cosine,
max_grad_norm=0.5,
max_steps=-1,
metric_for_best_model=loss,
mp_parameters=,
no_cuda=False,
num_train_epochs=1,
optim=adamw_torch,
optim_args=None,
output_dir=/root/autodl-tmp/swift-main/examples/pytorch/llm/runs/qwen-7b/v0-20230824-155803,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
predict_with_generate=False,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard', 'wandb'],
resume_from_checkpoint=None,
run_name=/root/autodl-tmp/swift-main/examples/pytorch/llm/runs/qwen-7b/v0-20230824-155803,
save_on_each_node=False,
save_safetensors=False,
save_steps=50,
save_strategy=steps,
save_total_limit=2,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
sortish_sampler=True,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
2023-08-24 15:58:03,755 - swift - INFO - Model file config.json is different from the latest version v1.0.5, This is because you are using an older version or the file is updated manually.
  0%|          | 0/6238 [00:00<?, ?it/s]
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
Traceback (most recent call last):
  File "/root/autodl-tmp/swift-main/examples/pytorch/llm/src/llm_sft.py", line 323, in <module>
    llm_sft(args)
  File "/root/autodl-tmp/swift-main/examples/pytorch/llm/src/llm_sft.py", line 301, in llm_sft
    trainer.train(trainer_args.resume_from_checkpoint)
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/transformers/trainer.py", line 2759, in training_step
    loss = self.compute_loss(model, inputs)
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/transformers/trainer.py", line 2784, in compute_loss
    outputs = model(**inputs)
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/peft/peft_model.py", line 678, in forward
    return self.base_model(
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/autodl-tmp/.cache/huggingface/hub/modules/transformers_modules/Qwen-7B/modeling_qwen.py", line 925, in forward
    transformer_outputs = self.transformer(
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/autodl-tmp/.cache/huggingface/hub/modules/transformers_modules/Qwen-7B/modeling_qwen.py", line 756, in forward
    outputs = torch.utils.checkpoint.checkpoint(
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/root/autodl-tmp/.cache/huggingface/hub/modules/transformers_modules/Qwen-7B/modeling_qwen.py", line 752, in custom_forward
    return module(*inputs, use_cache, output_attentions)
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/autodl-tmp/.cache/huggingface/hub/modules/transformers_modules/Qwen-7B/modeling_qwen.py", line 523, in forward
    attn_outputs = self.attn(
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/autodl-tmp/.cache/huggingface/hub/modules/transformers_modules/Qwen-7B/modeling_qwen.py", line 367, in forward
    mixed_x_layer = self.c_attn(hidden_states)
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/gpt/lib/python3.10/site-packages/peft/tuners/lora.py", line 565, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: self and mat2 must have the same dtype
```

Jintao-Huang commented 1 year ago

Hi, I was not able to reproduce your issue here. Could you try updating the code and running it again? Could it be that bnb doesn't support your setup? Also, may I ask which GPU model you are using?
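
For anyone answering the GPU question, a quick way to collect the relevant details (a minimal sketch; note the setup log above shows bitsandbytes loading its cuda113 binary against PyTorch 2.0.1+cu118, a version mismatch worth ruling out as well):

```python
import torch
import bitsandbytes
import transformers
import peft

print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("gpu:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))  # bf16 needs >= (8, 0)
print("bitsandbytes:", bitsandbytes.__version__)
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
```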