kongds / scaling_sentemb

Scaling Sentence Embeddings with Large Language Models

Error during evaluation after fine-tuning llama-7b with the provided script #9

Closed guankaisi closed 7 months ago

guankaisi commented 7 months ago

My transformers version is 4.31.0. After fine-tuning the llama-7b model with QLoRA using the train_llm.sh script, I ran bash eval_checkpoints.sh llama-7b-lora to evaluate it and got the following error:

 llama-7b-lora/checkpoint-100

/data4/caoqian/pretrained_models/llama-2-7b-hf

===================================BUG REPORT===================================

================================================================================
bin /home/caoqian/anaconda3/envs/llama2cn/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/home/caoqian/anaconda3/envs/llama2cn/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/home/caoqian/anaconda3/envs/llama2cn/lib/libcudart.so.11.0'), PosixPath('/home/caoqian/anaconda3/envs/llama2cn/lib/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: WARNING! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!
CUDA SETUP: CUDA runtime path found: /home/caoqian/anaconda3/envs/llama2cn/lib/libcudart.so.11.0
/home/caoqian/anaconda3/envs/llama2cn/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: No GPU detected! Check your CUDA paths. Proceeding to load CPU-only library...
  warn(msg)
CUDA SETUP: Loading binary /home/caoqian/anaconda3/envs/llama2cn/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
[2023-12-01 11:46:26,872] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2023-12-01 11:46:27,184 : Popen(['git', 'version'], cwd=/data4/kaisi/scaling_sentemb-main, universal_newlines=False, shell=None, istream=None)
2023-12-01 11:46:27,187 : Popen(['git', 'version'], cwd=/data4/kaisi/scaling_sentemb-main, universal_newlines=False, shell=None, istream=None)
2023-12-01 11:46:27,198 : Trying paths: ['/home/kaisi/.docker/config.json', '/home/kaisi/.dockercfg']
2023-12-01 11:46:27,198 : No config file found
2023-12-01 11:46:27,233 : [Tracing] Create new propagation context: {'trace_id': 'fbd9a337e1724006987c66aa2d40b4e0', 'span_id': '9665f6002883f92e', 'parent_span_id': None, 'dynamic_sampling_context': None}
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.52s/it]
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /data4/caoqian/pretrained_models/llama-2-7b-hf and are newly initialized: ['model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2023-12-01 11:46:37,181 : Note: detected 192 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2023-12-01 11:46:37,181 : Note: NumExpr detected 192 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2023-12-01 11:46:37,181 : NumExpr defaulting to 8 threads.
2023-12-01 11:47:11,188 : 

***** Transfer task : STSBenchmark*****

Processing dev:   0%|                                                                                                                                    | 0/47 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/data4/kaisi/scaling_sentemb-main/evaluation.py", line 373, in <module>
    main()
  File "/data4/kaisi/scaling_sentemb-main/evaluation.py", line 280, in main
    result = se.eval(task)
  File "/data4/kaisi/scaling_sentemb-main/./SentEval/senteval/engine.py", line 129, in eval
    self.results = self.evaluation.run(self.params, self.batcher)
  File "/data4/kaisi/scaling_sentemb-main/./SentEval/senteval/sts.py", line 76, in run
    enc1 = batcher(params, batch1)
  File "/data4/kaisi/scaling_sentemb-main/evaluation.py", line 217, in batcher
    hidden_states = model(output_hidden_states=True, return_dict=True, **batch).hidden_states
  File "/home/kaisi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kaisi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/caoqian/anaconda3/envs/llama2cn/lib/python3.10/site-packages/peft/peft_model.py", line 946, in forward
    return self.base_model(
  File "/home/kaisi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kaisi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/caoqian/anaconda3/envs/llama2cn/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 94, in forward
    return self.model.forward(*args, **kwargs)
  File "/home/caoqian/anaconda3/envs/llama2cn/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/caoqian/anaconda3/envs/llama2cn/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 824, in forward
    logits = self.lm_head(hidden_states)
  File "/home/kaisi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kaisi/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/caoqian/anaconda3/envs/llama2cn/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/kaisi/.local/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half

This looks like a conflict between float32 and float16, but I fine-tuned exactly as your script specifies and changed nothing. How can I resolve this?
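For readers hitting the same RuntimeError, the following is a minimal sketch of how this kind of dtype mismatch arises in a linear layer and how to inspect the dtypes involved. It is not taken from the repository: `dtype_summary`, the layer sizes, and the stand-in `lm_head` are illustrative assumptions, and upcasting the layer at the end only demonstrates that matching dtypes removes the error. It is not the fix this thread arrived at.

```python
from collections import Counter

import torch
import torch.nn as nn


def dtype_summary(module: nn.Module) -> Counter:
    """Count the parameters of a module by dtype (hypothetical helper)."""
    return Counter(str(p.dtype) for p in module.parameters())


# Stand-in for lm_head: a linear layer whose weight is stored in float16.
lm_head = nn.Linear(8, 4, bias=False).half()
# Stand-in for hidden states that reach the layer in float32.
hidden_states = torch.randn(2, 8, dtype=torch.float32)

print("layer dtypes:", dtype_summary(lm_head))
print("input dtype:", hidden_states.dtype)

# A float32 input against a float16 weight is rejected by the matmul.
try:
    lm_head(hidden_states)
    mismatch_error = None
except RuntimeError as err:
    mismatch_error = str(err)
print("mismatch error:", mismatch_error)

# Once both sides share a dtype, the call succeeds. The layer is upcast to
# float32 here only because that runs on CPU in every torch version.
logits = lm_head.float()(hidden_states)
print("output:", logits.dtype, tuple(logits.shape))
```

Running `dtype_summary` on the loaded model just before the failing forward pass is one way to see whether some modules ended up in float32 while others stayed in float16.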

kongds commented 7 months ago

Please try setting up the environment with pip install -r requirements.txt first.

guankaisi commented 7 months ago

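My GPU is an H100, and the torch 2.0.0 in your environment is not compatible with it, so I had set up my own environment. After trying different combinations, I found that torch 2.0.0+cu118 with bitsandbytes 0.41.2.post2 runs correctly on the H100.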
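A small sketch for comparing a local environment against the combination reported above. It uses only the standard library; the helper names `installed_version` and `compare_environment` are hypothetical, and the version strings are copied from this comment rather than from the repository's requirements.txt.

```python
from importlib import metadata

# Combination reported in this thread as working on an H100.
REPORTED_VERSIONS = {
    "torch": "2.0.0+cu118",
    "bitsandbytes": "0.41.2.post2",
}


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_environment(expected: dict) -> dict:
    """Map each package to (wanted, found, matches)."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        report[package] = (wanted, found, found == wanted)
    return report


for package, (wanted, found, ok) in compare_environment(REPORTED_VERSIONS).items():
    status = "ok" if ok else "differs"
    print(f"{package}: wanted {wanted}, found {found} ({status})")
```

A "differs" line does not by itself mean the setup is broken; it only flags where the local environment departs from the one reported to work here.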