shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical large language models, implementing incremental pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0
3.34k stars · 499 forks

Stage 3: Reward Modeling raises **ValueError: weight is on the meta device, we need a `value` to put in on 1.** #35

Closed dage0127 closed 1 year ago

dage0127 commented 1 year ago

Describe the Question

Following the steps in run_training_pipeline.ipynb, Stage 1 and Stage 2 both ran fine, but Stage 3 (Reward Model training) fails with the error below. Please help.

Error: ValueError: weight is on the meta device, we need a value to put in on 1.

Command used:

```shell
python reward_modeling.py \
    --model_type bloom \
    --model_name_or_path merged-sft \
    --train_file_dir ./data/reward \
    --validation_file_dir ./data/reward \
    --per_device_train_batch_size 3 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --use_peft True \
    --seed 42 \
    --max_train_samples 1000 \
    --max_eval_samples 10 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.05 \
    --weight_decay 0.001 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --max_source_length 256 \
    --max_target_length 256 \
    --output_dir outputs-rm-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float32 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --remove_unused_columns False \
    --gradient_checkpointing True
```

Error log:

```
2023-06-26 15:01:25.403 | WARNING | main:main:358 - Process rank: -1, device: cuda:0, n_gpu: 2 distributed training: False, 16-bits training: False
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Some weights of the model checkpoint at merged-sft were not used when initializing BloomForSequenceClassification: ['lm_head.weight']
```
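For context on the traceback: the "meta device" refers to PyTorch meta tensors, which carry shape and dtype metadata but no actual storage. When a model is dispatched across devices (e.g. with `device_map auto`), weights initialized on the meta device must be materialized with real values before they can be placed on a GPU; the ValueError means one weight never was. A minimal sketch of what a meta tensor is (illustrative only, not the repository's code):

```python
import torch

# A tensor on the "meta" device records shape and dtype but allocates no data.
t = torch.empty(3, 4, device="meta")
print(t.is_meta, tuple(t.shape))

# Moving it to a real device fails, because there is no data to copy out:
try:
    t.to("cpu")
except Exception as e:
    print(type(e).__name__)
```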

Describe your attempts

shibing624 commented 1 year ago

Try it on Colab with a T4 first; if it works there, install exactly the same environment as the Colab notebook. This is usually a dependency version problem.

dage0127 commented 1 year ago

Thanks, it's solved. It was indeed a dependency version problem.

apple2333cream commented 1 year ago

Hi, how did you solve it? Which dependency packages did you install?

dage0127 commented 1 year ago

Try installing the versions pinned in requirements.txt: `pip install -r requirements.txt`

```
loguru
transformers>=4.30.1
sentencepiece
datasets
tensorboard
tqdm>=4.47.0
peft>=0.3.0
accelerate>=0.20.3
trl
```
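Since the root cause was a version mismatch, a quick way to compare a local environment against a known-good one (such as the Colab notebook) is to print the installed version of each dependency. A minimal stdlib-only sketch (the package list mirrors requirements.txt; the helper name is my own):

```python
# Print installed versions of the pinned dependencies so they can be
# compared against a working environment (e.g. the Colab notebook).
from importlib.metadata import version, PackageNotFoundError

REQUIRED = [
    "loguru", "transformers", "sentencepiece", "datasets",
    "tensorboard", "tqdm", "peft", "accelerate", "trl",
]

def installed_version(pkg: str) -> str:
    """Return the installed version string, or 'not installed'."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return "not installed"

if __name__ == "__main__":
    for pkg in REQUIRED:
        print(f"{pkg:>15}: {installed_version(pkg)}")
```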