shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical large language models, implementing incremental pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0
3.34k stars · 499 forks

Stage 3: Reward Modeling raises **ValueError: weight is on the meta device, we need a `value` to put in on 1.** #35

Closed dage0127 closed 1 year ago

dage0127 commented 1 year ago

Describe the Question

Following the steps in run_training_pipeline.ipynb, Stage 1 and Stage 2 both ran fine, but Stage 3 (Reward Model training) fails with the error below. Please help.

Error: ValueError: weight is on the meta device, we need a value to put in on 1.

Command used:

```shell
python reward_modeling.py \
    --model_type bloom \
    --model_name_or_path merged-sft \
    --train_file_dir ./data/reward \
    --validation_file_dir ./data/reward \
    --per_device_train_batch_size 3 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --use_peft True \
    --seed 42 \
    --max_train_samples 1000 \
    --max_eval_samples 10 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.05 \
    --weight_decay 0.001 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --max_source_length 256 \
    --max_target_length 256 \
    --output_dir outputs-rm-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float32 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --remove_unused_columns False \
    --gradient_checkpointing True
```

Error log:

```
2023-06-26 15:01:25.403 | WARNING | main:main:358 - Process rank: -1, device: cuda:0, n_gpu: 2 distributed training: False, 16-bits training: False
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Some weights of the model checkpoint at merged-sft were not used when initializing BloomForSequenceClassification: ['lm_head.weight']
```
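For context on the traceback: the "meta device" refers to PyTorch meta tensors, which carry shape and dtype metadata but no actual storage. When a model is dispatched across devices (e.g. with `device_map auto`), weights initialized on the meta device must be materialized with real values before they can be placed on a GPU; the ValueError means one weight never was. A minimal sketch of what a meta tensor is (illustrative only, not the repository's code):

```python
import torch

# A tensor on the "meta" device records shape and dtype but allocates no data.
t = torch.empty(3, 4, device="meta")
print(t.is_meta, tuple(t.shape))

# Moving it to a real device fails, because there is no data to copy out:
try:
    t.to("cpu")
except Exception as e:
    print(type(e).__name__)
```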

Describe your attempts

shibing624 commented 1 year ago

Try it on Colab with a T4 first; if it works there, install exactly the same environment as the Colab notebook. This is usually a dependency version problem.

dage0127 commented 1 year ago

Thanks, it's solved. It was indeed a dependency version problem.

apple2333cream commented 1 year ago

Hi, how did you solve it? Which dependency packages did you install?

dage0127 commented 1 year ago

Try installing the versions pinned in requirements.txt: `pip install -r requirements.txt`

```
loguru
transformers>=4.30.1
sentencepiece
datasets
tensorboard
tqdm>=4.47.0
peft>=0.3.0
accelerate>=0.20.3
trl
```
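Since the root cause was a version mismatch, a quick way to compare a local environment against a known-good one (such as the Colab notebook) is to print the installed version of each dependency. A minimal stdlib-only sketch (the package list mirrors requirements.txt; the helper name is my own):

```python
# Print installed versions of the pinned dependencies so they can be
# compared against a working environment (e.g. the Colab notebook).
from importlib.metadata import version, PackageNotFoundError

REQUIRED = [
    "loguru", "transformers", "sentencepiece", "datasets",
    "tensorboard", "tqdm", "peft", "accelerate", "trl",
]

def installed_version(pkg: str) -> str:
    """Return the installed version string, or 'not installed'."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return "not installed"

if __name__ == "__main__":
    for pkg in REQUIRED:
        print(f"{pkg:>15}: {installed_version(pkg)}")
```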