Closed dage0127 closed 1 year ago
Try it on Colab with a T4 first; if it works there, install exactly the same environment as Colab. This is usually a dependency version problem.
Thanks, it's solved. It was indeed a dependency version problem.
Hi, how did you solve it? Which dependency packages did you install?
Try installing exactly what requirements.txt specifies: pip install -r requirements.txt
loguru
transformers>=4.30.1
sentencepiece
datasets
tensorboard
tqdm>=4.47.0
peft>=0.3.0
accelerate>=0.20.3
trl
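Since the thread points to dependency versions as the culprit, a quick way to compare what is installed against the minimums above is a small stdlib-only check. This is an illustrative sketch, not part of the repository; the `as_tuple` helper and the `MINIMUMS` table are my own, taken from the requirements list in this thread.

```python
# Sketch: check installed versions against the minimums from requirements.txt.
from importlib.metadata import version, PackageNotFoundError

# Minimum versions copied from the requirements list above (assumption:
# unpinned packages like loguru/trl are omitted here for brevity).
MINIMUMS = {
    "transformers": "4.30.1",
    "tqdm": "4.47.0",
    "peft": "0.3.0",
    "accelerate": "0.20.3",
}

def as_tuple(v: str):
    """Turn '4.30.1' into (4, 30, 1) for a simple numeric comparison."""
    return tuple(int(p) for p in v.split(".")[:3] if p.isdigit())

for pkg, minimum in MINIMUMS.items():
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        print(f"{pkg}: NOT INSTALLED (need >= {minimum})")
        continue
    ok = as_tuple(installed) >= as_tuple(minimum)
    print(f"{pkg}: {installed} ({'ok' if ok else f'need >= {minimum}'})")
```

Note the comparison is deliberately naive (it ignores pre-release suffixes); for anything stricter, `pip install -r requirements.txt` remains the reliable fix.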
Describe the Question
Following the steps in run_training_pipeline.ipynb, Stage 1 and Stage 2 both run fine. Stage 3, RM (Reward Model) training, fails with the error below; please help.
Error: ValueError: weight is on the meta device, we need a `value` to put in on {device}.

1. Command used:

python reward_modeling.py \
    --model_type bloom \
    --model_name_or_path merged-sft \
    --train_file_dir ./data/reward \
    --validation_file_dir ./data/reward \
    --per_device_train_batch_size 3 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --use_peft True \
    --seed 42 \
    --max_train_samples 1000 \
    --max_eval_samples 10 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.05 \
    --weight_decay 0.001 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --max_source_length 256 \
    --max_target_length 256 \
    --output_dir outputs-rm-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float32 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --remove_unused_columns False \
    --gradient_checkpointing True
Error message:
2023-06-26 15:01:25.403 | WARNING | __main__:main:358 - Process rank: -1, device: cuda:0, n_gpu: 2 distributed training: False, 16-bits training: False
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Some weights of the model checkpoint at merged-sft were not used when initializing BloomForSequenceClassification: ['lm_head.weight']
...
    raise ValueError(f"weight is on the meta device, we need a `value` to put in on {device}.")
ValueError: weight is on the meta device, we need a `value` to put in on {device}.

Describe your attempts
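For context on what the error above means, here is a minimal PyTorch sketch of the "meta" device (illustrative only, not code from the repository). With `--device_map auto`, weights are first created as meta tensors and later filled in from the checkpoint; the ValueError fires when a weight is still on the meta device at placement time, which can happen with mismatched accelerate/transformers/peft versions.

```python
# Sketch: a "meta" tensor has shape and dtype but no storage, so there are
# no values that could be copied onto a real device.
import torch

w = torch.empty(2, 3, device="meta")
print(w.device)   # meta
print(w.is_meta)  # True
print(w.shape)    # torch.Size([2, 3])

# device_map="auto" loading builds the model skeleton from tensors like this
# and materializes real weights afterwards; a weight left on the meta device
# at dispatch time triggers the ValueError quoted in the log above.
```

This is also why a plain dependency upgrade fixed it in this thread: the meta-device materialization path changed across accelerate/transformers releases.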