shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical large language models, implementing incremental pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0
3.24k stars, 492 forks

About the error ValueError(f"{tensor_name} is on the meta device, we need a `value` to put in on {device}.") ValueError: weight is on the meta device, we need a `value` to put in on cpu. #272

Closed waycup7 closed 8 months ago

waycup7 commented 10 months ago

A question: when I try to run reward_modeling with llama-13b, I get this error message: raise ValueError(f"{tensor_name} is on the meta device, we need a `value` to put in on {device}.") ValueError: weight is on the meta device, we need a `value` to put in on cpu.

I tried 4-bit quantization, which gets past this error, but then rl_training fails when loading the merged reward model: raise RuntimeError('Error(s) in loading state_dict:While copying the parameter named "model.layers.19.self_attn.o_proj.weight", whose dimensions in the model are torch.Size([5120, 5120]) and whose dimensions in the checkpoint are torch.Size([5120, 5120]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).....

shibing624 commented 10 months ago

Not enough GPU memory.
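
To put the reply above in numbers, here is a rough back-of-the-envelope sketch (not part of the original thread) of how much memory the weights alone of a 13B-parameter model occupy at different precisions. The parameter count of 13e9 is a nominal assumption, and the helper `weight_memory_gib` is a hypothetical name used only for illustration. The estimate ignores activations, gradients, optimizer state, and framework overhead, so real requirements are higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed just to hold the weights, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: a "13b" model has roughly 13e9 parameters.
NUM_PARAMS = 13e9

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    gib = weight_memory_gib(NUM_PARAMS, bits)
    print(f"{label:>9}: ~{gib:.1f} GiB for weights only")
```

Under these assumptions the weights come to roughly 48 GiB in fp32, 24 GiB in fp16, and 6 GiB at 4 bits, which is consistent with 4-bit quantization getting past the first error on hardware where full-precision loading does not fit.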