microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] RuntimeError: The size of tensor a (3072) must match the size of tensor b (4096) at non-singleton dimension 0 #4302

Open 4daJKong opened 1 year ago

4daJKong commented 1 year ago

Describe the bug
I use facebook/opt-350m as the actor model and opt-125m as the critic model, and I successfully finished step 1 and step 2. When I evaluate them with the eval.py provided in evaluation_scripts, the actor model from step 1 returns a proper SFT result.

Besides, in step 2 I trained for 5 epochs; here is some of the log:

Epoch 5/5 with loss 0.45925239894701086
***** Evaluating reward, Epoch 5/5 *****
Invalidate trace cache @ step 0: expected module 0, but got module 14
chosen_last_scores (higher is better) : 1.9252854585647583, acc (higher is better) : 0.44843748211860657

When I run eval.py to evaluate it, it shows:

=============Scores (higher, better)========================
good_ans score:  3.6553025245666504
bad_ans score:  -8.21585464477539

So I guess there is no fatal error here...

After that, I continued on to step 3, but when I add --enable_hybrid_engine, it shows:

in _fuse_lora
    weight.data += lora_scaling * torch.matmul(lora_left_weight.t(), lora_right_weight.t())
RuntimeError: The size of tensor a (3072) must match the size of tensor b (4096) at non-singleton dimension 0
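For context, here is a minimal sketch of the shape constraint that line enforces (this is not DeepSpeed's actual _fuse_lora code, and the dimensions are illustrative, not taken from my run): the fused LoRA product must have exactly the same shape as the base weight, so if the weight handed to the fusion step has a different leading dimension than the one the LoRA adapters were created for, the in-place add fails with exactly this message.

import torch

# Illustrative shapes only: the LoRA update produces a [4096, hidden] tensor
# while the base weight is [3072, hidden].
hidden, lora_dim = 768, 128
lora_scaling = 1.0

weight = torch.zeros(3072, hidden)                 # base weight ("tensor a")
lora_right_weight = torch.zeros(hidden, lora_dim)  # adapters sized for a 4096-row weight
lora_left_weight = torch.zeros(lora_dim, 4096)

# Same pattern as the failing line: the product has shape [4096, hidden]
# ("tensor b"), so adding it in place to a [3072, hidden] weight fails.
try:
    weight.data += lora_scaling * torch.matmul(lora_left_weight.t(), lora_right_weight.t())
except RuntimeError as e:
    print(e)  # The size of tensor a (3072) must match the size of tensor b (4096) at non-singleton dimension 0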

I tried to add

self.actor_model.empty_partition_cache()
self.critic_model.empty_partition_cache()

in the train_rlhf function in ppo_trainer.py, just before the return value,

or

setting ACTOR_ZERO_STAGE and CRITIC_ZERO_STAGE to 0, but neither of them worked.
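For reference, this is roughly where I put the calls; only a sketch of the tail of the method is shown, and _run_ppo_update below is just a hypothetical stand-in for the existing update logic, not a real DeepSpeed-Chat method.

# Sketch of my edit to train_rlhf in ppo_trainer.py (tail of the method only).
def train_rlhf(self, inputs):
    # stand-in for the unmodified PPO update (actor/critic forward, backward, step)
    actor_loss, critic_loss = self._run_ppo_update(inputs)

    # attempted workaround: flush the ZeRO partition caches before returning
    self.actor_model.empty_partition_cache()
    self.critic_model.empty_partition_cache()

    return actor_loss, critic_loss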

So I had to remove --enable_hybrid_engine, but the training log (I attached it together with the actor model config below: training.log, config.txt) looks like this:

Invalidate trace cache @ step 0: expected module 14, but got module 0
Invalidate trace cache @ step 210: expected module 690, but got module 689
epoch: 0|step: 1903|ppo_ep: 1|act_loss: 0.03200531005859375|cri_loss: 0.046112060546875|unsuper_loss: 0.0
average reward score: -1.482421875

I did not see this kind of message when I checked other people's training results on Hugging Face:

|E2E latency=3.26s |Gather latency=0.00s (0.00%) |Generate time=2.43s (74.51%) |Training time=0.64s (19.62%) |Others=0.19 (5.87%)|CurSamplesPerSec=2.45 |AvgSamplesPerSec=2.35
epoch: 0|step: 3682|ppo_ep: 1|act_loss: -0.002838134765625|cri_loss: 0.016815185546875|unsuper_loss: 0.0
average reward score: 4.2421875

I was wondering whether the --enable_hybrid_engine parameter causes this problem; if so, how can I fix it? If not, why does my trained actor model give such a weird result? This is how I load and run it:

from transformers import AutoConfig, AutoTokenizer, OPTForCausalLM, pipeline

def get_generator(path):
    tokenizer = AutoTokenizer.from_pretrained(path, fast_tokenizer=True)
    tokenizer.pad_token = tokenizer.eos_token
    model_config = AutoConfig.from_pretrained(path)
    # load the trained actor in fp16
    model = OPTForCausalLM.from_pretrained(path,
                                           from_tf=bool(".ckpt" in path),
                                           config=model_config).half()
    generator = pipeline("text-generation",
                         model=model,
                         tokenizer=tokenizer,
                         device="cuda:0")
    return generator

path_step3_actor = "/data/coco/deepspeed_chat/output/output_step3_2/actor"
prompt_1 = "Do you know Microsoft?"
generator = get_generator(path_step3_actor)
output = generator(prompt_1, max_length=256)

It shows:

[{'generated_text': 'Do you know Microsoft?ekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrekrek...'}]

4daJKong commented 1 year ago

NVIDIA GPU: T4

NVIDIA Driver Version: 515.105.01

CUDA Version: 11.7

CUDNN Version: 8.9.4.25_cuda11

Operating System: CentOS Linux release 7.6.1810

Python Version (if applicable): 3.8.17

PyTorch Version (if applicable): 2.0.1