microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

The reward in step3 seems to be completely random without any noticeable increase. #489

Open · laoda513 opened this issue 1 year ago

laoda513 commented 1 year ago

I am testing the 1.3B training. Steps 1 and 2 have already completed successfully, but the reward does not change during step 3.

I used LoRA to train for one iteration, and the results of steps 1 and 2 are as follows:

step 1: ppl: 2.18959641456604

step 2: [screenshot of the step 2 results]

step 3: [screenshot of the step 3 results]

I had ChatGPT extract the logs for step 3 and compare them with the demo logs provided in the project. I found that the absolute value of my loss is significantly smaller, and the reward seems to be completely random, without any noticeable increase.

[screenshots: my step 3 loss and reward logs alongside the project's demo logs]

puyuanOT commented 1 year ago

My reward even seems to be decreasing, despite the decrease in loss.

[W&B charts: reward and loss curves, 07/05/2023]

laoda513 commented 1 year ago

@puyuanOT OK, I got the solution: try disabling the hybrid engine. With it enabled, the model always repeats 'a a a a a'; I'm not sure why.
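
In case it helps, here is a minimal sketch of what that change looks like in the DeepSpeed config used for step 3. Only the hybrid_engine block matters here (it mirrors the one in the reproduction script later in this thread); the surrounding keys are just placeholders for whatever your step-3 script already builds:

# Sketch only: turn the hybrid engine off in the actor's DeepSpeed config.
# All keys besides 'hybrid_engine' are illustrative placeholders.
ds_config = {
    'train_micro_batch_size_per_gpu': 4,
    'zero_optimization': {'stage': 3},
    'fp16': {'enabled': True},
    'hybrid_engine': {'enabled': False},  # workaround: generation no longer degenerates into 'a a a a a'
}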

puyuanOT commented 1 year ago

@puyuanOT OK, I got the solution: try disabling the hybrid engine. With it enabled, the model always repeats 'a a a a a'; I'm not sure why.

Thanks a lot! Will try it out.

puyuanOT commented 1 year ago

Perhaps it's related to this PR https://github.com/microsoft/DeepSpeedExamples/pull/470?

laoda513 commented 1 year ago

That's another bug, I think.

REIGN12 commented 1 year ago

@puyuanOT OK, I got the solution: try disabling the hybrid engine. With it enabled, the model always repeats 'a a a a a'; I'm not sure why.

I also ran into this problem and have no idea why it is happening...

laoda513 commented 1 year ago

I opened a new issue to track this: #503

yaozhewei commented 1 year ago

Thank you for letting us know. We are now investigating whether HE has any unexpected behavior.

beichengus commented 1 year ago

Thank you for letting us know. We are now investigating whether HE has any unexpected behavior.

@yaozhewei I also encountered the same issue with deepspeed==0.9.0 and deepspeed==0.9.1. It can be reproduced with a very simple script; I hope this helps :) If there is any progress, could you please let me know?

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import deepspeed

# Load the model and a slow tokenizer; left padding is needed for batched generation.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-6.7b")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b", use_fast=False)
tokenizer.padding_side = 'left'

# ZeRO stage-3 config with the hybrid engine enabled and inference_tp_size=8.
ds_config = {
    'train_micro_batch_size_per_gpu': 4,
    'steps_per_print': 10,
    'zero_optimization': {'stage': 3, 'offload_param': {'device': 'none'},
                          'offload_optimizer': {'device': 'none'},
                          'stage3_param_persistence_threshold': 10000.0,
                          'stage3_max_live_parameters': 30000000.0,
                          'stage3_prefetch_bucket_size': 30000000.0,
                          'memory_efficient_linear': False},
    'fp16': {'enabled': True, 'loss_scale_window': 100},
    'gradient_clipping': 1.0,
    'prescale_gradients': False,
    'wall_clock_breakdown': False,
    'hybrid_engine': {'enabled': True, 'inference_tp_size': 8,
                      'release_inference_cache': False, 'pin_parameters': True,
                      'tp_gather_partition_size': 8},
}
engine, *_ = deepspeed.initialize(model=model, config=ds_config)
engine.eval()

# Generate from two simple prompts and print the decoded outputs.
sent = ["Human: List five action models\n\nAssistant: ", "Human: hello\n\nAssistant: "]
inputs = tokenizer(sent, padding=True, return_tensors='pt')
inputs = inputs.to(model.device)
gen_kwargs = {"max_length": 512}
output = engine.module.generate(inputs["input_ids"], **gen_kwargs)
torch.cuda.synchronize()
for o in output:
    response = tokenizer.decode(o)
    print(response)

This script uses the opt-6.7b model to generate text. When I turn HE off, or turn HE on with inference_tp_size set to 1, the results match my expectations. However, when I turn HE on with inference_tp_size greater than 1 (such as 2 or 8), the output is just repeated '(' characters, as shown in the screenshot below.

[screenshot: generations consisting only of repeated '(' characters]
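
For comparison, either of the hybrid-engine settings below gave me normal generations with the script above (everything else unchanged). This is just a sketch of the two configurations I tried, not a fix for the underlying bug:

# Variant (a): hybrid engine disabled entirely.
he_disabled = {'enabled': False}

# Variant (b): hybrid engine enabled, but without tensor parallelism.
he_no_tp = {'enabled': True, 'inference_tp_size': 1,  # tp_size > 1 is what triggers the '((((' output
            'release_inference_cache': False, 'pin_parameters': True,
            'tp_gather_partition_size': 8}

# Drop either dict in as ds_config['hybrid_engine'] in the script above.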

This is the testing environment I used:

transformers==4.30.0.dev0
deepspeed==0.9.0

AlisonWen commented 1 year ago

@yaozhewei Same issue when training Llama: steps 1 and 2 are normal, but step 3 just won't converge.