Open laoda513 opened 1 year ago
My reward even seems to be decreasing, despite the decrease in loss.
@puyuanOT OK, I found a workaround: try disabling the hybrid engine. With it enabled, the model always repeats 'a a a a a'; I'm not sure of the reason.
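In case it helps anyone else, "disabling the hybrid engine" in my setup just means turning off the hybrid_engine block in the DeepSpeed config passed to deepspeed.initialize. This is only a sketch of the relevant keys (the surrounding values are placeholders from my run, not a recommended config); if you launch through the DeepSpeed-Chat step-3 script, I believe dropping the --enable_hybrid_engine flag has the same effect.

# Sketch: only the 'hybrid_engine' block matters here; other keys are placeholders.
ds_config = {
    'train_micro_batch_size_per_gpu': 4,
    'fp16': {'enabled': True},
    'zero_optimization': {'stage': 3},
    'hybrid_engine': {
        'enabled': False,              # <- disable the hybrid engine
        'inference_tp_size': 1,
        'release_inference_cache': False,
        'pin_parameters': True,
        'tp_gather_partition_size': 1,
    },
}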
Thanks a lot! Will try it out.
Perhaps it's related to this PR https://github.com/microsoft/DeepSpeedExamples/pull/470?
That's another bug, I think.
I also hit this problem and have no idea why it is happening...
I opened a new issue to track this: #503
Thank you for letting us know. We are now investigating whether HE has any unexpected behavior.
@yaozhewei I also encountered the same issue with deepspeed==0.9.0 and deepspeed==0.9.1. It can be reproduced with a very simple script. I hope this helps :) If there is any progress, could you please let me know?
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import trace
import deepspeed

tracer = trace.Trace(count=True, trace=True)

# Load OPT-6.7B and its (slow) tokenizer with left padding for generation.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-6.7b")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b", use_fast=False)
tokenizer.padding_side = 'left'

ds_config = {
    'train_micro_batch_size_per_gpu': 4, 'steps_per_print': 10,
    'zero_optimization': {'stage': 3, 'offload_param': {'device': 'none'},
                          'offload_optimizer': {'device': 'none'},
                          'stage3_param_persistence_threshold': 10000.0,
                          'stage3_max_live_parameters': 30000000.0,
                          'stage3_prefetch_bucket_size': 30000000.0,
                          'memory_efficient_linear': False},
    'fp16': {'enabled': True, 'loss_scale_window': 100},
    'gradient_clipping': 1.0, 'prescale_gradients': False, 'wall_clock_breakdown': False,
    # Generation is only wrong when the hybrid engine runs with inference_tp_size > 1.
    'hybrid_engine': {'enabled': True, 'inference_tp_size': 8,
                      'release_inference_cache': False, 'pin_parameters': True,
                      'tp_gather_partition_size': 8},
}

engine, *_ = deepspeed.initialize(model=model, config=ds_config)
engine.eval()

sent = ["Human: List five action models\n\nAssistant: ", "Human: hello\n\nAssistant: "]
inputs = tokenizer(sent, padding=True, return_tensors='pt')
inputs = inputs.to(model.device)

gen_kwargs = {"max_length": 512}
output = engine.module.generate(inputs["input_ids"], **gen_kwargs)
torch.cuda.synchronize()

for o in output:
    response = tokenizer.decode(o)
    print(response)
This script uses the opt-6.7b model to generate predictions. When I turn HE off, or turn it on with an inference_tp_size of 1, the results match my expectations. However, if I turn HE on with an inference_tp_size greater than 1 (such as 2 or 8), the predicted output is '((((' repeated, as shown in the figure below.
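Put differently, with everything else in the script above unchanged, the output quality flips purely on the tensor-parallel size. A sketch of the two hybrid_engine variants I compared (I change tp_gather_partition_size together with inference_tp_size; I have not isolated whether that matters):

# inference_tp_size == 1  -> generations look normal
# inference_tp_size > 1   -> the model emits '((((' repeatedly
hybrid_engine_ok = {'enabled': True, 'inference_tp_size': 1,
                    'release_inference_cache': False, 'pin_parameters': True,
                    'tp_gather_partition_size': 1}
hybrid_engine_broken = {'enabled': True, 'inference_tp_size': 8,
                        'release_inference_cache': False, 'pin_parameters': True,
                        'tp_gather_partition_size': 8}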
This is the testing environment I used:
transformers==4.30.0.dev0
deepspeed==0.9.0
@yaozhewei Same error when training Llama: steps 1 and 2 are normal, but step 3 just won't converge.
I am testing the 1.3B training. Steps 1 and 2 have already passed, but there is no change in reward after completing step 3.
I used LoRA to train for one iteration, and the results of steps 1 and 2 are as follows:
step 1: ppl: 2.18959641456604
step 2:
step 3:
I had ChatGPT extract the logs for step 3 and compare them with the demo logs provided in the project. I found that the absolute value of my loss is significantly smaller, and the reward seems to be completely random, without any noticeable increase.