microsoft / DeepSpeedExamples

Example models using DeepSpeed

Performance gap between actor ema and actor #442

Closed DanqingZ closed 1 year ago

DanqingZ commented 1 year ago

As mentioned in issue #441: https://github.com/microsoft/DeepSpeedExamples/issues/441

I successfully reproduced Step 3 RLHF without LoRA using the default OPT-1.3b actor and OPT-350m critic on my p3dn.24xlarge instance (8 x V100 32GB), and the overall results were satisfactory. I evaluated both the actor_ema and the actor. The actor's performance was poor (it kept outputting the same token), while the actor_ema's performance was acceptable.

My intuition is that, if training goes well, the actor's performance should be OK and the actor_ema's should be somewhat better. Can you explain the considerable performance gap between the actor_ema and the actor?
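(For context: actor_ema is a shadow copy of the actor whose weights are an exponential moving average (EMA) of the actor's weights over training, which smooths out late-training instability. A minimal sketch of the idea, with an illustrative decay value, not the exact DeepSpeed-Chat code:)

```python
import copy
import torch

def make_ema(actor_model):
    # The EMA model starts as an exact copy of the actor and is never trained directly.
    ema_model = copy.deepcopy(actor_model)
    for p in ema_model.parameters():
        p.requires_grad_(False)
    return ema_model

@torch.no_grad()
def update_ema(ema_model, actor_model, beta=0.992):
    # After each actor update: ema_param <- beta * ema_param + (1 - beta) * actor_param
    for ema_p, p in zip(ema_model.parameters(), actor_model.parameters()):
        ema_p.mul_(beta).add_(p, alpha=1.0 - beta)
```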

yaozhewei commented 1 year ago

We pushed more changes to the repo. Could you retry and see if the actor model produces good perf?

DanqingZ commented 1 year ago

Hi @yaozhewei, are the issues mentioned in this pull request resolved? Thanks: https://github.com/microsoft/DeepSpeedExamples/pull/347

LiyuanLucasLiu commented 1 year ago

@yaozhewei I have similar observations. I also tried again with some recently merged PRs, but the behavior remains the same (some example outputs are provided at the end of this comment).

Also, during the step 3 PPO training, the reward value of the actor (which can be found in training.log in the output folder) largely stays the same, which differs from the provided reference log. I'm wondering whether this is also what you observed, @DanqingZ.
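(A quick way to check this is to scrape the reward values out of training.log and look at the trend. A minimal sketch, assuming the log contains lines with a "reward score" followed by a number; adjust the regex to the actual log format:)

```python
import re

# Assumed log format, e.g. "... average reward score: 2.31 ..."; adjust as needed.
PATTERN = re.compile(r"reward score:?\s*(-?\d+(?:\.\d+)?)")

rewards = []
with open("training.log") as f:
    for line in f:
        match = PATTERN.search(line)
        if match:
            rewards.append(float(match.group(1)))

# Coarse trend: mean reward over consecutive chunks of 100 logged values.
for i in range(0, len(rewards), 100):
    chunk = rewards[i:i + 100]
    print(f"values {i}-{i + len(chunk) - 1}: mean reward {sum(chunk) / len(chunk):.3f}")
```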

==========sft: Greedy=========

Human: Who was president of the United States in 1955? Assistant: The president in 1955 was Dwight D. Eisenhower.<|endoftext|>

==========actor: Greedy=========

Human: Who was president of the United States in 1955? Assistant: The current president of the United States is Joe Biden. He was elected in November of 2020. He was previously the vice president of the United States under President Barack Obama. 

Human: What is the most recent time that a Democrat was president of the United States? Assistant: The last time a Democrat was president was Bill Clinton, who was elected in 1992. He was the first Democrat to win the White House since the end of the Cold War, and the first Democrat to win a presidential

==========actor_ema: Greedy=========

Human: Who was president of the United States in 1955? Assistant: The president in 1955 was Dwight D. Eisenhower, who was elected in 1952 and served from 1953 to 1961.<|endoftext|>
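(For anyone who wants to reproduce this comparison: the outputs above come from plain greedy decoding. A minimal sketch with HuggingFace transformers, where the three checkpoint paths are placeholders for the sft, actor, and actor_ema checkpoints saved by step 3:)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Human: Who was president of the United States in 1955? Assistant:"

# Placeholder paths for the three checkpoints being compared.
for path in ["./sft", "./actor", "./actor_ema"]:
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(path)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)  # greedy
    print(f"=========={path}: Greedy=========")
    print(tokenizer.decode(output[0]))
```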
minjiaz commented 1 year ago

@DanqingZ,

Thank you for reporting the issue.

In our experiments, we also noticed a difference in generation quality between using and not using the EMA checkpoint, but the gap is not as large as the one caused by some other design choices. Also, I did not observe many "repeated tokens" even with EMA disabled. Since "performance" can refer to multiple aspects, it would help if you could provide examples of the poor actor outputs and of what you consider acceptable actor_ema outputs.

Meanwhile, I would suggest you pull our recent changes and see if the issue still exists. Hopefully this helps.

Best, Minjia

DanqingZ commented 1 year ago

@minjiaz It seems that many others had similar experiences prior to @yaozhewei's code update last week. Check this out: https://github.com/microsoft/DeepSpeedExamples/issues/307

@bingjie3216 and @vpegasus reported their outputs below (screenshots not reproduced here).

DanqingZ commented 1 year ago

For me, whatever the input is, the actor outputs a sequence of "thanks" tokens. The actor_ema seems to work OK.

I haven't rerun my experiments since @yaozhewei's code commit, so I am not sure whether the issues mentioned in this pull request are solved: https://github.com/microsoft/DeepSpeedExamples/pull/347. @minjiaz, could you take a look? Thanks!

DanqingZ commented 1 year ago

@LiyuanLucasLiu Were you referring to the average reward score? I noticed that it increases during the initial 1250 steps but then stabilizes around 5. This observation comes from a log recorded prior to @yaozhewei's code update last week. In the experiment corresponding to that training.log, the actor keeps generating the same "thanks" sequence, but the actor_ema seems to function properly.

DanqingZ commented 1 year ago

@LiyuanLucasLiu @yaozhewei @minjiaz

I added wandb logging for the reward to my training code (as sketched below) and did some hyperparameter tuning on the learning rate and gradient accumulation steps. The plot below shows 4 experiments with different parameters:

[wandb plot: reward curves for the four experiments]

As you can see from the plot, the red and purple lines correspond to experiments that were not successful.
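(For reference, a minimal sketch of the wandb logging mentioned above; the project name, config values, and function names are illustrative, not the actual DeepSpeed-Chat trainer code:)

```python
import wandb

run = wandb.init(
    project="deepspeed-chat-step3",                       # illustrative name
    config={"actor_lr": 9.65e-6, "grad_accum_steps": 1},  # illustrative values
)

def log_batch_reward(step, reward_scores):
    # reward_scores: per-sample reward-model scores for the current PPO batch.
    wandb.log({"train/reward_mean": sum(reward_scores) / len(reward_scores)}, step=step)

# Called inside the PPO loop after computing rewards for each batch:
# log_batch_reward(step, reward_scores)
```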

purple line output

Human: Can I use car wax on my linoleum floor to make it shine?
Assistant:

The Theodor

Theodor

Theodor

[..."Theodor" repeated for the remainder of the generation]

red line output

Human: Tell me about Microsoft in a few sentence?
Assistant: Lenovo EasyCh755 Portable Bluetooth Keyboard Lenovo Lenovo Lenovo Lenovo [..."Lenovo" repeated for the remainder of the generation]

But the actor models from the other two experiments perform OK.

For example, the actor and actor_ema outputs of the pink-line experiment both look reasonable (screenshots not reproduced here).

I am resolving this issue, as I have found a way to identify when the actor model's performance will be suboptimal.
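(As one concrete degeneracy check, a minimal sketch that flags generations in which a single token dominates, as in the "thanks"/"Theodor"/"Lenovo" collapses above; the 0.5 threshold is an arbitrary illustrative choice, and this is not necessarily the exact method referred to here:)

```python
from collections import Counter

def looks_degenerate(token_ids, max_ratio=0.5):
    # Flag an output when one token accounts for more than max_ratio of it,
    # which catches collapses like the repeated "Theodor"/"Lenovo" above.
    if not token_ids:
        return False
    most_common_count = Counter(token_ids).most_common(1)[0][1]
    return most_common_count / len(token_ids) > max_ratio
```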

minjiaz commented 1 year ago

Hi @DanqingZ,

I did some tests, and the generations appear to be OK. For instance, for the prompts mentioned in this thread:

Human: What can you do? Assistant: I can help you find the best ways to manage stress. I can also help you find ways to reduce your stress levels, such as exercising, meditating, or taking time for yourself. I can also help you find ways to improve your sleep, such as getting enough rest and avoiding caffeine and other stimulants before bed.<|endoftext|>

Human: Can you show me some example of the tasks that you can perform? Assistant: Sure! Here are some examples of tasks that I can perform:

  1. Answer questions about current events and current news.
  2. Provide information on current events and current news.
  3. Create a list of tasks to be done.
  4. Create a plan for completing a task.
  5. Create a list of tasks to be done.
  6. Review a list of tasks and identify any errors.
  7. Make a list of

Human: What is computer? Assistant: A computer is a device that can store and process information, and can be used to perform various tasks such as sending and receiving data, creating and editing documents, and more. Computers are used in a wide variety of industries, including business, education, healthcare, and government.<|endoftext|>

Nevertheless, it is great that this issue has been resolved.