Open · metaqiang opened this issue 5 days ago
I'm not sure why this happens. How did you modify `num_hidden_layers`? Did you make sure that both the vLLM and Megatron model configurations were modified correctly?
I have three suggestions for debugging:
Thank you! We modify `num_hidden_layers` by changing the Hugging Face config file. Is this way wrong? Is there a better way to change `num_hidden_layers`?
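For concreteness, here is a minimal sketch of the kind of edit described above: changing `num_hidden_layers` in the checkpoint's `config.json`. The path is a placeholder, and note that this only changes the declared layer count; the weights on disk are untouched.

```python
import json

# Placeholder path to the local Hugging Face checkpoint directory.
config_path = "deepseek-llm-7b-chat/config.json"

with open(config_path) as f:
    config = json.load(f)

print("before:", config["num_hidden_layers"])  # 30 for the original model
config["num_hidden_layers"] = 15  # reduced layer count

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```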
Description
When running the `examples/ppo_trainer/run_deepseek_megatron.sh` script with the base model `deepseek-llm-7b-chat`, I encountered unexpected behavior related to the `num_hidden_layers` parameter. Originally, the model has `num_hidden_layers` set to 30, and the rollout time is approximately 35 seconds. I modified `num_hidden_layers` to 15, anticipating that the rollout time would roughly halve. However, the rollout time instead increased to about 71 seconds.

Steps to Reproduce
Original Configuration: `examples/ppo_trainer/run_deepseek_megatron.sh` with the base model `deepseek-llm-7b-chat` having `num_hidden_layers=30`.
Modified Configuration: changed the `num_hidden_layers` parameter in the model configuration from 30 to 15.

Expected Behavior
Reducing `num_hidden_layers` from 30 to 15 should lead to a proportional decrease in rollout generation time, ideally halving it from around 35 seconds to approximately 17-18 seconds.

Actual Behavior
After modifying `num_hidden_layers` to 15, the rollout time unexpectedly doubled from ~35 seconds to ~71 seconds.

Additional Information
Model Structure (after reducing):
Timing Code:
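The original timing snippet is not reproduced here. As a stand-in, this is a minimal sketch of how rollout wall-clock time might be measured; `rollout_fn` is a hypothetical placeholder for whatever call actually performs generation (e.g. the vLLM engine's generate step):

```python
import time

def timed_rollout(rollout_fn, prompts):
    """Measure the wall-clock time of one rollout generation step.

    rollout_fn is a hypothetical placeholder for the actual
    generation call; prompts is the batch passed to it.
    """
    start = time.perf_counter()
    outputs = rollout_fn(prompts)
    elapsed = time.perf_counter() - start
    print(f"rollout took {elapsed:.1f} s")
    return outputs
```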
Question
What could be causing the rollout time to increase when reducing `num_hidden_layers` from 30 to 15 in the `deepseek-llm-7b-chat` model? Are there any configuration or implementation issues that might lead to this performance degradation?
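One way to act on the configuration question in the comment above is to confirm how many decoder layers are actually instantiated from the edited config. Below is a hedged sketch using `transformers`, assuming the LLaMA-style module layout this model family uses; the local path is a placeholder:

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_path = "deepseek-llm-7b-chat"  # placeholder local checkpoint path

# What the config declares after the edit.
config = AutoConfig.from_pretrained(model_path)
print("config num_hidden_layers:", config.num_hidden_layers)

# What actually gets built. If the checkpoint on disk still holds 30
# layers of weights, expect warnings about unused checkpoint weights.
model = AutoModelForCausalLM.from_pretrained(model_path)
print("instantiated layers:", len(model.model.layers))
```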