I was not able to reproduce the same issue. Probably the related bugs were already fixed in the latest main.
The report is here https://wandb.ai/costa-huang/trl/reports/deepspeed-test--Vmlldzo1MTc3NDcw
And here is one of the runs: https://wandb.ai/costa-huang/trl/runs/viz5drqj/logs?workspace=user-costa-huang, and its logs seem to indicate deepspeed is running as expected.
Thanks a lot for diving into this! I'm still getting a negative KL even after bumping `trl` to main - can you share the `accelerate` and `transformers` dependencies you're using?
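A quick way to capture the exact versions of the relevant packages (a minimal sketch):

```python
import accelerate, deepspeed, transformers, trl

# Print the exact versions of the packages involved in this setup.
for pkg in (accelerate, deepspeed, transformers, trl):
    print(pkg.__name__, pkg.__version__)
```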
Ah if you look closely at the KL divergence of your run (https://wandb.ai/costa-huang/trl/runs/viz5drqj?workspace=user-lewtun), one sees that it is indeed still slightly negative:
Since step 0 should be a direct match between the reference & active models, it would make sense to see if we can understand why deepspeed is causing this difference. One possibility is that deepspeed is putting the active model in train mode (e.g. with dropout) while the reference model is in eval mode (no dropout).
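As a minimal illustration of the dropout hypothesis (a plain PyTorch toy model, not the actual trl setup): with identical weights, a copy left in train mode no longer produces the same log probs as a copy in eval mode, so per-sample log-ratio terms can go negative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Dropout(p=0.1), nn.Linear(16, 4))
x = torch.randn(2, 16)

model.train()  # dropout active, as the active policy might be under the trainer
logp_active = torch.log_softmax(model(x), dim=-1)

model.eval()   # dropout disabled, as the frozen reference model should be
logp_ref = torch.log_softmax(model(x), dim=-1)

# With truly identical forward passes every entry below would be exactly zero;
# dropout makes them nonzero, and individual log-ratio terms can be negative
# even though the true KL divergence is non-negative.
print(logp_active - logp_ref)
```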
> Thanks a lot for diving into this! I'm still getting a negative KL even after bumping `trl` to main - can you share the `accelerate` and `transformers` dependencies you're using?
https://wandb.ai/costa-huang/trl/runs/viz5drqj/files/requirements.txt has all dependencies. I used your `accelerate` config in the issue description.
```yaml
# config.yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
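For what it's worth, when the script is started through `accelerate launch --config_file config.yaml ...`, the settings above can be verified at runtime (a small sketch; the printed values are what one would expect, not measured output):

```python
from accelerate import Accelerator

# The Accelerator picks up the DeepSpeed plugin and mixed precision from the
# config file passed to `accelerate launch`.
accelerator = Accelerator()
print(accelerator.distributed_type)        # expected: DistributedType.DEEPSPEED
print(accelerator.mixed_precision)         # expected: bf16
print(accelerator.state.deepspeed_plugin)  # expected: zero_stage=2, no offload
```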
> Since step 0 should be a direct match between the reference & active models, it would make sense to see if we can understand why deepspeed is causing this difference. One possibility is that deepspeed is putting the active model in train mode (e.g. with dropout) while the reference model is in eval mode (no dropout).
Interesting. Thanks for bringing up this point. I will look into it!
An explanation could be that maybe the model is not in eval mode? In that case you could see a little bit of noise even if the models are identical.
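One way to test that hypothesis (a rough sketch using the plain `transformers` classes, not the fix that eventually landed) is to load both copies of `gpt2` with every dropout probability zeroed out, so that train vs eval mode can no longer change the log probs:

```python
from transformers import AutoModelForCausalLM

# Config overrides passed through from_pretrained disable all gpt2 dropout,
# so any remaining log-prob mismatch cannot be blamed on train vs eval mode.
no_dropout = dict(resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0)
model = AutoModelForCausalLM.from_pretrained("gpt2", **no_dropout)
ref_model = AutoModelForCausalLM.from_pretrained("gpt2", **no_dropout)
ref_model.eval()  # the frozen reference model should stay in eval mode regardless
```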
Closed by #758 (the root cause of the issue was using `bf16` mixed precision without properly initialising the reference model with deepspeed).
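For anyone hitting the same symptom, the general shape of that fix (a rough sketch of the idea only, not the actual patch in #758; the config values below are assumptions) is to give the frozen reference model its own DeepSpeed engine so it goes through the same bf16 handling as the active model, instead of being left as a plain module:

```python
import deepspeed
from transformers import AutoModelForCausalLM

ref_model = AutoModelForCausalLM.from_pretrained("gpt2")

# Illustrative DeepSpeed config for the reference model only.
ds_ref_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder; match the per-device PPO batch size
    "bf16": {"enabled": True},            # same precision as the active model
    "zero_optimization": {"stage": 0},    # the frozen reference model needs no sharding
}

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler);
# only the engine matters for a model that is never trained. This is meant to
# run under the usual accelerate/deepspeed launcher.
ref_engine, *_ = deepspeed.initialize(model=ref_model, config=ds_ref_config)
ref_engine.eval()
```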
Hello, while testing out the DeepSpeed ZeRO-2 plugin in the sentiment example for `gpt2`, I noticed that the KL divergence starts out negative. This suggests the model parameters of the reference and active model are being sharded in a peculiar manner that produces a mismatch in the log probs. Below is a screenshot from WandB which shows the pure DDP baseline in teal vs the Z3 curve in purple:
Code to reproduce
I ran this on 2 x A100 (80GB) machines, but that's overkill for this example :)
Accelerate config
Script
Run with
Env