~I found that on a new docker container it works well as I expected, but weird things happen when running in a conda env.~
Sorry for the confusion, it behaves weirdly when using multiple GPUs.
Can you do a `pip freeze` in both for us please? :)
And the output of `accelerate env` (it tells us more than just what you entered when doing `accelerate config`! :))
Sorry for the confusion 🥲 - I only tested on a docker container running on CPU only, not multi-GPU :( If I run the test code above on the docker container with multi-GPU, it behaves weirdly as I reported...
- `nvcr.io/nvidia/pytorch:23.12-py3` image
- `apt-get update`, `apt-get upgrade`
- `pip install transformers accelerate timm`
I'm sharing my `pip freeze` results and `accelerate env` output, and I'll also share the working environment if I solve this issue...
At first I thought my machine was the problem, but when I ran the same test code on another machine with 2x 3090 GPUs, the same problem occurred (the LR scheduler behaves as if each GPU calls it).
I think I'm missing something (are there any possible bugs in my test code???)
Oh wait. Just reread this issue.
> (it behaves like all the GPUs call the learning rate scheduler - warmup 8 / 4 = 2, 32 / 4 = 8)
yes. That’s exactly how our scheduler wrapper behaves (and how you should step in multi-GPU)
I hope adding this information to the docs or somewhere may be useful for newbies like me! (I kept wondering which part of my machine was causing these errors.)
I looked at the Accelerate CV example, the Accelerate NLP example, the transformers scheduler docs, and the Accelerate tutorials, but it was hard to reach this conclusion 🥲
Accelerate works by splitting up the dataloader between all GPUs, so one epoch is faster (every GPU sees a different subset). In this same vein, we can then also step the LR scheduler by `n_gpus` steps per call, since we are doing those steps all at once.
Check out the debugging guide, which talks about this: https://huggingface.co/docs/accelerate/concept_guides/performance
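To make this concrete, here is a minimal sketch (not the original `test.py`; the dummy model, data, and loop are assumptions for illustration) showing that a scheduler passed through `accelerator.prepare` is advanced `num_processes` times per `scheduler.step()` call, and how you can scale the step counts if you want the same curve as a single-GPU run:

```python
# Minimal sketch, assuming 4 processes and a dummy model (not the original test.py).
import torch
from accelerate import Accelerator
from transformers import get_cosine_schedule_with_warmup

accelerator = Accelerator()

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# The prepared scheduler is stepped `num_processes` times per call,
# so scale the counts to reproduce the single-GPU curve (8 warmup / 32 total).
num_warmup = 8 * accelerator.num_processes
num_total = 32 * accelerator.num_processes
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=num_warmup, num_training_steps=num_total
)

model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)

for step in range(32):
    optimizer.zero_grad()
    loss = model(torch.randn(16, 10)).sum()
    accelerator.backward(loss)
    optimizer.step()
    scheduler.step()  # advances the wrapped scheduler num_processes times
    if accelerator.is_main_process:
        print(step, scheduler.get_last_lr())
```

Alternatively, if you want exactly one underlying scheduler step per call, you can keep the scheduler out of `accelerator.prepare` and step your own unwrapped instance; since every process calls it the same number of times, the learning rates stay in sync.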
System Info

Information

Tasks

- `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
I used the test code `test.py` below on my machine with 4 A6000 GPUs. I run `NCCL_P2P_DISABLE=1 accelerate launch test.py` in the terminal. (If I do not use `NCCL_P2P_DISABLE=1`, training doesn't work, so I add it.)

Expected behavior
Since I used `warmup_steps=8, num_training_steps=32`, I expected to get a learning rate graph similar to the one above (captured from the Hugging Face Optimization docs). But when I run and track the learning rate, it does not behave as expected.
Warmup takes only 2 steps, and the cosine cycles do not work as I expected. (It behaves like all the GPUs call the learning rate scheduler - warmup 8 / 4 = 2, 32 / 4 = 8.)
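For reference, since the original `test.py` is not reproduced here, the sketch below is a hypothetical minimal reproduction under the same settings (4 processes, `warmup_steps=8`, `num_training_steps=32`, dummy model and data) that exhibits the reported behavior when the scheduler is passed through `accelerator.prepare`:

```python
# Hypothetical minimal reproduction (not the original test.py): dummy model and data,
# warmup_steps=8, num_training_steps=32, launched with `accelerate launch` on 4 GPUs.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from transformers import get_cosine_schedule_with_warmup

accelerator = Accelerator()

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=8, num_training_steps=32
)

# 128 samples / batch size 4 = 32 batches in total, split across the GPUs (8 per process).
dataset = TensorDataset(torch.randn(128, 10), torch.randint(0, 2, (128,)))
loader = DataLoader(dataset, batch_size=4)

model, optimizer, loader, scheduler = accelerator.prepare(
    model, optimizer, loader, scheduler
)

for step, (x, y) in enumerate(loader):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    accelerator.backward(loss)
    optimizer.step()
    scheduler.step()
    if accelerator.is_main_process:
        # With 4 processes, the printed LR finishes warmup after 2 local steps
        # and completes the cosine schedule after 8, matching the report above.
        print(step, scheduler.get_last_lr())
```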