Closed marhlder closed 6 days ago
Hi @marhlder, thanks for the detailed report. This is indeed a big issue. If you have time, could you share a minimal reproducer? Does this happen only in the DDP setup or also when training on only one GPU? Thanks a lot! cc @muellerzr
We have not tested this on a single-GPU setup yet, as single A100 GPU configurations in GCP are currently "unobtainium". It's going to take me some time to reduce this to a small reproducible example, as our current setup is quite modular / split into many files.
Got it! Keep us updated! In the meantime, we will also try to replicate and fix the issue! cc @muellerzr
Hmm, it appears that it's maybe not related to Accelerate after all. I mistakenly thought it was fixed just by downgrading to 0.26.1, but it seems it's not. I will investigate further.
Okay, I'm sorry, but it was a false alarm after all. It turns out that I was also switching between two data setups when I switched between the two versions of Accelerate.
It turns out that the real underlying issue is this thing from Huggingface's Datasets library: https://github.com/huggingface/datasets/issues/6637
What got me confused was that the GPU utilization reported by GCP was still very high, so I didn't suspect a data loading problem. But I guess it was possibly copying back and forth between CPU and GPU, or doing some kind of polling to get the data?
Anyway, not using the with_format() API and performing my own map() operation to convert my values into tensors works much better. It's still slower overall, but NOT due to Accelerate, it seems.
Awesome! Thanks for the update @marhlder!
System Info
Information
Tasks
An officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
We are training a rather large 1.3-billion-parameter T5 model with MoE / Switch mechanisms on a 16 x A100 GPU machine in GCP. The model works on long input sequences (3072) and shorter output sequences (192). We tried to update our accelerate library from 0.26.1 to the latest version (0.31.0), but we experienced a huge increase in training time. The model was still seemingly learning well without any issues. The main motivation for updating our dependency was this issue: https://github.com/huggingface/accelerate/issues/1050, which didn't seem to be fixed anyway.
The Accelerator object is configured like this:
The model object is compiled before being sent through the accelerate prepare method
self.model.model = torch.compile(self.model.model)
Accelerate prepare is called like this:
Main training loop looks something like this:
The optimizer_step() function is defined like this:
Logs from before:
Logs after updating:
These logs show that each iteration over the dataset in the training loop is now significantly slower.
Rolling back to 0.26.1 brings the performance back to the expected levels. We are running with gradient accumulation = 16. CPU and GPU utilization seems comparable in both cases: around 20% for CPU and 96% on average for GPU.
Expected behavior
Expected behavior is similar or possibly better performance when updating Accelerate, or some kind of documentation of what we need to change in order to bring us back to the expected performance.