GradientState _add_dataloader never called

huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support

https://huggingface.co/docs/accelerate

Apache License 2.0

7.34k stars 875 forks source link

GradientState _add_dataloader never called #2831

Closed RuiningLi closed 4 weeks ago

RuiningLi commented 4 weeks ago

During training, I discovered that the gradient doesn't get synced at the end of the dataloader.

After some further investigation, I found this relies on the following code:

if self.gradient_state.sync_with_dataloader and self.gradient_state.end_of_dataloader:
            self.step = 0
            self.gradient_state._set_sync_gradients(True)

which in turn relies on gradient_state to have active_dataloader set to a non-None value. This could be done using the _add_dataloader function defined in GradientState class.

However, in Accelerator, this function is nevered called anywhere. So this logic could not be executed.

Is this the desired behavior? Or am I missing something?

SunMarc commented 4 weeks ago

Hi @RuiningLi, thanks for reporting ! Could you share a minimal reproducer ? That would be very helpful ! cc @muellerzr

muellerzr commented 4 weeks ago

It is called as part of the DataLoaderStateMixin here: https://github.com/huggingface/accelerate/blob/main/src/accelerate/data_loader.py#L384

Which then gets called during the __iter__ of the prepared DataLoaders: https://github.com/huggingface/accelerate/blob/main/src/accelerate/data_loader.py#L448

Also we rely on end_of_dataloader, which gets added during the last batch. So a full reproducer would indeed be necessary, because that should not be the behavior https://github.com/huggingface/accelerate/blob/main/src/accelerate/data_loader.py#470

RuiningLi commented 4 weeks ago

Thanks for your replies @SunMarc @muellerzr ! I will try to get a minimal reproducer. But it's likely it's just me being stupid -- I will close the issue for now!

muellerzr commented 4 weeks ago

No worries, don't hesitate to re-open/ping us if you find the cause!