huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

DeepSpeedEngineWrapper.backward() does a bit too much #2951

Open · zhc7 opened this issue 2 months ago

zhc7 commented 2 months ago

The source code of DeepSpeedEngineWrapper:

```python
class DeepSpeedEngineWrapper:
    """
    Internal wrapper for deepspeed.runtime.engine.DeepSpeedEngine. This is used to follow conventional training loop.

    Args:
        engine (deepspeed.runtime.engine.DeepSpeedEngine): deepspeed engine to wrap
    """

    def __init__(self, engine):
        self.engine = engine

    def backward(self, loss, **kwargs):
        # runs backpropagation and handles mixed precision
        self.engine.backward(loss, **kwargs)

        # Deepspeed's `engine.step` performs the following operations:
        # - gradient accumulation check
        # - gradient clipping
        # - optimizer step
        # - zero grad
        # - checking overflow
        # - lr_scheduler step (only if engine.lr_scheduler is not None)
        self.engine.step()
        # and this plugin overrides the above calls with no-ops when Accelerate runs under
        # Deepspeed, but allows normal functionality for non-Deepspeed cases thus enabling a simple
        # training loop that works transparently under many training regimes.
```
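For context, this wrapper is what `Accelerator.backward` delegates to when running under DeepSpeed. A rough, paraphrased sketch of that dispatch (not the verbatim source, which also handles scalers, Megatron-LM, and gradient accumulation):

```python
# Paraphrased sketch of accelerate's Accelerator.backward dispatch.
def backward(self, loss, **kwargs):
    if self.distributed_type == DistributedType.DEEPSPEED:
        # hands off to DeepSpeedEngineWrapper.backward above,
        # which runs engine.backward() AND engine.step()
        self.deepspeed_engine_wrapped.backward(loss, **kwargs)
    else:
        loss.backward(**kwargs)
```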

My question is: why does the wrapper call self.engine.step() here immediately? This zeros the gradients and updates the parameters without notifying the user, which can be unexpected. Since the backward pass is internally bound to zeroing the gradients and updating the parameters, users cannot inspect the gradients or parameters manually before the step.
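To make this concrete, here is a minimal sketch of the surprise (standard accelerate objects; `batch` and the `.loss` access are illustrative):

```python
loss = model(**batch).loss
accelerator.backward(loss)     # under DeepSpeed this also runs engine.step()

# Too late: the update was already applied and grads zeroed inside backward().
for name, p in model.named_parameters():
    print(name, p.grad)        # typically None/zero here (and partitioned under ZeRO)

optimizer.step()               # no-op under DeepSpeed (overridden by accelerate)
```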

I know DeepSpeed-wrapped models can't be treated as normal models, but this behavior still eliminates a lot of flexibility.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

muellerzr commented 2 weeks ago

Hitting this snag right now actually; will see what we decide to do.

muellerzr commented 1 week ago

We're working with the DeepSpeed team to try to remove the engine entirely. However, as a user you can always call model.engine.backward() etc. manually in accelerate without harm.
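A minimal sketch of that workaround, assuming the object returned by `accelerator.prepare` exposes the underlying `DeepSpeedEngine` (the exact attribute path may vary by version; `dataloader` and the `.loss` access are illustrative):

```python
# Fall back to the prepared model itself if it IS the engine.
engine = getattr(model, "engine", model)

for batch in dataloader:
    loss = model(**batch).loss
    engine.backward(loss)   # backprop + mixed-precision handling only

    # Gradients can be inspected or modified here, before any update
    # (note: under ZeRO they may be partitioned across ranks).

    engine.step()           # clipping, optimizer step, zero_grad, lr_scheduler
```

Splitting the two calls this way recovers the conventional backward/step boundary that DeepSpeedEngineWrapper.backward currently collapses.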