huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

_prepare_deepspeed fails to capture the correct kwargs with DummyOptim or DummyScheduler when calling prepare() multiple times #3134

Open Jason3900 opened 2 weeks ago

Jason3900 commented 2 weeks ago

System Info

accelerate==0.34.2
python==3.10
deepspeed==0.15.1

Reproduction

Hey, since I may want to prepare only certain items depending on my training arguments (suppose I don't want to prepare the scheduler this time), I decided to collect them in an ordered dict and call the prepare function once per item, because the set of items is not fixed. After that, I use setattr to assign each prepared object back to its attribute. It works perfectly until I change my code to support the DeepSpeed plugin.

        # handle scheduler manually
        accelerator_to_prepare = OrderedDict(
            [
                ("optimizer", self.optimizer),   
                ("train_dataloader", self.train_dataloader),
                ("valid_dataloader", self.valid_dataloader),
                ("lr_scheduler", self.lr_scheduler),
                ("model", self.model),
            ]
        )
        if self.use_gan:
            accelerator_to_prepare["discriminator"] = self.discriminator

        for k, v in accelerator_to_prepare.items():
            self.print_global_rank_0(f"start prepare {k}")
            setattr(self, k, self.accelerator.prepare(v))

In the accelerator's _prepare_deepspeed function, the prepared items are inspected to find the corresponding optimizer and scheduler, and the kwargs passed to them are then fed into the DeepSpeed config to make everything work. But since I call the prepare method multiple times, it only sees the arguments of the last call, so result contains just one item ([model] in my case). It therefore cannot find the kwargs needed by the optimizer and scheduler (because they are set to "auto" in the DeepSpeed config), which makes deepspeed_config_process fail with an error.

        model = None
        optimizer = None
        scheduler = None
        for obj in result:
            if isinstance(obj, torch.nn.Module):
                model = obj
            elif isinstance(obj, (torch.optim.Optimizer, DummyOptim)):
                optimizer = obj
            elif (isinstance(obj, (LRScheduler, DummyScheduler))) or (
                type(obj).__name__ in deepspeed.runtime.lr_schedules.VALID_LR_SCHEDULES
            ):
                scheduler = obj
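
For context, this is roughly the setup the loop above has to support: a DeepSpeed config with "auto" entries whose values come from the kwargs of the DummyOptim/DummyScheduler that prepare() receives. A minimal sketch (the tiny model and the concrete numbers are illustrative only; the config field names follow the usual DeepSpeed JSON schema):

    import torch
    from accelerate.utils import DummyOptim, DummyScheduler

    # Illustrative DeepSpeed config: the "auto" entries are what
    # deepspeed_config_process is supposed to fill in from the optimizer and
    # scheduler found in `result`.
    ds_config = {
        "optimizer": {"type": "AdamW", "params": {"lr": "auto", "weight_decay": "auto"}},
        "scheduler": {
            "type": "WarmupLR",
            "params": {"warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto"},
        },
        "train_micro_batch_size_per_gpu": "auto",
    }
    # (ds_config would be handed to Accelerator via DeepSpeedPlugin(hf_ds_config=ds_config);
    # omitted here to keep the sketch self-contained.)

    model = torch.nn.Linear(8, 8)  # placeholder model
    # The kwargs passed here (lr, weight_decay, warmup_num_steps, ...) are what the
    # "auto" values resolve to -- but only if these objects appear in the same
    # prepare() call as the model, so that the loop above can find them.
    optimizer = DummyOptim(model.parameters(), lr=1e-4, weight_decay=0.01)
    lr_scheduler = DummyScheduler(optimizer, warmup_num_steps=500)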

Expected behavior

I think accelerate should handle this scenario.

BenjaminBossan commented 1 week ago

Do you really need to call prepare multiple times? You should be able to run prepare in a single call, right?

return_values = self.accelerator.prepare(*accelerator_to_prepare.values())
for k, val in zip(accelerator_to_prepare.keys(), return_values):
    setattr(self, k, val)
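
This keeps the conditional set of items from your snippet (e.g. adding the discriminator only when use_gan is set) while making sure that _prepare_deepspeed sees the model, optimizer, and scheduler in the same call, so it can pick up the kwargs it needs for the "auto" entries.
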
Jason3900 commented 1 week ago

Yeah, it's okay. But I think it would be nicer if you pointed it out in the documentation or fixed the logic internally. Otherwise, it might be confusing, and users might struggle to find the problem.

BenjaminBossan commented 1 week ago

The DeepSpeed init logic is probably not easy to fix, but I'll wait for Zach's return to comment on that. Regarding the docs: yes, it should probably be highlighted that, to be on the safe side, everything should be passed in a single prepare call.