open-mmlab / mmengine

OpenMMLab Foundational Library for Training Deep Learning Models
https://mmengine.readthedocs.io/
Apache License 2.0

[Feature Request] Support deepspeed integration #627

Open nijkah opened 2 years ago

nijkah commented 2 years ago

Describe the feature

Motivation

Nowadays, DeepSpeed has become a fundamental framework that facilitates training and inference for large-scale or foundation models. We are developing a feature to integrate DeepSpeed into mmengine, with support for a DeepSpeed-specific runner and optim_wrapper.

Does MMEngine have a plan to support DeepSpeed? If so, we can contribute our implementation to MMEngine :)

Please let me know of any guides, plans, or opinions about this. :)

C1rN09 commented 2 years ago

Hi, @nijkah We welcome any kind of contribution, and DeepSpeed integration is definitely something we want! However, could you clarify what you mean by a "DeepSpeed-specific runner and optim_wrapper"? If you are going to write a new runner that only serves DeepSpeed models, that does not seem quite reasonable and we might need more discussion on it ^^

C1rN09 commented 2 years ago

Hi, @nijkah Have you made any progress on the DeepSpeed integration? We hope to discuss it before you post a PR, because it might not be a small and easy one. If you have any ideas, problems, or progress, we are always open to a discussion, either in this issue or on our discussion board.

nijkah commented 2 years ago

Hi, @C1rN09. Our integration development is almost done, although there are still several choices left to consider.

Our current implementation supports:

  1. ZeRO stage 1, 2, and 3
  2. Logic for saving a monolithic checkpoint (DeepSpeed saves model weights and optimizer states in separate files, and the number of saved files is multiplied by the world_size; see the sketch after the lists below.)

It does not yet support:

  1. FP16 (There is a way to support it, but the solution is quite messy.)
  2. Mixture of Experts
  3. Pipeline Parallelism (It requires logic to sequentialize MM models.)
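
For reference, one way to produce a monolithic weight file from DeepSpeed's sharded checkpoint files is DeepSpeed's bundled zero_to_fp32 utility. The sketch below only illustrates that consolidation step; the checkpoint path is made up and this is not necessarily how our implementation does it.

    # Minimal sketch: consolidate DeepSpeed ZeRO checkpoint shards into a
    # single fp32 state dict. The 'work_dir' path is illustrative only.
    import torch
    from deepspeed.utils.zero_to_fp32 import (
        get_fp32_state_dict_from_zero_checkpoint)

    # Each rank writes its own weight/optimizer shards under the checkpoint
    # directory; this gathers them into one CPU state dict that can be
    # loaded without DeepSpeed.
    state_dict = get_fp32_state_dict_from_zero_checkpoint('work_dir')
    torch.save(state_dict, 'work_dir/model_monolithic.pth')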

There are several reasons why we are writing a new DeepSpeed-dedicated runner. Although we follow most of mmengine's Runner logic, some modifications are needed to support DeepSpeed.

The main logic of DeepSpeedRunner looks like this:

    import json

    import deepspeed

    # Build the model and optim_wrapper with mmengine's usual logic first.
    self.model = self.build_model(model)
    self.optim_wrapper = self.build_optim_wrapper(optim_wrapper)

    # Hand the model and optimizer over to DeepSpeed. deepspeed.initialize
    # returns (engine, optimizer, dataloader, lr_scheduler); only the first
    # two are needed here.
    with open(cfg.deepspeed_config) as f:
        ds_config = json.load(f)
    self.model, optimizer, _, _ = deepspeed.initialize(
        model=self.model,
        optimizer=self.optim_wrapper.optimizer,
        model_parameters=self.model.parameters(),
        config=ds_config)
    self.optim_wrapper.optimizer = optimizer
    self.inject_base_model_methods()

First, the order of operations has to change when using DeepSpeed. There was a similar modification in your FSDP PR, so this concern may become irrelevant in the future. Also, to use DeepSpeed it seems better to rely on DeepSpeedEngine's internal optimizer logic, which means the optimizer has to be passed to deepspeed.initialize or DeepSpeedEngine.

Moreover, DeepSpeedEngine requires users to update parameters via engine.step(), which wraps optimizer.step() and the related logic. This is why we wrote a new DeepSpeedOptimWrapper class.
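
To give an idea of the shape of that class, here is a minimal sketch built on mmengine's OptimWrapper interface. Attaching the DeepSpeedEngine to the wrapper as a `model` attribute and the exact method split are assumptions for illustration, not our final implementation.

    # Minimal sketch, not the final implementation. Assumes the
    # DeepSpeedEngine returned by deepspeed.initialize() is attached to the
    # wrapper as `model` after the runner builds everything.
    from mmengine.optim import OptimWrapper


    class DeepSpeedOptimWrapper(OptimWrapper):

        def __init__(self, optimizer):
            self.optimizer = optimizer
            self._model = None  # set to the DeepSpeedEngine by the runner

        @property
        def model(self):
            if self._model is None:
                raise ValueError('The DeepSpeedEngine has not been attached yet.')
            return self._model

        @model.setter
        def model(self, engine):
            self._model = engine

        def update_params(self, loss):
            # engine.backward() and engine.step() replace the usual
            # loss.backward() / optimizer.step() calls.
            self.backward(loss)
            self.step()

        def backward(self, loss, **kwargs):
            # DeepSpeed handles loss scaling and gradient accumulation itself.
            self.model.backward(loss)

        def step(self, **kwargs):
            # engine.step() runs optimizer.step(), gradient clipping and
            # zeroing of gradients internally.
            self.model.step()

        def zero_grad(self, **kwargs):
            # Gradients are zeroed inside engine.step(), so nothing to do here.
            pass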

I think it is better to share our prototype code when it is ready rather than explaining everything in writing. We can share a link to the repo containing the code before posting the PR.