opendilab / DI-engine

OpenDILab Decision AI Engine. The Most Comprehensive Reinforcement Learning Framework B.P.
https://di-engine-docs.readthedocs.io
Apache License 2.0
3.09k stars 373 forks source link

How to introduce other optimizers into DI-engine? #813

Open weidaolee opened 4 months ago

weidaolee commented 4 months ago

I would like to incorporate optimizers from the parameterfree library into RL training. However, I've noticed that DI-engine has hard-coded the optimizer in most of its RL algorithms. e.g:

I have a few questions regarding this:

  1. Why is the optimizer hard-coded in most of DI-engine's RL algorithms?

  2. What is the best practice if I really want to use a different optimizer?

  3. Why does DI-engine implement a different optimizer for each that from Pytorch instead of using a design pattern like the strategy pattern or template pattern, which would enable users to easily compose or implement their optimizers?

  4. Based on my understanding, the optimizer and the "thing" to clip are independent. Therefore, is decoupling the RL algorithm and optimizer feasible?

Feel free to correct any misunderstandings or gaps in my knowledge. If you have plans to refactor this part, I am willing to contribute code.

PaParaZz1 commented 4 months ago

Thanks for your attention about DI-engine. Now I will give some basic explanations about your questions:

  1. Why is the optimizer hard-coded in most of DI-engine's RL algorithms?

In most RL problems, the AdamW optimizer with LambdaLRScheduler is enough to acquire a group of excellent hyper-parameters. This is often no need to adjust this part frequently like other machine learning tasks. On the other hand, some RL algorithms like DDPG often need multiple optimizers rather than a single optimizer in algorithms like DQN, thus it is not easy to abstract a unified optimizer interface for all the policies.

  1. What is the best practice if I really want to use a different optimizer?

If you want to use different optimizer, you can indeed modify the relevant policy file to suit your specific requirements. The DI-engine framework is designed with flexibility in mind, allowing users to customize the model, policy, and env modules according to their needs. Due the complexity and rapid development of RL area, it is difficult to abstract a fixed algorithm configurations and implementation. Thus, we implement this policy files like a kind of template for different users, and we hope these files could be easy-to-hack.

3 & 4

There are indeed some kinds of coding design patterns about how to elegantly compose different modules. However, in the early stage of DI-engine, we thought the optimizer is important to hack in the large-scale RL training such as DI-star. In this scenario, we modified the optimizer in-place to implement some special gradient clipping operations with little overhead of GPU memory. And this historical implementation is left to today's version. From the viewpoint of most classical RL algorithms, I think it is feasible to decoupling the RL algorithm and optimizer, but it needs some detailed design to figure out the problems I mentioned above. If interested, we can provide the corresponding support about the refactor plan.

weidaolee commented 4 months ago

Hello, thank you so much for your explanation. This is my imagination of this refactoring:

The TemplateOptimizer has a template method called step. The before_step serves as a hook method that can be deferredly implemented by subclasses. We can use __getattr__ to delegate methods and attributes to self.optimizer so that TemplateOptimizer can act as a torch.optim.Optimizer.

With this implementation, we can separate the clipping and ignoring method from the optimizers. Then, we can also separate the RL algorithms from the optimizers.

However, after carefully studying the implementation of DI-engine, I feel that refactoring this part does introduce a lot of complexity. If maintenance and modification costs outweigh the benefits, perhaps maintaining the existing implementation is a better choice.

I still provide a sketch of this idea. If you evaluate it and think it is not necessary for any reason, you can just close this issue. Alternatively, if you think this refactoring is valuable, I'd be willing to discuss it further.

Thanks again for your reply. By the way, your code is very good and it makes me feel happy physically and mentally while reading it.

class TemplateOptimizer(torch.optim.Optimizer):
    def __init__(self, optimizer: torch.optim.Optimizer):
        self.optimizer = optimizer
        # Initialize the base Optimizer with the parameters of the passed optimizer
        super(TemplateOptimizer, self).__init__(
            optimizer.param_groups, optimizer.defaults
        )

    def before_step(self):
        # This method can be overridden by subclasses if needed
        ...

    def step(self, closure=None):
        self.before_step()
        self.optimizer.step(closure)

    def __getattr__(self, name):
        return getattr(self.optimizer, name)

class DingOptimizer(TemplateOptimizer):
    def __init__(
        self,
        optimizer: torch.optim.Optimizer,
        optim_type: str = "",
        grad_clip_type: str = None,
        clip_value: Union[float, None] = None,
        clip_coef: float = 5,
        clip_norm_type: float = 2.0,
        clip_momentum_timestep: int = 100,
        grad_norm_type: str = None,
        grad_ignore_type: str = None,
        ignore_value: Union[float, None] = None,
        ignore_coef: float = 5,
        ignore_norm_type: float = 2.0,
        ignore_momentum_timestep: int = 100,
    ):
        super(DingOptimizer, self).__init__(optimizer)
        self.optim_type = optim_type
        self.grad_clip_type = grad_clip_type
        self.clip_value = clip_value
        self.clip_coef = clip_coef
        self.clip_norm_type = clip_norm_type
        self.clip_momentum_timestep = clip_momentum_timestep
        self.grad_norm_type = grad_norm_type
        self.grad_ignore_type = grad_ignore_type
        self.ignore_value = ignore_value
        self.ignore_coef = ignore_coef
        self.ignore_norm_type = ignore_norm_type
        self.ignore_momentum_timestep = ignore_momentum_timestep

    def clipping(self):
        ...
    def ignoring(self):
        ...

    @override
    def before_step(self):
        self.clipping()
        self.ignoring()