opendilab / DI-engine

OpenDILab Decision AI Engine. The Most Comprehensive Reinforcement Learning Framework.
https://di-engine-docs.readthedocs.io
Apache License 2.0

How to create customized model (pointer network) #430

Closed cpwan closed 2 years ago

cpwan commented 2 years ago

Hi there, I am new to DI-engine. I am trying to implement a pointer network for my own environment. The most relevant resource I can find is the docs about the RNN here. It seems that I can treat the pointer network as a kind of RNN and wrap each decoding output as a hidden_state. But the encoder (also an LSTM) output is also used in every decoding step. Can I wrap it as another hidden_state? I noticed from Slack that a similar architecture has been implemented in DI-star. Can you give me directions on how to make it work? Also, I am not sure which part of the code I should modify. It would be great if you could point me to the docs/tutorial on customizing models.
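For context, here is a minimal, self-contained PyTorch sketch of the architecture being described: an encoder LSTM whose outputs are attended over at every decoding step, in addition to the decoder's own recurrent state. The class and variable names below are purely illustrative and are not DI-engine API; masking of already-visited inputs is omitted for brevity.

```python
import torch
import torch.nn as nn


class PointerNet(nn.Module):
    """Illustrative pointer network: the encoder outputs are re-used (via
    attention) at every decoding step, alongside the decoder's own
    recurrent hidden state. Visited-input masking is omitted for brevity."""

    def __init__(self, input_dim: int = 2, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.decoder_cell = nn.LSTMCell(input_dim, hidden_dim)
        self.w_enc = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.w_dec = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, input_dim), e.g. N city coordinates
        enc_out, (h, c) = self.encoder(x)            # enc_out: (B, N, H)
        h, c = h.squeeze(0), c.squeeze(0)            # decoder initial state
        dec_in = x.new_zeros(x.size(0), x.size(-1))  # "start" token
        step_logits = []
        for _ in range(x.size(1)):
            h, c = self.decoder_cell(dec_in, (h, c))
            # attention over *all* encoder outputs at every decoding step
            scores = self.v(
                torch.tanh(self.w_enc(enc_out) + self.w_dec(h).unsqueeze(1))
            ).squeeze(-1)                            # (B, N)
            step_logits.append(scores)
            idx = scores.argmax(dim=-1)              # greedy pick; sample during training
            dec_in = x[torch.arange(x.size(0)), idx]
        return torch.stack(step_logits, dim=1)       # (B, N, N) pointer logits
```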

PaParaZz1 commented 2 years ago

DI-engine doesn't set strict rules for the model, but there are some conventions between the model and the policy. For example, if you use the PPO policy, you must define an actor and a critic in your model and implement functions like compute_actor and compute_critic. I think the simplest way to customize your own model is to imitate and modify the default model of a policy, like vac for PPO.
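A rough skeleton of what that convention looks like, assuming the dispatch pattern described above (check ding.model.template.vac for the actual default implementation; layer sizes and names here are placeholders):

```python
import torch
import torch.nn as nn


class MyVAC(nn.Module):
    """Sketch of a custom actor-critic model following the convention above:
    the PPO policy calls compute_actor / compute_critic through forward(x, mode)."""

    mode = ['compute_actor', 'compute_critic', 'compute_actor_critic']

    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.actor_head = nn.Linear(hidden_dim, action_dim)
        self.critic_head = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor, mode: str) -> dict:
        assert mode in self.mode, mode
        return getattr(self, mode)(x)

    def compute_actor(self, x: torch.Tensor) -> dict:
        return {'logit': self.actor_head(self.encoder(x))}

    def compute_critic(self, x: torch.Tensor) -> dict:
        return {'value': self.critic_head(self.encoder(x)).squeeze(-1)}

    def compute_actor_critic(self, x: torch.Tensor) -> dict:
        feat = self.encoder(x)
        return {
            'logit': self.actor_head(feat),
            'value': self.critic_head(feat).squeeze(-1),
        }
```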

As for your case, why do you want to use a pointer network? In DI-star, we only use it to output selected_units, and the hidden state is maintained inside the model. Which RL policy do you want to use? Please provide more information.

cpwan commented 2 years ago

Thanks for your reply. Let me check that out.

As for the motivation, I want to use the pointer network to solve the Traveling Salesman Problem (which has a dynamic number of inputs). I have seen works that train the pointer network with the classical REINFORCE algorithm, and I would like to experiment with other, more advanced RL policies.

PaParaZz1 commented 2 years ago

OK, I get your point. I think it is important to model a proper MDP if you want to use any RL algorithm. Note that the original pointer network paper solves TSP with supervised learning, and different MDP formulations will make an essential difference to your final implementation and performance.

For example, if you input a state and want to output the entire path in a single step, you can keep the hidden state implementation inside your network model. But if you want to output the next location at each step, you need to modify the policy to maintain the hidden state, like the difference between R2D2 and DQN.
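To make the first option concrete, here is a rough sketch (assuming the illustrative PointerNet from earlier in this thread) of a model whose forward call consumes the whole problem instance and runs the recurrent decoder internally, so the policy stays stateless; the names and dict keys are assumptions for illustration, not a fixed DI-engine interface.

```python
import torch
import torch.nn as nn


class FullPathModel(nn.Module):
    """Option 1 sketch: one forward call consumes all cities and internally runs
    the recurrent decoder for N steps, so no hidden state ever leaves the model
    and the policy can stay stateless (DQN/PPO-like)."""

    def __init__(self, pointer_net: nn.Module):
        super().__init__()
        self.pointer_net = pointer_net

    def forward(self, cities: torch.Tensor) -> dict:
        # cities: (B, N, 2); logits: (B, N, N), one distribution per decode step
        logits = self.pointer_net(cities)
        path = logits.argmax(dim=-1)  # (B, N) greedy tour; sample during training
        return {'logit': logits, 'action': path}


# Option 2 (output the next location per env step) would instead return a single
# (B, N) logit plus the updated (h, c), and the *policy* (not the model) would
# store and re-feed that state between calls, as R2D2 does with its RNN state.
```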