timoklein / alphazero-gym

AlphaZero for continuous control tasks
MIT License
23 stars · 3 forks

Should entropy bonus be also calculated during planning? #10

Closed dbsxdbsx closed 1 year ago

dbsxdbsx commented 1 year ago

I recently finished reading the code of this repo, and I noticed that the SAC-style entropy bonus on the state value is only added at the final output step.

This made me wonder: if the goal is to find the action with the best environment reward plus maximum entropy, why not also calculate the entropy bonus during planning?
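
To make concrete what I mean by "entropy bonus", here is a minimal sketch (my own illustration, not this repo's actual code) of a SAC-style soft value estimate, where the entropy term only enters when the value of the final/leaf state is computed:

```python
# Minimal sketch of a SAC-style soft value estimate (my own illustration,
# not this repo's actual code): the entropy bonus -alpha * log pi(a|s)
# only appears when the (final) state value is estimated.
import numpy as np

def soft_value_target(q_values: np.ndarray, log_probs: np.ndarray, alpha: float) -> float:
    """V(s) ~ E_a[Q(s, a) - alpha * log pi(a|s)], estimated over sampled actions."""
    return float(np.mean(q_values - alpha * log_probs))

# Toy usage with three sampled actions at the final state
print(soft_value_target(np.array([1.2, 0.8, 1.0]), np.array([-1.1, -0.7, -0.9]), alpha=0.2))
```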

timoklein commented 1 year ago

Hi, the entropy bonus is only needed during learning updates: You want to prevent the distribution of your neural network from collapsing. During planning, you don't do any learning updates so there's no need to calculate the entropy.

The entropy bonus is implicitly used in the planning stage by having a higher entropy (i.e. more spread out) distribution from which actions are sampled.

The exploration/exploitation trade-off is handled through progressive widening for the continuous MCTS. If the criterion for adding a new action is met, you explore (by sampling a new action). This action is more diverse due to the entropy bonus applied during network training. If the criterion is not met, you exploit.
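
Concretely, the widening criterion looks roughly like this (standard formulation with placeholder parameter names, not copied verbatim from this repo):

```python
# Rough sketch of a progressive-widening criterion (standard formulation,
# parameter names are placeholders): sample a new action only while the
# number of child actions grows slower than N(s)^alpha.
def should_add_action(num_children: int, num_visits: int,
                      c_pw: float = 1.0, alpha: float = 0.5) -> bool:
    """Explore (sample a new action) if |children(s)| < c_pw * N(s)^alpha."""
    return num_children < c_pw * num_visits ** alpha

# Toy usage: with alpha=0.5 and 16 visits, up to 4 children are allowed
print(should_add_action(num_children=3, num_visits=16))  # True  -> explore
print(should_add_action(num_children=4, num_visits=16))  # False -> exploit
```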

Does that help answer your question?

dbsxdbsx commented 1 year ago

@timoklein, thanks for your answer. I just realized that my question may be a small piece of an even bigger question. Is your thesis published and accessible to read? I'd like to know more details.

timoklein commented 1 year ago

Yeah sure, go ahead and read it: https://raw.githubusercontent.com/timoklein/ma_thesis/main/THESIS.pdf

dbsxdbsx commented 1 year ago

Thanks for sharing your thesis. I've now read it. From an engineering perspective it is quite good, since there are many experiments, especially in the part where you discuss how to do the selection step for MCTS.

Now I'd like to explain what I meant by "a small piece of an even bigger question", which is what I am confused about.

(NOTE: the following is just my thinking out loud, not a list of questions. But if you have any ideas, please don't hesitate to share them, thanks.)

I want to build an RL algorithm that is not only robust but also general enough to serve as a benchmark; by "general" I mean that it can handle as many environments as possible.

From the perspective of the state space, the AlphaZero family is a good candidate, since its successor MuZero can handle not only board games like chess but also Atari games (with no explicit state-transition model). In addition, planning is a way to bridge interpretability and deep learning.

From the perspective of the action space, SAC is a good candidate, not only because it produces a stochastic (and therefore robust) policy, but also because it can be extended to handle complex action spaces (see the paper; I also joined a discussion on this topic).

So it is natural to come up with the idea of combining AlphaZero with SAC, for which your project and thesis gave one possible shot (thanks again~).

And I think that when talking about combining AlphaZero with SAC, what really matters is what the entropy exactly represents within an MCTS-based RL algorithm; that is why I posted this issue. Somehow your original answer didn't convince me completely, so I read some papers. Here is the story.

The first paper to inject maximum entropy into the objective was Soft Q-Learning (SQL), followed by SAC. Mathematically, I am not clear about the relationship between the two, but SAC seems to be the more popular descendant, and SQL can produce a multi-modal policy while SAC can't (if I am wrong, please tell me).
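
As far as I can tell (please correct me if I'm wrong), both start from the same maximum-entropy objective,

$$
J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],
$$

with SQL representing the policy implicitly as an energy-based distribution over the soft Q-function (which is what allows multi-modality), while SAC trains an explicit actor against it.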

And in the context of MCTS, there is MENTS, which discusses what maximum entropy should mean within MCTS, both during the selection phase of planning and for the final action selection. Then RENTS and TENTS extend it to other kinds of entropy. Finally, ANTS extends it again, with the feature that the temperature can be adjusted even within a single episode (interestingly, these approaches seem to have a strong relation to SQL, but not to SAC). All of them make entropy play a more active role throughout the four phases of MCTS.
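
As far as I understand them, the shared idea is to replace UCT-style selection with sampling from a softmax over soft values plus a decaying uniform exploration term; a much simplified sketch (not a faithful reimplementation of any of these papers) would be:

```python
# Much simplified sketch of the soft (Boltzmann) in-tree selection these
# papers build on -- NOT a faithful reimplementation of MENTS/RENTS/TENTS/ANTS,
# whose exact mixing schedules and backups differ.
import numpy as np

def soft_select(q_soft: np.ndarray, visits: int, tau: float = 1.0,
                eps: float = 0.1, rng=np.random.default_rng()) -> int:
    n_actions = len(q_soft)
    logits = q_soft / tau
    boltzmann = np.exp(logits - logits.max())
    boltzmann /= boltzmann.sum()
    lam = min(1.0, eps * n_actions / np.log(visits + 2))  # decaying exploration weight
    probs = (1.0 - lam) * boltzmann + lam / n_actions
    return int(rng.choice(n_actions, p=probs))

print(soft_select(np.array([1.0, 0.5, 0.2]), visits=10))
```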

OK, so when it comes to entropy with MCTS, I can't help but ask myself: if I want to use (maximum) entropy properly with MCTS, which approach should I take, yours or the one from ANTS? I also posted an issue for ANTS (https://github.com/adaptive-entropy-tree-search/ants/issues/1), but have had no feedback so far.

The above is what confuses me from the entropy perspective.

  1. If I take your approach (which I prefer because it is easy to understand), there is a side question about the modality of the policy. It is known that plain SAC (in a purely continuous action space) can't produce a multi-modal policy, so you use a GMM to make it multi-modal. The author who extended SAC to complex actions also discussed this topic in the section "Normalizing Flows and SAC" of that paper, wondering why, ever since SQL, the model to be learned is put on the left side of the KL divergence (I am wondering about this too).
  2. If I take ANTS as the backbone, things become more complex, because I am not clear on how to extend it to continuous or even more complex action spaces. On this topic, besides the progressive widening method you used, DeepMind also published a paper that extends MCTS to complex actions and even makes sampling possible for purely discrete action spaces. But there is no code, and the paper is not easy to understand (I would appreciate it if you could share some thoughts on the relation between that method and yours).

So that is it; that is why I am confused and currently at a loss.

Again, thanks for sharing.

timoklein commented 1 year ago

Thanks for the compliments.

Before I write anything further, please note that you cited quite a few papers I'm either not very familiar with or not familiar with at all. So take everything with a big grain of salt.

> So it is natural to come up with the idea of combining AlphaZero with SAC, for which your project and thesis gave one possible shot (thanks again~).

It's not my idea, it's from the A0C paper.

> The first paper to inject maximum entropy into the objective was Soft Q-Learning (SQL), followed by SAC. Mathematically, I am not clear about the relationship between the two, but SAC seems to be the more popular descendant, and SQL can produce a multi-modal policy while SAC can't (if I am wrong, please tell me).

I just glanced over the paper; it seems like SAC also uses the soft Q-learning update but embeds it into an actor-critic framework, which in practice performs better. SAC should have no problem producing multi-modal policies when using a Gaussian mixture policy. In fact, this is something the authors even implemented in their original codebase (see here).
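
Roughly something like this (a fresh PyTorch sketch, not their original code, and omitting the tanh squashing SAC usually applies to the actions):

```python
# Minimal sketch of a Gaussian-mixture policy head: the mixture makes the
# action distribution multi-modal, which a single Gaussian SAC policy is not.
# (Not the original softqlearning code; tanh squashing omitted for brevity.)
import torch
import torch.nn as nn
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

class GMMPolicy(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, n_components: int = 5):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.logits = nn.Linear(64, n_components)                 # mixture weights
        self.mu = nn.Linear(64, n_components * action_dim)        # component means
        self.log_std = nn.Linear(64, n_components * action_dim)   # component log-stds
        self.n_components, self.action_dim = n_components, action_dim

    def forward(self, state: torch.Tensor) -> MixtureSameFamily:
        h = self.trunk(state)
        mu = self.mu(h).view(-1, self.n_components, self.action_dim)
        std = self.log_std(h).view(-1, self.n_components, self.action_dim).clamp(-5.0, 2.0).exp()
        mixture = Categorical(logits=self.logits(h))
        components = Independent(Normal(mu, std), 1)
        return MixtureSameFamily(mixture, components)

policy = GMMPolicy(state_dim=3, action_dim=1)
dist = policy(torch.randn(4, 3))
actions = dist.sample()              # shape (4, 1); the mixture can be multi-modal
log_probs = dist.log_prob(actions)   # usable for the entropy bonus -alpha * log pi(a|s)
```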

> OK, so when it comes to entropy with MCTS, I can't help but ask myself: if I want to use (maximum) entropy properly with MCTS, which approach should I take, yours or the one from ANTS? I also posted an issue for ANTS (https://github.com/adaptive-entropy-tree-search/ants/issues/1), but have had no feedback so far.

I'm not familiar with any of these papers, so these are just some general thoughts. There is a whole bunch of papers extending MCTS to provide fixes for some more or less general scenario. In my personal opinion, they miss the point of what makes MCTS such a great algorithm: It's super simple, widely applicable and scales well with available compute. All of the approaches you referenced seem to increase the complexity by quite a lot. Is that really worth it to achieve marginal gains in performance? Since we already had a working MCTS implementation, we decided against it.

> If I take your approach (which I prefer because it is easy to understand), there is a side question about the modality of the policy. It is known that plain SAC (in a purely continuous action space) can't produce a multi-modal policy, so you use a GMM to make it multi-modal. The author who extended SAC to complex actions also discussed this topic in the section "Normalizing Flows and SAC" of that paper, wondering why, ever since SQL, the model to be learned is put on the left side of the KL divergence (I am wondering about this too).

I think there are a bunch of nice interpretations of why it has to be on the left-hand side on the KL-divergence wiki page.

> If I take ANTS as the backbone, things become more complex, because I am not clear on how to extend it to continuous or even more complex action spaces. On this topic, besides the progressive widening method you used, DeepMind also published a paper that extends MCTS to complex actions and even makes sampling possible for purely discrete action spaces. But there is no code, and the paper is not easy to understand (I would appreciate it if you could share some thoughts on the relation between that method and yours).

I'm not very familiar with Sampled MuZero (and I think it's an awfully written paper that overcomplicates the ideas to look more "mathy"). To me it seems like a different way of handling large or complex action spaces: instead of progressively adding actions to the tree as in progressive widening, a fixed number of candidate actions is sampled from the policy at each node and the search is restricted to those samples.
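
Very roughly something like this (my reading only; the actual algorithm also corrects the search statistics for the fact that only a sample of actions is considered):

```python
# Very rough reading of the sampled-actions idea (not the actual Sampled MuZero
# algorithm, which additionally corrects the prior/visit statistics for the
# sampling): draw K candidate actions from the policy once when a node is
# expanded and run ordinary tree search over that fixed, finite set.
import numpy as np

def sample_candidate_actions(policy_sample, state, k: int = 20,
                             rng=np.random.default_rng()):
    """policy_sample(state, rng) -> one action; returns K candidates for this node."""
    return [policy_sample(state, rng) for _ in range(k)]

# Toy usage with a dummy 2-D Gaussian "policy"
candidates = sample_candidate_actions(lambda s, rng: rng.normal(size=2), state=None, k=5)
print(len(candidates))  # the search then treats these 5 samples as the node's action set
```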

Hope that helps a little bit. Again, I'm not very familiar with quite a few of the things you posted here, so I might very well be wrong :)

dbsxdbsx commented 1 year ago

Thanks for your advice. I will digest it.