google-deepmind / dm_construction

Apache License 2.0

Asking for DQN-MCTS baseline code #2

Open WenyuHan-LiNa opened 3 years ago

WenyuHan-LiNa commented 3 years ago

Hello authors, I am very interested in your work. I am working on a DRL-related project, and I am planning to add a DQN with MCTS to it, as you did. Would you please share the code or some implementation details about the MCTS baseline in your paper? Thank you in advance!

jhamrick commented 3 years ago

Hi @WenyuHan-LiNa, thanks for your interest in our work!

Unfortunately I am not able to share the code for our MCTS implementation, but if you just want to do a standard DQN plus MCTS, that should be fairly straightforward to set up if you have (1) a standard DQN implementation and (2) a standard MCTS implementation, both of which you should be able to find multiple examples of elsewhere online. The main thing you will need to do is to modify the MCTS code to call your neural network at each node to estimate the Q-values, and then to use the action returned by the search rather than the one corresponding to the maximal Q-value. We tried to provide a lot of details in the appendices of both https://arxiv.org/pdf/1904.03177.pdf (see Appendix E) and https://arxiv.org/pdf/1912.02807.pdf (see Appendix A and in particular Algorithm A.1).
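
For illustration, here is a minimal sketch of such a Q-guided MCTS. The `model.step(state, action)` and `q_net(state)` interfaces are hypothetical placeholders, and this is not the implementation from the paper; the key points are that leaf nodes are evaluated with the Q-network and that the final action comes from the search statistics rather than from `argmax Q`:

```python
import math
from collections import defaultdict


class QGuidedMCTS:
    """UCT-style search where nodes are evaluated with a Q-network (sketch)."""

    def __init__(self, model, q_net, num_actions, c_uct=1.0, gamma=0.99):
        self.model = model            # assumed: model.step(state, a) -> (next_state, reward, done)
        self.q_net = q_net            # assumed: q_net(state) -> sequence of Q-values, one per action
        self.num_actions = num_actions
        self.c_uct = c_uct
        self.gamma = gamma
        self.N = defaultdict(lambda: [0] * num_actions)    # visit counts per (state, action)
        self.W = defaultdict(lambda: [0.0] * num_actions)  # total value per (state, action)

    def search(self, root_state, num_simulations=50, max_depth=10):
        for _ in range(num_simulations):
            self._simulate(root_state, depth=0, max_depth=max_depth)
        # Act on the search result (most-visited action), not on argmax of the raw Q-values.
        counts = self.N[self._key(root_state)]
        return max(range(self.num_actions), key=lambda a: counts[a])

    def _simulate(self, state, depth, max_depth):
        key = self._key(state)
        if depth == max_depth:
            return max(self.q_net(state))      # bootstrap with the Q-network at the horizon
        if sum(self.N[key]) == 0:              # unvisited node: initialise with Q-network estimates
            q = list(self.q_net(state))
            self.W[key] = q[:]
            self.N[key] = [1] * self.num_actions
            return max(q)
        a = self._select_uct(key)
        next_state, reward, done = self.model.step(state, a)
        value = reward if done else reward + self.gamma * self._simulate(
            next_state, depth + 1, max_depth)
        self.N[key][a] += 1
        self.W[key][a] += value
        return value

    def _select_uct(self, key):
        total = sum(self.N[key])

        def uct(a):
            n = self.N[key][a]
            mean_q = self.W[key][a] / max(n, 1)
            return mean_q + self.c_uct * math.sqrt(math.log(total + 1) / (n + 1))

        return max(range(self.num_actions), key=uct)

    @staticmethod
    def _key(state):
        # Assumes states can be made hashable; adapt to your own state representation.
        return state if isinstance(state, tuple) else tuple(state)
```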

If you have any specific questions I am happy to try to clarify!

WenyuHan-LiNa commented 3 years ago

Hi Jessica,

Thank you for your reply. I will try to implement it myself. The information you provided is very useful to me.

Best regards, Wenyu Han

On Thu, May 27, 2021 at 6:49 AM Jessica B. Hamrick @.***> wrote:

Hi @WenyuHan-LiNa https://github.com/WenyuHan-LiNa, thanks for your interest in our work!

Unfortunately I am not able to share the code for our MCTS implementation, but if you just want to do a standard DQN plus MCTS, that should be fairly straightforward to set up if you have (1) a standard DQN implementation and (2) a standard MCTS implementation, both of which you should be able to find multiple examples of elsewhere online. The main thing you will need to do is to modify the MCTS code to call your neural network at each node to estimate the Q-values, and then to use the action returned by the search rather than the one corresponding to the maximal Q-value. We tried to provide a lot of details in the appendices of both https://arxiv.org/pdf/1904.03177.pdf (see Appendix E) and https://arxiv.org/pdf/1912.02807.pdf (see Appendix A and in particular Algorithm A.1).

If you have any specific questions I am happy to try to clarify!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/deepmind/dm_construction/issues/2#issuecomment-849532967, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANKETMWFJ2OFGWUUKWCGIBLTPYPTNANCNFSM45TPQCLQ .

WenyuHan-LiNa commented 3 years ago

Hi Jessica,

Did you perform rollouts in MCTS, or just assign a value to each node based on the Q-network? In standard MCTS, a node's value is assigned once the rollout finishes at each iteration, so this point confused me.

Best, Wenyu Han

jhamrick commented 3 years ago

Hi Wenyu, we indeed just used the value from the Q-network and did not perform rollouts during MCTS. This is similar to how it's done by other neurally guided forms of MCTS like AlphaZero.
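
Concretely, the leaf-evaluation step that would normally run a random rollout instead just queries the network. A tiny sketch, assuming a hypothetical `q_net(state)` callable returning one value per action:

```python
def evaluate_leaf(state, q_net, legal_actions):
    # Instead of simulating a random rollout to a terminal state,
    # bootstrap the node value directly from the Q-network's estimates.
    q_values = q_net(state)
    return max(q_values[a] for a in legal_actions)
```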

WenyuHan-LiNa commented 3 years ago

Hi Jessica,

Thank you for sharing this information. For training the DQN, we still follow the standard procedure (i.e., sample from the replay buffer and update the Q-network), right?

Best, Wenyu Han

jhamrick commented 3 years ago

Yes, that's how we did it in the original construction paper ("Structured Agents for Physical Construction"). However, we also later found using SAVE (https://arxiv.org/abs/1912.02807) worked much better than just pure Q-learning. SAVE works by also adding an additional "amortization loss" to the standard Q-learning loss. It's probably easier to start with Q-learning but I'd encourage you to try the amortization loss too (it's a pretty straightforward change---you just need to store the Q-values computed by MCTS into the replay buffer).
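
As a rough sketch of how the two terms could be combined, assuming the replay buffer additionally stores the Q-values produced by MCTS for each stored state: the amortization term is shown here as a simple squared-error regression toward the search Q-values, which may differ from the exact form used in the SAVE paper (see the paper for the precise loss). All names below are hypothetical:

```python
import numpy as np


def save_style_loss(q_net_values, td_targets, search_q_values, actions, beta=1.0):
    """Standard Q-learning loss plus an amortization term toward the MCTS Q-values.

    q_net_values:    (batch, num_actions) Q(s, a) from the network
    td_targets:      (batch,) bootstrapped targets r + gamma * max_a' Q_target(s', a')
    search_q_values: (batch, num_actions) Q-values computed by MCTS, stored in the replay buffer
    actions:         (batch,) actions actually taken
    beta:            weight on the amortization term (hypothetical hyperparameter)
    """
    batch = np.arange(len(actions))
    td_error = q_net_values[batch, actions] - td_targets
    q_learning_loss = np.mean(td_error ** 2)
    # Amortization loss: pull the network's Q-values toward the search estimates.
    amortization_loss = np.mean((q_net_values - search_q_values) ** 2)
    return q_learning_loss + beta * amortization_loss
```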

WenyuHan-LiNa commented 3 years ago

Hi Jessica,

Thank you for this advice. I will start with the standard DQN and try the amortization loss later. Very useful suggestion.

Thanks a lot, Wenyu Han

WenyuHan-LiNa commented 3 years ago

Hi Jessica,

Hope this message finds you well! I have a few follow-up questions. Is it possible to use the same MCTS-with-DQN approach from the construction paper ("Structured Agents for Physical Construction") to solve POMDP problems? I am working on a POMDP and want an MCTS-based method as my baseline. However, when implementing MCTS for my problem, I found that MCTS requires a transition model that maps one state to the next, whereas in a POMDP the agent only has access to observations. Does this mean I cannot use MCTS, and therefore cannot use MCTS with DQN, for this problem? Do I understand correctly? Do you have any suggestions for how to combine MCTS with Q-learning to solve a POMDP?

Best, Wenyu Han
