About Replicating SampledZero Performance in the Hopper-V3 Environment

opendilab / LightZero

[NeurIPS 2023 Spotlight] LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios (awesome MCTS)

https://huggingface.co/spaces/OpenDILabCommunity/ZeroPal

Apache License 2.0


About Replicating SampledZero Performance in the Hopper-V3 Environment #210

Open hyLiu1994 opened 7 months ago

hyLiu1994 commented 7 months ago

I tried to reproduce the SampledEfficientZero results shown for the Hopper-V3 environment in the benchmark section of the README, using the default configuration file (zoo/mujoco/config/mujoco_sampled_efficientzero_config.py). I ran into two main issues:

1. I was unable to achieve the results illustrated by the blue line in the following graph.

2. I observed significant discrepancies between two runs that used the identical configuration file, as depicted in the graph below. Both the blue and gray lines are outcomes obtained from the same configuration file.

Could you suggest possible reasons for these discrepancies, and any changes that would give consistent results close to those presented in the benchmark?
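One way to put a number on the run-to-run gap described above is to repeat the same configuration under several random seeds and compare a new run against the spread. The following is only a minimal sketch under stated assumptions: the helper names `summarize_runs` and `within_seed_noise` are hypothetical, the returns are made-up illustration values rather than measured results, and the code does not call any LightZero API.

```python
import statistics
from typing import Dict, List


def summarize_runs(final_returns: Dict[int, float]) -> Dict[str, float]:
    """Summarize final evaluation returns keyed by random seed."""
    values: List[float] = list(final_returns.values())
    mean = statistics.mean(values)
    # The sample standard deviation needs at least two runs.
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return {
        "n": len(values),
        "mean": mean,
        "std": std,
        "min": min(values),
        "max": max(values),
    }


def within_seed_noise(new_return: float, summary: Dict[str, float], k: float = 2.0) -> bool:
    """Return True if new_return lies within k standard deviations of the mean."""
    return abs(new_return - summary["mean"]) <= k * summary["std"]


# Hypothetical numbers, for illustration only (not measured results).
example = {0: 1000.0, 1: 1400.0, 2: 1200.0}
summary = summarize_runs(example)
print(summary)
print(within_seed_noise(1500.0, summary))
```

With only two or three seeds the standard deviation is itself a rough estimate, so a check like this is better read as a sanity test than as a significance test.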

puyuan1996 commented 7 months ago

Hello, thank you for your feedback. Our repository currently includes an open-source implementation similar to SampledMuZero. It is the only example available, since the original authors did not release their source code. Consequently, our implementation may differ from the original in aspects such as network architecture, loss functions, hyperparameters, and training procedure. These differences may be one reason why our SampledEfficientZero trains unstably and underperforms in continuous action spaces such as MuJoCo. A robust and stable open-source implementation of SampledMuZero would be highly valuable to the community and warrants further investigation. We plan to look into this more deeply and will post updates here. Thank you once again for your valuable input and patience.

hyLiu1994 commented 6 months ago

Thank you for the detailed response.

I will try to optimize this on my side.

If I reach any conclusions, I will share them with you.

puyuan1996 commented 2 months ago

Hello, we have successfully implemented SampledMuZero and SampledUniZero in this pull request, and have also optimized the previous SampledEfficientZero. Currently, all three algorithms can reliably achieve near-optimal returns within 200k environment steps in the LunarLander and BipedalWalker environments. We encourage you to test them locally.

In DMC (DeepMind Control Suite), we have also achieved near-optimal returns within approximately 500k environment steps in the Cartpole-Swingup and Walker-Walk environments (with state input). Performance in other DMC environments is still under active tuning. We will keep you updated as the work progresses. Thank you for your patience.
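For anyone testing locally, a claim of the form "reaches a target return within N environment steps" can be checked against a logged evaluation curve. The sketch below is a hedged illustration only: `moving_average` and `steps_to_threshold` are hypothetical helpers, the curve values and the threshold are invented for the example, and the choice of target return for each environment is left to the reader.

```python
from typing import List, Optional, Sequence, Tuple


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Trailing moving average; early entries average over fewer points."""
    out: List[float] = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that just left the trailing window.
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out


def steps_to_threshold(
    curve: Sequence[Tuple[int, float]], threshold: float, window: int = 3
) -> Optional[int]:
    """Return the first env step where the smoothed return reaches threshold, else None."""
    steps = [s for s, _ in curve]
    smoothed = moving_average([r for _, r in curve], window)
    for step, value in zip(steps, smoothed):
        if value >= threshold:
            return step
    return None


# Hypothetical (env_step, eval_return) pairs, for illustration only.
curve = [(0, 0.0), (50_000, 100.0), (100_000, 200.0), (150_000, 260.0), (200_000, 280.0)]
print(steps_to_threshold(curve, threshold=240.0, window=2))
```

Smoothing before thresholding avoids counting a single lucky evaluation as convergence; the window size is a judgment call.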