Open hyLiu1994 opened 7 months ago
Hello, thank you for your feedback. Our repository currently includes an open-source implementation similar to SampledMuZero. It is the only such example available, because the original authors did not release their source code. Consequently, our implementation may differ from the original in network architecture, loss functions, hyperparameters, and training procedure. These differences may be among the reasons for the suboptimal performance and training instability of our SampledEfficientZero in continuous action spaces such as MuJoCo. A robust and stable open-source implementation of SampledMuZero would be highly valuable to the community and warrants further investigation. We plan to look into this more deeply and will post updates here. Thank you again for your input and patience.
Thank you for the detailed response.
I will try to optimize this as well.
If I reach any conclusions, I will share them with you.
Hello, we have successfully implemented SampledMuZero and SampledUniZero in this pull request, and have also optimized the previous SampledEfficientZero. Currently, all three algorithms can reliably achieve near-optimal returns within 200k environment steps in the LunarLander and BipedalWalker environments. We encourage you to test them locally.
In DMC (DeepMind Control Suite), we have also achieved near-optimal returns within approximately 500k environment steps in the Cartpole-Swingup and Walker-Walk environments (state input). Performance in other DMC environments is still being tuned. We will keep you updated on any relevant progress. Thank you for your patience.
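For anyone checking these step budgets locally, here is a minimal sketch of one way to measure "steps to near-optimal return" across seeds. It is not part of the repository: the function names, the `(env_step, eval_return)` curve format, the smoothing window, and the threshold are all assumptions for illustration, and the numbers in the usage section are placeholders rather than real results.

```python
from statistics import mean, stdev
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical curve format: a list of (env_step, eval_return) pairs,
# ordered by env_step, exported from your own training logs.
Curve = Sequence[Tuple[int, float]]


def steps_to_threshold(curve: Curve, threshold: float, window: int = 3) -> Optional[int]:
    """Return the first env step at which the mean of the last `window`
    evaluation returns reaches `threshold`, or None if it never does."""
    returns = []
    for env_step, eval_return in curve:
        returns.append(eval_return)
        if len(returns) >= window and mean(returns[-window:]) >= threshold:
            return env_step
    return None


def summarize(curves_by_seed: Dict[int, Curve], threshold: float, window: int = 3) -> dict:
    """Aggregate steps-to-threshold over seeds to expose run-to-run variance."""
    per_seed = {
        seed: steps_to_threshold(curve, threshold, window)
        for seed, curve in curves_by_seed.items()
    }
    hits = [s for s in per_seed.values() if s is not None]
    return {
        "per_seed": per_seed,
        "num_reached": len(hits),
        "mean_steps": mean(hits) if hits else None,
        "std_steps": stdev(hits) if len(hits) > 1 else None,
    }


if __name__ == "__main__":
    # Placeholder data only; replace with curves from your own runs.
    example = {
        0: [(50_000, 10.0), (100_000, 200.0), (150_000, 210.0), (200_000, 220.0)],
        1: [(50_000, 0.0), (100_000, 50.0), (150_000, 60.0), (200_000, 70.0)],
    }
    print(summarize(example, threshold=200.0, window=2))
```

Reporting the per-seed values alongside the mean makes it easier to tell an unstable configuration (some seeds never reach the threshold) from a uniformly slow one.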
I attempted to replicate the SampledEfficientZero results for the Hopper-v3 environment shown in the benchmark section of the README, using the default configuration file (zoo/mujoco/config/mujoco_sampled_efficientzero_config.py). However, I encountered two main issues during the process:
Could you suggest possible reasons for these discrepancies, and any changes that would help me reproduce results consistent with those presented in the benchmark?