opendilab / LightZero

[NeurIPS 2023 Spotlight] LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios (awesome MCTS)

Clipping reward in Atari while using invertible transform for reward and value target #239


marintoro commented 2 weeks ago

Hello,

I see in the code that you are using the invertible transform h: x ↦ sign(x)(√(|x| + 1) - 1) + εx to scale the value and reward targets. This function was introduced by Pohlen et al. ("Observe and Look Further", 2018), and the idea was to remove reward clipping in Atari games.
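
For concreteness, here is a minimal sketch of the transform and its closed-form inverse (with ε = 0.001, the value used in the MuZero paper; the NumPy version below is illustrative, not LightZero's actual implementation):

```python
import numpy as np

EPS = 1e-3  # ε in h(x) = sign(x)(√(|x| + 1) - 1) + εx

def h(x):
    """Invertible transform that compresses large reward/value magnitudes."""
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + EPS * x

def h_inv(y):
    """Closed-form inverse of h."""
    return np.sign(y) * (
        ((np.sqrt(1.0 + 4.0 * EPS * (np.abs(y) + 1.0 + EPS)) - 1.0) / (2.0 * EPS)) ** 2
        - 1.0
    )

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
assert np.allclose(h_inv(h(x)), x)  # the round trip recovers the raw values
```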

However, I see that in the Atari env (atari_lightzero_env.py), the function create_collector_env_cfg sets clip_rewards to True. Is that intended, or is it a bug?
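
For reference, "clipping rewards" in Atari pipelines usually means the standard sign-clipping wrapper, so clip_rewards=True presumably enables something like the sketch below (the actual wrapper in atari_lightzero_env.py may differ):

```python
import gym
import numpy as np

class ClipReward(gym.RewardWrapper):
    """Replace each raw reward with its sign, i.e. one of {-1, 0, +1}."""

    def reward(self, reward):
        return float(np.sign(reward))
```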

puyuan1996 commented 2 weeks ago

Hello, we reviewed the MuZero and EfficientZero papers as well as the EfficientZero source code, and found no mention of reward clipping; perhaps they indeed did not employ this technique. Additionally, I consulted the paper by Pohlen et al., and as they mention, reward clipping can change the optimal policy, since clipping makes a +100 reward indistinguishable from a +1 reward. We will be testing the performance without reward clipping shortly. Thank you again for your suggestion. If you have any other questions, feel free to discuss them at any time.
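
A toy illustration (mine, not from the thread) of how sign-clipping can reverse which behavior looks better:

```python
import numpy as np

# Two trajectories of raw Atari-style rewards.
many_small = np.array([1.0, 1.0, 1.0])  # three small rewards
one_big = np.array([0.0, 0.0, 100.0])   # a single large reward

print(many_small.sum(), one_big.sum())                    # raw returns: 3.0 vs 100.0
print(np.sign(many_small).sum(), np.sign(one_big).sum())  # clipped:     3.0 vs 1.0
```

Under clipping, the first trajectory appears better even though its raw return is far lower, so a policy that is optimal for clipped rewards need not be optimal for the true rewards.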

puyuan1996 commented 6 days ago

Hello, our initial experimental results and analysis can be found here. Best regards.