Closed YYangZiXin closed 3 years ago
In tree/UCT.py, line 185 forces done to False.
Hi Zixin,
Thanks for the feedback. I think you're right: the code should use the done signal provided in line 179. However, I don't think this will change the actual behavior of the agent, since it will only perform one extra step with reward 0 (generated in line 192) before the while loop in line 189 terminates.
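For illustration, here is a rough sketch of that kind of rollout loop, assuming a Gym-style env.step interface; the function and variable names are placeholders, not the repository's actual code.

```python
# Minimal sketch (not the repository's code): a rollout loop that uses the
# done flag returned by the environment instead of forcing done = False.

def rollout(env, policy, max_depth):
    """Roll out `policy` from the env's current state until done or max_depth."""
    state = env.state          # assumes the desired checkpoint is already loaded
    total_reward, done, step = 0.0, False, 0
    while not done and step < max_depth:
        action = policy(state)
        # Respect the env's done signal; forcing it to False would only add
        # one extra zero-reward step before the loop terminates anyway.
        state, reward, done, _ = env.step(action)
        total_reward += reward
        step += 1
    return total_reward
```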
Another question: at line 114 of tree/UCT.py, the loop seems to run trajectories without calling env.reset, and the function called in the loop is simulate_single_step. "Simulation" starts from a newly expanded node, right?
Each call to simulate_single_step performs one rollout (i.e., selection, expansion, simulation, and backpropagation). These calls iteratively grow the search tree and accumulate the information collected by the simulations.
Simulation does not necessarily start from a newly expanded node; in certain cases (e.g., when the width and depth thresholds of the search tree are met), a simulation can begin from an existing node.
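As a sketch of what one such rollout can look like (the Node class, thresholds, and the random placeholder return below are assumptions for illustration, not the repository's implementation):

```python
import math
import random

class Node:
    """Toy search-tree node used only for this sketch."""
    def __init__(self, parent=None, depth=0):
        self.parent, self.depth = parent, depth
        self.children = []
        self.visits, self.value = 0, 0.0

def uct_select(node, max_width, c=1.4):
    # Selection: follow the UCT tree policy while the node is fully expanded.
    while len(node.children) >= max_width:
        node = max(node.children,
                   key=lambda n: n.value / (n.visits + 1e-8)
                   + c * math.sqrt(math.log(node.visits + 1) / (n.visits + 1e-8)))
    return node

def simulate_single_step(root, max_width=3, max_depth=5):
    node = uct_select(root, max_width)
    # Expansion: add a child unless the depth threshold is met, in which case
    # the simulation begins from the existing node instead.
    if node.depth < max_depth:
        child = Node(parent=node, depth=node.depth + 1)
        node.children.append(child)
        node = child
    ret = random.random()          # Simulation: placeholder for an env rollout.
    while node is not None:        # Backpropagation: push the return to the root.
        node.visits += 1
        node.value += ret
        node = node.parent

root = Node()
for _ in range(50):                # repeated calls iteratively grow the tree
    simulate_single_step(root)
print(root.visits, len(root.children))
```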
So you simulate a node args.max_step times, and don't reset the env to the original state after each simulation?
Unfortunately, I don't fully understand your question, but no node will be simulated max_step times. Each time simulate_single_step is called, a different node is selected for simulation based on the tree policy (the selection step).
checkpoint_data_manager is used to store checkpoints of the environment. Every time a different state is simulated, its checkpoint is first loaded by load_checkpoint_env.
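As a hypothetical sketch of that checkpointing pattern (the CheckpointManager and ToyEnv classes below are illustrative stand-ins and do not mirror the actual checkpoint_data_manager / load_checkpoint_env API):

```python
import copy

class CheckpointManager:
    """Illustrative stand-in for an environment checkpoint store."""
    def __init__(self):
        self._checkpoints = {}

    def save(self, node_id, env):
        # Store a deep copy of the environment state for this node.
        self._checkpoints[node_id] = copy.deepcopy(env.state)

    def load(self, node_id, env):
        # Restore the environment to the node's stored state (instead of
        # calling env.reset), so the next simulation starts from that state.
        env.state = copy.deepcopy(self._checkpoints[node_id])

class ToyEnv:
    def __init__(self):
        self.state = 0
    def step(self, action):
        self.state += action
        return self.state, float(action), False, {}

env, manager = ToyEnv(), CheckpointManager()
manager.save("root", env)
env.step(3)                   # simulate some steps from the root's state
manager.load("root", env)     # restore before simulating a different node
assert env.state == 0
```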
Thanks, I understand.
One more question: why do we need simulate_trajectory? Can we just call self.simulate_single_move(init_state)?
This is the classic way of using MCTS for planning, and it is the method used in AlphaGo, AlphaZero, MuZero, etc. In a typical MCTS planning setup, we have a state for which we want to decide which action to choose. The MCTS is initialized with a single root node corresponding to that state. After performing several rollouts and building a search tree, the MCTS agent uses the information in the tree to decide which action to take. We then take that action and arrive at the next state, where we again use MCTS to decide which action to choose.
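Sketched as code, that outer loop might look roughly like the following; the function names follow the discussion, but the bodies are placeholders rather than the repository's actual simulate_trajectory / simulate_single_move.

```python
import random

def simulate_single_move(state, n_rollouts=32):
    """Placeholder: run an MCTS search rooted at `state` and pick an action."""
    # In the real agent this would grow a search tree with n_rollouts rollouts
    # and choose the action with the best visit/value statistics at the root.
    return random.choice([-1, +1])

def simulate_trajectory(init_state, env_step, max_steps=10):
    """Plan with a fresh MCTS search at every state along the trajectory."""
    state, trajectory = init_state, []
    for _ in range(max_steps):
        action = simulate_single_move(state)  # new search rooted at the current state
        state, done = env_step(state, action)
        trajectory.append((action, state))
        if done:
            break
    return trajectory

def env_step(state, action):
    """Toy transition used only for this example."""
    new_state = state + action
    return new_state, abs(new_state) >= 5

print(simulate_trajectory(0, env_step))
```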
thanks