Closed YYangZiXin closed 3 years ago
In tree/UCT.py, line 185 forces done to False.
Hi Zixin,
Thanks for the feedback. I think you're right: the code should use the done signal provided in line 179. However, I don't think this will change the actual behavior of the agent, since it will only perform one extra step with reward 0 (generated in line 192) before the while loop in line 189 terminates.
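For illustration, here is a rough sketch of that kind of rollout loop, assuming a Gym-style env.step interface; the function and variable names are placeholders, not the repository's actual code.

```python
# Minimal sketch (not the repository's code): a rollout loop that uses the
# done flag returned by the environment instead of forcing done = False.

def rollout(env, policy, max_depth):
    """Roll out `policy` from the env's current state until done or max_depth."""
    state = env.state          # assumes the desired checkpoint is already loaded
    total_reward, done, step = 0.0, False, 0
    while not done and step < max_depth:
        action = policy(state)
        # Respect the env's done signal; forcing it to False would only add
        # one extra zero-reward step before the loop terminates anyway.
        state, reward, done, _ = env.step(action)
        total_reward += reward
        step += 1
    return total_reward
```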
Another question: at line 114 of tree/UCT.py, the loop seems to run trajectories without calling env.reset, and the function called in the loop is simulate_single_step. "Simulation" starts from a newly expanded node, right?
Each call to simulate_single_step performs one rollout (i.e., selection, expansion, simulation, and backpropagation). These calls iteratively grow the search tree and accumulate the information collected by the simulations.
Simulation does not necessarily start from a newly expanded node; in certain cases (e.g., when the width and depth thresholds of the search tree are met), a simulation can begin from an existing node.
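As a sketch of what one such rollout can look like (the Node class, thresholds, and the random placeholder return below are assumptions for illustration, not the repository's implementation):

```python
import math
import random

class Node:
    """Toy search-tree node used only for this sketch."""
    def __init__(self, parent=None, depth=0):
        self.parent, self.depth = parent, depth
        self.children = []
        self.visits, self.value = 0, 0.0

def uct_select(node, max_width, c=1.4):
    # Selection: follow the UCT tree policy while the node is fully expanded.
    while len(node.children) >= max_width:
        node = max(node.children,
                   key=lambda n: n.value / (n.visits + 1e-8)
                   + c * math.sqrt(math.log(node.visits + 1) / (n.visits + 1e-8)))
    return node

def simulate_single_step(root, max_width=3, max_depth=5):
    node = uct_select(root, max_width)
    # Expansion: add a child unless the depth threshold is met, in which case
    # the simulation begins from the existing node instead.
    if node.depth < max_depth:
        child = Node(parent=node, depth=node.depth + 1)
        node.children.append(child)
        node = child
    ret = random.random()          # Simulation: placeholder for an env rollout.
    while node is not None:        # Backpropagation: push the return to the root.
        node.visits += 1
        node.value += ret
        node = node.parent

root = Node()
for _ in range(50):                # repeated calls iteratively grow the tree
    simulate_single_step(root)
print(root.visits, len(root.children))
```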
So you simulate a node args.max_step times, and don't reset the env to the original state after each simulation?
Unfortunately, I don't fully understand your question, but no node will be simulated max_step times. Each time simulate_single_step is called, a different node is selected for simulation based on the tree policy (the selection step).
checkpoint_data_manager is used to store checkpoints of the environment. Every time a different state is simulated, its checkpoint is first loaded by load_checkpoint_env.
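As a hypothetical sketch of that checkpointing pattern (the CheckpointManager and ToyEnv classes below are illustrative stand-ins and do not mirror the actual checkpoint_data_manager / load_checkpoint_env API):

```python
import copy

class CheckpointManager:
    """Illustrative stand-in for an environment checkpoint store."""
    def __init__(self):
        self._checkpoints = {}

    def save(self, node_id, env):
        # Store a deep copy of the environment state for this node.
        self._checkpoints[node_id] = copy.deepcopy(env.state)

    def load(self, node_id, env):
        # Restore the environment to the node's stored state (instead of
        # calling env.reset), so the next simulation starts from that state.
        env.state = copy.deepcopy(self._checkpoints[node_id])

class ToyEnv:
    def __init__(self):
        self.state = 0
    def step(self, action):
        self.state += action
        return self.state, float(action), False, {}

env, manager = ToyEnv(), CheckpointManager()
manager.save("root", env)
env.step(3)                   # simulate some steps from the root's state
manager.load("root", env)     # restore before simulating a different node
assert env.state == 0
```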
Thanks, I understand.
One more question: why do we need simulate_trajectory? Can we just call self.simulate_single_move(init_state)?
This is the classic way of using MCTS for planning, and it is the method used in AlphaGo, AlphaZero, MuZero, etc. In a typical MCTS planning setup, we have a state for which we want to decide which action to choose. The MCTS is initialized with a single root node corresponding to that state. After performing several rollouts and building a search tree, the MCTS agent uses the information in the tree to decide which action to take. We then take that action and arrive at the next state, where we again use MCTS to decide which action to choose.
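Sketched as code, that outer loop might look roughly like the following; the function names follow the discussion, but the bodies are placeholders rather than the repository's actual simulate_trajectory / simulate_single_move.

```python
import random

def simulate_single_move(state, n_rollouts=32):
    """Placeholder: run an MCTS search rooted at `state` and pick an action."""
    # In the real agent this would grow a search tree with n_rollouts rollouts
    # and choose the action with the best visit/value statistics at the root.
    return random.choice([-1, +1])

def simulate_trajectory(init_state, env_step, max_steps=10):
    """Plan with a fresh MCTS search at every state along the trajectory."""
    state, trajectory = init_state, []
    for _ in range(max_steps):
        action = simulate_single_move(state)  # new search rooted at the current state
        state, done = env_step(state, action)
        trajectory.append((action, state))
        if done:
            break
    return trajectory

def env_step(state, action):
    """Toy transition used only for this example."""
    new_state = state + action
    return new_state, abs(new_state) >= 5

print(simulate_trajectory(0, env_step))
```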
thanks