haroldsultan / MCTS

Python Implementations of Monte Carlo Tree Search
236 stars 85 forks source link

Rewards not according to criteria #4

Open Syaoran87 opened 6 years ago

Syaoran87 commented 6 years ago

Hi,

Thank you for your awesome code (mcts.py). However, I think the algorithm is not working according to the policy network. For example, in the link you gave, under section 2.1, it is stated that it aims to find the policy that yields the highest reward. When I ran the algorithm, the nodes/child selected were not the highest rewards. The root child selected also did not have the highest reward. Does this mean that the policy network is not optimized, and therefore, did not choose the best action/search?

E.g. as follows: level 0 Num Children: 4 (0, Node; children: 4; visits: 354; reward: 295.106667) (1, Node; children: 4; visits: 928; reward: 826.746667) <--- This was selected by the algorithim (2, Node; children: 4; visits: 582; reward: 504.422222) (3, Node; children: 4; visits: 933; reward: 831.537778) <--- Shouldn't this be selected? Best Child: Value: 20; Moves: [20]

The input parameters I ran was 10000 loops and 8 levels.

Thank you once again for your help. Hope to hear from you soon.

James

haroldsultan commented 6 years ago

James,

You are correct, the algorithm is built to optimize reward/visit. This is the most common approach in the literature and is called MAXCHILD. There are other options like ROBUSTCHILD which optimize by selecting the most visited child, although that is not what the code is doing.

In your example, Node 1 and 3 are close. Node 3 has a reward/visit ratio of 0.89125163772 while Node 1 has a reward/visit ratio of 0.89089080495. So you correct, node 3 should have been selected.

If you look at the code, the end of UCTSEARCH is a call to BESTCHILD(,0). Setting the scalar in BESTCHILD to 0 means we do not care about exploring at all and will pick the best child by selecting the child with the highest reward/visit. If there is a tie, we randomly pick between the bests. In your case I don't see what happened. Would you mind including some debug prints in the BESTCHILD function and see what is happening. Let me know if you want me to check in some code with the debug prints.

Syaoran87 commented 6 years ago

Hi haroldsultan!

Thank you for your quick response! Really appreciate it! I'm using the same code as uploaded on your github. In the parameter settings, I left the function BESTCHILD(root,0) as 0. I have added the prints which I hope can identify the error.

The original output: level 0 Num Children: 4 (0, Node; children: 4; visits: 901; reward: 801.297778) (1, Node; children: 4; visits: 625; reward: 543.942222) (2, Node; children: 4; visits: 462; reward: 393.466667) (3, Node; children: 4; visits: 993; reward: 887.457778) Best Child: Value: 20; Moves: [20]

level 1 Num Children: 4 (0, Node; children: 4; visits: 732; reward: 661.217778) (1, Node; children: 4; visits: 718; reward: 647.982222) (2, Node; children: 4; visits: 597; reward: 532.697778) (3, Node; children: 4; visits: 671; reward: 603.208889) Best Child: Value: 2; Moves: [20, -18]

level 2 Num Children: 4 (0, Node; children: 4; visits: 554; reward: 512.702222) (1, Node; children: 4; visits: 232; reward: 200.040000) (2, Node; children: 4; visits: 323; reward: 287.355556) (3, Node; children: 4; visits: 558; reward: 516.524444) Best Child: Value: 18; Moves: [20, -18, 16]

level 3 Num Children: 4 (0, Node; children: 4; visits: 378; reward: 351.084444) (1, Node; children: 4; visits: 319; reward: 292.400000) (2, Node; children: 4; visits: 365; reward: 338.155556) (3, Node; children: 4; visits: 420; reward: 392.973333) Best Child: Value: -3; Moves: [20, -18, 16, -21]

level 4 Num Children: 4 (0, Node; children: 4; visits: 177; reward: 160.048889) (1, Node; children: 4; visits: 306; reward: 291.062222) (2, Node; children: 4; visits: 217; reward: 200.377778) (3, Node; children: 4; visits: 289; reward: 273.684444) Best Child: Value: 9; Moves: [20, -18, 16, -21, 12]

level 5 Num Children: 4 (0, Node; children: 4; visits: 197; reward: 185.795556) (1, Node; children: 4; visits: 222; reward: 211.857778) (2, Node; children: 4; visits: 229; reward: 219.191111) (3, Node; children: 4; visits: 211; reward: 200.382222) Best Child: Value: -1; Moves: [20, -18, 16, -21, 12, -10]

level 6 Num Children: 4 (0, Node; children: 4; visits: 133; reward: 124.777778) (1, Node; children: 4; visits: 188; reward: 182.902222) (2, Node; children: 4; visits: 188; reward: 183.013333) (3, Node; children: 4; visits: 118; reward: 109.075556) Best Child: Value: 7; Moves: [20, -18, 16, -21, 12, -10, 8]

level 7 Num Children: 4 (0, Node; children: 4; visits: 167; reward: 163.924444) (1, Node; children: 4; visits: 152; reward: 147.773333) (2, Node; children: 4; visits: 139; reward: 133.800000) (3, Node; children: 4; visits: 152; reward: 147.773333) Best Child: Value: 1; Moves: [20, -18, 16, -21, 12, -10, 8, -6]

logger.info(BESTCHILD(root,0)): INFO:MyLogger:simulation: 9999 INFO:MyLogger:Node; children: 4; visits: 10000; reward: 8770.471111 INFO:MyLogger:Node; children: 4; visits: 2790; reward: 2455.066667 INFO:MyLogger:Node; children: 4; visits: 2827; reward: 2548.728889 INFO:MyLogger:Node; children: 4; visits: 1741; reward: 1588.404444 INFO:MyLogger:Node; children: 4; visits: 1424; reward: 1322.213333 INFO:MyLogger:Node; children: 4; visits: 925; reward: 866.120000 INFO:MyLogger:Node; children: 4; visits: 875; reward: 835.364444 INFO:MyLogger:Node; children: 4; visits: 664; reward: 639.906667 INFO:MyLogger:Node; children: 4; visits: 596; reward: 582.471111

I have attached the debug prints within the BESTCHILD function (bestchildren). bestchildren.zip

Thank you once again for your help!

James

Syaoran87 commented 6 years ago

Hi Harold,

I've trying to figure out the error. However, I still cannot not identify why the highest rewards node is not selected as the best child... I would like to seek you advice on this.

Thank you.

James