jinzishuai / learn2deeplearn

A repository of code for learning deep learning
GNU General Public License v3.0

I cannot get FrozenLake-v0 in OpenAI Gym to perform consistently above 78% #43

Open jinzishuai opened 6 years ago

jinzishuai commented 6 years ago

FrozenLake-v0 defines "solving" as getting an average reward of 0.78 over 100 consecutive trials.

But my results are around 70-75%: https://github.com/jinzishuai/learn2deeplearn/blob/master/learnRL/OpenAIGym/FrozenLake/results.md

We should test this submission, which claims 0.79 ± 0.05: https://gym.openai.com/evaluations/eval_4VyQBhXMRLmG9y9MQA5ePA/
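
For reference, the "solved" check can be scripted along these lines. This is a minimal sketch assuming the 2018-era gym API (env.reset() returns the state, env.step() returns a 4-tuple, and FrozenLake-v0 still exists); run_episode and solved are illustrative names, not code from this repo:

import gym
import numpy as np

def run_episode(env, policy):
    # One rollout of a fixed 4x4 policy grid; FrozenLake pays reward 1.0
    # only when the goal is reached, so the return is 0.0 or 1.0.
    state, done, total = env.reset(), False, 0.0
    while not done:
        action = policy[state // 4][state % 4]  # states 0-15 are row-major
        state, reward, done, _ = env.step(action)
        total += reward
    return total

def solved(env, policy, threshold=0.78, n_episodes=100):
    # The v0 spec: average reward of at least 0.78 over 100 consecutive trials.
    return np.mean([run_episode(env, policy) for _ in range(n_episodes)]) >= threshold

env = gym.make('FrozenLake-v0')
policy = np.random.randint(4, size=(4, 4))  # placeholder; substitute a learned policy
print(solved(env, policy))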

jinzishuai commented 6 years ago
E:\ShiJin\learn2deeplearn\learnRL\OpenAIGym\FrozenLake\genetic>python frozenlake_genetic_algorithm.py
[2018-01-29 15:47:57,703] Making new env: FrozenLake-v0
Generation 1 : max score = 0.20
Generation 2 : max score = 0.30
Generation 3 : max score = 0.60
Generation 4 : max score = 0.66
Generation 5 : max score = 0.79
Generation 6 : max score = 0.84
Generation 7 : max score = 0.80
Generation 8 : max score = 0.78
Generation 9 : max score = 0.79
Generation 10 : max score = 0.80
Generation 11 : max score = 0.80
Generation 12 : max score = 0.80
Generation 13 : max score = 0.82
Generation 14 : max score = 0.80
Generation 15 : max score = 0.86
Generation 16 : max score = 0.81
Generation 17 : max score = 0.81
Generation 18 : max score = 0.78
Generation 19 : max score = 0.84
Generation 20 : max score = 0.79
Best policy score = 0.85. Time taken = 52.1701
Best policy = [[0 3 3 3]
 [0 3 0 1]
 [3 1 0 3]
 [3 2 1 0]]
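
For context, frozenlake_genetic_algorithm.py evolves exactly this kind of 4x4 action grid. A generic sketch of the approach, where the population size, crossover scheme, and mutation rate are illustrative guesses rather than the script's actual hyperparameters:

import gym
import numpy as np

def run_episode(env, policy):
    # policy is a flat array of 16 actions (0-3), one per grid cell.
    state, done, total = env.reset(), False, 0.0
    while not done:
        state, reward, done, _ = env.step(policy[state])
        total += reward
    return total

def score(env, policy, n_episodes=100):
    return np.mean([run_episode(env, policy) for _ in range(n_episodes)])

def evolve(env, pop_size=20, n_generations=20, mutation_rate=0.05):
    pop = [np.random.randint(4, size=16) for _ in range(pop_size)]
    for gen in range(n_generations):
        scores = [score(env, p) for p in pop]
        print('Generation %d : max score = %.2f' % (gen + 1, max(scores)))
        elite = [pop[i] for i in np.argsort(scores)[-pop_size // 2:]]  # top half
        children = []
        while len(children) < pop_size - len(elite):
            a, b = [elite[i] for i in np.random.randint(len(elite), size=2)]
            mask = np.random.rand(16) < 0.5               # uniform crossover
            child = np.where(mask, a, b)
            flip = np.random.rand(16) < mutation_rate     # point mutations
            child[flip] = np.random.randint(4, size=flip.sum())
            children.append(child)
        pop = elite + children
    return pop[int(np.argmax([score(env, p) for p in pop]))]

env = gym.make('FrozenLake-v0')
print(evolve(env).reshape(4, 4))

Note that because score() here uses only 100 episodes per policy, each generation's reported "max score" is itself noisy; that is exactly the issue explored below.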
jinzishuai commented 6 years ago

But that policy only performed at less than 75%, no better than my own algorithms:

E:\ShiJin\learn2deeplearn\learnRL\OpenAIGym\FrozenLake>python fl_human_policy.py
[2018-01-29 15:51:33,075] Making new env: FrozenLake-v0
policy=
[[ 0  3  3  3]
 [ 0 -1  0 -1]
 [ 3  1  0 -1]
 [-1  2  1 -1]]

7355 out of 10000 runs were successful
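
For readers of these grids: FrozenLake encodes actions as 0 = Left, 1 = Down, 2 = Right, 3 = Up, and the -1 entries in the hand-written policy mark hole and goal cells whose action is never consulted. A state index maps into the grid like this (illustrative snippet, not the repo's code):

policy = [[ 0,  3,  3,  3],
          [ 0, -1,  0, -1],
          [ 3,  1,  0, -1],
          [-1,  2,  1, -1]]

state = 6                                # states 0-15, row-major on the 4x4 map
action = policy[state // 4][state % 4]   # row 1, column 2 -> action 0 (Left)

The randomness comes from the slippery ice: at each step the agent moves in the intended direction with probability 1/3 and slides to each perpendicular direction with probability 1/3, which is why even a good policy cannot reach 100%.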
jinzishuai commented 6 years ago

Original sample size = 100

There is clearly a difference between the first and second computations of the scores, due to stochasticity:

E:\ShiJin\learn2deeplearn\learnRL\OpenAIGym\FrozenLake\genetic>python frozenlake_genetic_algorithm.py
[2018-01-29 20:29:33,700] Making new env: FrozenLake-v0
Generation 1 : max score = 0.15, recomputed score=0.16
Generation 2 : max score = 0.34, recomputed score=0.28
Generation 3 : max score = 0.65, recomputed score=0.59
Generation 4 : max score = 0.74, recomputed score=0.84
Generation 5 : max score = 0.80, recomputed score=0.84
Generation 6 : max score = 0.82, recomputed score=0.67
Generation 7 : max score = 0.85, recomputed score=0.67
Generation 8 : max score = 0.81, recomputed score=0.74
Generation 9 : max score = 0.81, recomputed score=0.61
Generation 10 : max score = 0.79, recomputed score=0.61
Generation 11 : max score = 0.78, recomputed score=0.70
Generation 12 : max score = 0.83, recomputed score=0.71
Generation 13 : max score = 0.82, recomputed score=0.74
Generation 14 : max score = 0.81, recomputed score=0.67
Generation 15 : max score = 0.77, recomputed score=0.70
Generation 16 : max score = 0.80, recomputed score=0.76
Generation 17 : max score = 0.80, recomputed score=0.67
Generation 18 : max score = 0.83, recomputed score=0.76
Best policy score = 0.84. Time taken = 90.5285
Best policy = [[0 3 3 3]
 [0 2 2 2]
 [3 1 0 2]
 [2 2 1 3]]
best policy score = 0.71
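
These swings are the size binomial noise predicts: with n evaluation episodes, the standard error of a success-rate estimate p is sqrt(p*(1-p)/n). A quick check, using p = 0.75 as the rough plateau seen above:

import math

p = 0.75
for n in (100, 1000, 10000):
    print('n = %5d : standard error = +/- %.3f' % (n, math.sqrt(p * (1 - p) / n)))

# n =   100 : standard error = +/- 0.043
# n =  1000 : standard error = +/- 0.014
# n = 10000 : standard error = +/- 0.004

So at n = 100, two measurements of the same policy can easily differ by 0.1 or more, which matches the max-versus-recomputed gaps above.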

Increasing the sample size to 1000:

E:\ShiJin\learn2deeplearn\learnRL\OpenAIGym\FrozenLake\genetic>python frozenlake_genetic_algorithm.py
[2018-01-29 20:31:13,861] Making new env: FrozenLake-v0
Generation 1 : max score = 0.17, recomputed score=0.17
Generation 2 : max score = 0.28, recomputed score=0.29
Generation 3 : max score = 0.60, recomputed score=0.59
Generation 4 : max score = 0.71, recomputed score=0.69
Generation 5 : max score = 0.70, recomputed score=0.70
Generation 6 : max score = 0.70, recomputed score=0.69
Generation 7 : max score = 0.75, recomputed score=0.74
Generation 8 : max score = 0.74, recomputed score=0.73
Generation 9 : max score = 0.74, recomputed score=0.76
Generation 10 : max score = 0.75, recomputed score=0.72
Generation 11 : max score = 0.75, recomputed score=0.73
Generation 12 : max score = 0.75, recomputed score=0.72
Generation 13 : max score = 0.75, recomputed score=0.72
Generation 14 : max score = 0.76, recomputed score=0.73
Generation 15 : max score = 0.76, recomputed score=0.72
Generation 16 : max score = 0.75, recomputed score=0.73
Generation 17 : max score = 0.75, recomputed score=0.71
Generation 18 : max score = 0.75, recomputed score=0.73
Best policy score = 0.77. Time taken = 859.8491
Best policy = [[0 3 3 3]
 [0 3 2 1]
 [3 1 0 0]
 [2 2 1 1]]
best policy score = 0.74
jinzishuai commented 6 years ago

Sample size = 10,000

def evaluate_policy(env, policy, n_episodes=10000):
    # Average success rate of a fixed policy over n_episodes rollouts;
    # run_episode (one rollout) is defined elsewhere in the script.
    total_rewards = 0.0
    for _ in range(n_episodes):
        total_rewards += run_episode(env, policy)
    return total_rewards / n_episodes

Results:

E:\ShiJin\learn2deeplearn\learnRL\OpenAIGym\FrozenLake\genetic>python frozenlake_genetic_algorithm.py
[2018-01-29 20:49:12,746] Making new env: FrozenLake-v0
Generation 1 : max score = 0.16, recomputed score=0.16
Generation 2 : max score = 0.30, recomputed score=0.30
Generation 3 : max score = 0.48, recomputed score=0.48
Generation 4 : max score = 0.69, recomputed score=0.70
Generation 5 : max score = 0.74, recomputed score=0.74
Generation 6 : max score = 0.74, recomputed score=0.74
Generation 7 : max score = 0.74, recomputed score=0.74
Generation 8 : max score = 0.73, recomputed score=0.73
Generation 9 : max score = 0.75, recomputed score=0.74
Generation 10 : max score = 0.74, recomputed score=0.74
Generation 11 : max score = 0.74, recomputed score=0.74
Generation 12 : max score = 0.74, recomputed score=0.74
Generation 13 : max score = 0.75, recomputed score=0.74
Generation 14 : max score = 0.74, recomputed score=0.74
Generation 15 : max score = 0.75, recomputed score=0.74
Generation 16 : max score = 0.75, recomputed score=0.74
Generation 17 : max score = 0.75, recomputed score=0.75
Generation 18 : max score = 0.75, recomputed score=0.75
Best policy score = 0.75. Time taken = 9602.4171
Best policy = [[0 3 3 3]
 [0 2 0 1]
 [3 1 0 3]
 [1 2 1 3]]
best policy score = 0.74
jinzishuai commented 6 years ago

Conclusion: the expected best score is 74-75%, short of the 0.78 "solved" threshold.
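
That conclusion holds up statistically: with n = 10,000 evaluation episodes, a measured 0.74 leaves the 0.78 bar far outside the confidence interval (a quick normal-approximation check):

import math

p_hat, n = 0.74, 10000
half = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)   # 95% interval half-width
print('95%% CI: [%.3f, %.3f]' % (p_hat - half, p_hat + half))
# prints: 95% CI: [0.731, 0.749] -- well below the 0.78 threshold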