openai / evolution-strategies-starter

Code for the paper "Evolution Strategies as a Scalable Alternative to Reinforcement Learning"
https://arxiv.org/abs/1703.03864
MIT License

Doesn't work for continuous_mountain_car #9

Open joyousrabbit opened 7 years ago

joyousrabbit commented 7 years ago

Hello, the algo doesn't work for continuous_mountain_car, because its reward is -pow(action[0],2)*0.1. This means the car's initial state is a local reward maximum: any exploration decreases the reward, so the policy cannot evolve.

Of course, if the car manages to explore its way to the final solution in one try, it will work. But the probability of that is negligible.

How do you handle such a local-maximum initial state?
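For reference, a minimal sketch of why "do nothing" sits at a local reward maximum, assuming gym's classic MountainCarContinuous-v0 API: the zero-action policy collects a return of about 0, while any small perturbation pays the -0.1*action^2 penalty unless it happens to reach the goal.

```python
# Minimal sketch, assuming gym's classic MountainCarContinuous-v0 API
# (obs = env.reset(); obs, rew, done, info = env.step(action)).
import gym
import numpy as np

env = gym.make("MountainCarContinuous-v0")

def rollout(policy, max_steps=999):
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, rew, done, _ = env.step(policy(obs))
        total += rew
        if done:
            break
    return total

do_nothing = lambda obs: np.zeros(1)                 # pays no action penalty: return ~0
small_noise = lambda obs: 0.1 * np.random.randn(1)   # pays -0.1*a^2 per step: return < 0

print("do nothing :", rollout(do_nothing))
print("small noise:", rollout(small_noise))
```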

PatrykChrabaszcz commented 7 years ago

What do you mean by "Of course, if the car can explore the final solution by one try, it will work"? I think that even if it finds a good solution (reaching the final state) by accident, the update to the weights will be too small anyway, since most of the population will want to keep the "do nothing" policy. Correct me if I'm wrong, but I think that for this experiment you would have to change the way the policy weights are updated so that much better results get more weight and the rest are ignored, and you would have to increase the noise so it's possible to find a good policy by adding noise to a policy that does nothing.

This example is quite hard. I managed to get good results for the discrete version (MountainCar-v0), but no success with this one.
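Something along the lines suggested above could look like the following sketch; the function name and the exponential weighting are illustrative, not taken from the repo, and increasing the noise standard deviation would be a separate knob on top of it.

```python
# Hypothetical replacement for flat rank weights: an exponential utility so
# that a single trajectory that reaches the goal dominates the update.
# (Illustrative only; not the repo's update rule.)
import numpy as np

def exp_utility(returns, temperature=0.05):
    r = np.asarray(returns, dtype=np.float64)
    w = np.exp((r - r.max()) / (temperature * (np.abs(r).max() + 1e-8)))
    return w / w.sum()   # weights sum to 1; the best return gets most of it

# e.g. one +90 return among 99 near-zero ones -> its weight is ~1, the rest ~0
```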

joyousrabbit commented 7 years ago

@PatrykChrabaszcz Hello, once the solution is found, the new weights will quickly all be based on that solution.

PatrykChrabaszcz commented 7 years ago

I don't see how one successful solution would drag the weights of the current policy enough to make it more probable to draw policies that reach the final state in the next generation (for this environment). The influence from policies that do nothing will be much bigger under the current default update rule.

Maybe you mean initializing the current policy (by accident) such that a big part of the first population reaches the goal state.

joyousrabbit commented 7 years ago

@PatrykChrabaszcz No, whenever it reaches the goal state, the influence on the following (biased but still random) weights will be big and immediate, because its reward is huge compared with the other candidates that do nothing.

PatrykChrabaszcz commented 7 years ago

The reward might be huge, but by default, if I understand correctly, it uses a weighted average to update the parameters, and the weights in this average come from centered_rank, so they lie in [-0.5, 0.5]. If there is only one good solution in the population, it will be counted as 0.5, but the next one (assuming, for example, a population of size 100) will be counted as 0.49. That's why I said you could change the way those weights are computed so this good solution gets higher importance. Am I right?
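A minimal sketch of the centered-rank transform being described (the repo's compute_centered_ranks works along these lines, though details may differ): returns are replaced by their ranks scaled to [-0.5, 0.5], so with a population of 100 the single good solution gets 0.5 and the runner-up about 0.49, no matter how large the reward gap is.

```python
# Sketch of a centered-rank transform: rank the returns, rescale to [-0.5, 0.5].
import numpy as np

def centered_ranks(returns):
    ranks = np.empty(len(returns), dtype=np.float64)
    ranks[np.argsort(returns)] = np.arange(len(returns))
    return ranks / (len(returns) - 1) - 0.5

# One huge return among 99 tiny negative ones:
returns = np.concatenate([[90.0], -0.1 * np.abs(np.random.randn(99))])
w = centered_ranks(returns)
print(w[0], np.sort(w)[-2])   # 0.5 and ~0.49: the huge reward barely stands out
```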

joyousrabbit commented 7 years ago

It's not an average. It's based only on (R_positive_rank - R_negative_rank)/number_of_rewards, so the huge reward counts as 1 while a tiny reward counts as 0.0000001. They are independent.
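For concreteness, a sketch of the mirrored-sampling update this formula refers to (illustrative, not the repo's exact code): each noise vector is evaluated at +eps and -eps, and eps is weighted by the difference of the two rank-transformed returns before averaging over the population.

```python
# Sketch of a mirrored-sampling ES step (names are illustrative, not from the repo).
import numpy as np

def es_gradient(noise, returns_pos, returns_neg):
    # noise: (n, dim) perturbations; returns_*: rank-transformed returns in [-0.5, 0.5]
    weights = np.asarray(returns_pos) - np.asarray(returns_neg)   # each in [-1, 1]
    return weights @ noise / len(noise)                           # averaged over the population

# theta_new = theta + learning_rate * es_gradient(noise, r_pos, r_neg)
```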
