inoryy / reaver

Reaver: Modular Deep Reinforcement Learning Framework. Focused on StarCraft II. Supports Gym, Atari, and MuJoCo.
MIT License

Unstable performance, sometimes agent converges to no_op action #7

Closed: CHENGY12 closed this issue 5 years ago

CHENGY12 commented 6 years ago

Thank you for the great release. I am trying to train an agent on CollectMineralShards, but cannot reproduce the reported performance. I've made several attempts, but only reach reward=75 at 100k steps. Are there any config parameters I should change? Thanks~

inoryy commented 6 years ago

I don't recall using any special hyperparameters for this map; the defaults should work. Just to be clear, by steps do you mean training steps with 512 samples each? Are you using the default feature/action space config (not the readme example)? How many agents are you using, 32?

CHENGY12 commented 6 years ago

I use the feature/action space config from the readme, and have tried both 32 agents and 24 agents. By steps I mean the iterations shown in TensorBoard.

inoryy commented 6 years ago

Try with 32 agents on the default feature/action space (simply don't specify the cfg_path arg).

CHENGY12 commented 6 years ago

OK~ Thank you very much!

CHENGY12 commented 6 years ago

By the way, for which maps do I need to change the feature/action space config?

inoryy commented 6 years ago

I initially created it for the FindAndDefeatZerglings map, but I actually just used the default when I prepared the results for my thesis.

SarunasSS commented 6 years ago

I would like to add to the thanks for a proper piece of code :)

I would like to ask you something. I am trying to replicate the CollectMineralShards result and so far have failed to climb close to a score of 100 within 200k obs. Afaik, the only difference is that I use 8 workers rather than the default 32. However, that should only make my training take longer, right? Since the graph's x-axis is measured in batches of 512. Thanks

inoryy commented 6 years ago

within 200k obs

By obs do you mean the number of updates (n_updates in the console logs) or the number of samples (n_samples in the console logs)? The learning curve numbers show n_updates. My runs converge to a score of 100 around 35k updates, which is about 18 million samples.
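For reference, a back-of-envelope check of that conversion (assuming 512 samples per update, as mentioned earlier in this thread):

```python
samples_per_update = 512
n_updates = 35_000
n_samples = n_updates * samples_per_update
print(n_samples)  # 17_920_000, i.e. roughly 18 million samples
```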

use 8 workers rather than the default 32. However, that should only make my training longer right?

In on-policy algorithms such as A2C, the agent count can significantly affect performance. It should eventually converge, but that is not guaranteed to happen within the same number of samples.
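A minimal sketch of why this matters, assuming an A2C-style rollout where one update is built from n_envs parallel environments times n_steps steps (the names and numbers are illustrative, not reaver's actual API):

```python
# One A2C update is computed from a rollout of shape (n_steps, n_envs).
# Keeping the batch at 512 samples, 32 envs means 16 steps per env, while
# 8 envs means 64 steps per env: fewer independent trajectories per update,
# so the gradient estimate is noisier and the learning dynamics change even
# if the total number of samples eventually matches.
def rollout_shape(batch_size, n_envs):
    n_steps = batch_size // n_envs
    return n_steps, n_envs

print(rollout_shape(512, 32))  # (16, 32)
print(rollout_shape(512, 8))   # (64, 8)
```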

SarunasSS commented 6 years ago

Makes sense thanks :)

inoryy commented 6 years ago

@SarunasSS I've run some tests with 8 agents and discovered a subtle bug where the agent stops moving and "poisons" the grads with 0 ep reward. I guess with 32 agents it didn't matter, as on average it still improved, so I never noticed. I have an idea where it's coming from but can't give an ETA on the fix for now.

SarunasSS commented 6 years ago

@inoryy what do you mean by stops moving? It could stop when it explores non-move actions, right (e.g. all the selects)? So it could be related to the exploration scheme.

inoryy commented 6 years ago

@SarunasSS no, it looks like it just completely stops taking any actions for the rest of an episode (which can eventually lead to all agents producing 0 rewards for the rest of the run). It might be trying to make an invalid move; I can only know for sure after manually investigating, which is difficult since all of this happens at random even on the same seed.

CHENGY12 commented 6 years ago

Hi @inoryy, I also discovered this bug. I printed the actions and softmax probabilities. The agent stops moving because the chosen action is no_op, whose index is 0 in the action list. The probabilities of all actions end up 0 because the policy output is masked by the "available actions" from the config, and the renormalization doesn't work when all remaining probabilities are 0. To encourage exploration, I tried adding a uniform distribution over the "available actions" whenever all of their probabilities are 0.
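A minimal sketch of that workaround (illustrative only, not reaver's actual code): mask the policy output by the available-actions vector, and fall back to a uniform distribution over the available actions when the masked probabilities sum to zero.

```python
import numpy as np

def masked_action_probs(probs, available):
    # Zero out actions that are not available in the current state.
    masked = probs * available
    total = masked.sum()
    if total > 0:
        return masked / total  # the usual renormalization
    # Degenerate case: every available action has (near-)zero probability,
    # so renormalizing would divide by zero. Fall back to a uniform
    # distribution over the available actions to keep exploring.
    return available / available.sum()

probs = np.array([0.0, 0.0, 0.0, 1.0])      # all mass on an unavailable action
available = np.array([1.0, 1.0, 1.0, 0.0])  # no_op plus two other legal actions
print(masked_action_probs(probs, available))  # [0.333..., 0.333..., 0.333..., 0.0]
```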

SarunasSS commented 6 years ago

@inoryy I managed to replicate the issue as well. Indeed, as @CHENGY12 said, the problem is that no_op becomes the only action with p > 0. Depending on the reward structure this could be a local minimum. E.g. in DefeatRoaches, if the marines do not engage the roaches the score is 0, which is better than losing all marines (-9), so no_op can dominate.

Any ideas how to resolve this?

inoryy commented 6 years ago

@SarunasSS I'll look into it this weekend. Should be easy to find thanks to @CHENGY12's information.

SarunasSS commented 6 years ago

I have been investigating this problem in depth. In most training runs the agents converge to the no_op action and get stuck there no matter what exploration scheme I use (I've tried Boltzmann and e-greedy).

It is weird that it converges to no_op even though it has reached large rewards before. Do you have any ideas about what might be going wrong?

inoryy commented 6 years ago

@SarunasSS sorry, I got side-tracked a bit. I'm almost certain the issue boils down to a case where all available-action policy probabilities end up 0, so re-normalization does nothing and results in very bad gradients. It should be an easy fix, but the bigger issue is finding the time / hardware to extensively test it out.
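One common way to avoid that degenerate case (an assumption on my part, not necessarily the fix that was eventually shipped) is to apply the availability mask in logit space before the softmax, so the available actions can never all collapse to exactly zero probability:

```python
import numpy as np

def masked_softmax(logits, available):
    # Push unavailable actions towards -inf instead of zeroing probabilities
    # after the softmax; available actions always keep non-zero mass.
    masked_logits = np.where(available > 0, logits, -1e9)
    z = masked_logits - masked_logits.max()  # subtract max for numerical stability
    exp = np.exp(z)
    return exp / exp.sum()

logits = np.array([-30.0, -30.0, -30.0, 10.0])  # policy strongly favors an unavailable action
available = np.array([1.0, 1.0, 1.0, 0.0])
print(masked_softmax(logits, available))  # ~[1/3, 1/3, 1/3, 0]
```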

inoryy commented 6 years ago

A little update for people following this issue: I'm currently re-writing the project essentially from scratch, so in the interest of time I've decided not to investigate the issue in the legacy codebase. During the rewrite I'll of course make sure to avoid repeating the problem.

The rewritten project will include updated dependencies, a cleaner API, better flexibility, optimized performance and much more. ETA for the initial release: end of August.

inoryy commented 5 years ago

Fixed!