tesslerc / malmo_rl

MIT License

Memory consumption and learning issues #10

Open Phantomb opened 6 years ago

Phantomb commented 6 years ago

Tested this as well with the unaltered codebase and the training example given in the readme. Over time, memory consumption grows until it uses up all available RAM (nearly 16 GB), at which point either:

  - Python throws an error and stops the execution, or
  - Malmo becomes unreachable, throwing a specific error message every step (I can't recall which one, will edit when it occurs again).

I believe this wasn't the case for earlier versions of the codebase as I was previously able to run for days on end without issue.

I'm curious if you recognise this behaviour, and what part of the code might be responsible.

tesslerc commented 6 years ago

This could be due to the replay memory, as it is the only place where we store data that isn't thrown away. The following parameters could be the cause:

  1. retain_rgb - with RGB retained, each state is 3x larger
  2. success_replay_memory - an additional replay memory which stores only successful trajectories
  3. replay_memory_size - the size of the replay memory itself

It could be this commit: ffe029ec7e9438cc0df528f4d6663cafbd6d0fa9
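For a sense of scale, here is a rough back-of-the-envelope sketch of how those parameters interact. The frame size, frame stacking, dtype, and what exactly is stored per transition are assumptions for illustration, not values taken from the repo:

```python
# Rough replay memory footprint estimate. Assumed (not from the repo):
# 84x84 frames, 4-frame stacking, uint8 storage, state + next_state kept
# per transition.
FRAME_H, FRAME_W, STACK = 84, 84, 4

def replay_footprint_gb(capacity, channels):
    bytes_per_transition = FRAME_H * FRAME_W * channels * STACK * 2  # state + next_state
    return capacity * bytes_per_transition / 1024 ** 3

for capacity in (5_000, 100_000):
    for channels, label in ((1, 'grayscale'), (3, 'retain_rgb')):
        print(f"{capacity:>7} transitions, {label:<11} ~{replay_footprint_gb(capacity, channels):.1f} GB")
```

Under these assumed sizes, a 100K-capacity buffer with RGB retained lands in the ~16 GB range, which is at least the same ballpark as the symptom; the real numbers depend on the actual frame resolution and storage format used.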

Phantomb commented 6 years ago

Hmm, I'll try training with a smaller replay memory size. The increase from 5K to 100K seems pretty significant. Do you reckon setting it back to something like 5K and enabling both the success replay memory and prioritized experience replay would be a worthwhile configuration?

Thanks for the quick answer!

tesslerc commented 6 years ago

I think most problems in Minecraft are rather "simple", meaning short trajectories in a small domain, so this should be fine. Just remember that you need a replay memory large enough that the samples are as diverse as possible.

Phantomb commented 6 years ago

Alright, I did some testing, and by playing with the parameters (most importantly keeping the replay memory smaller) memory is no longer an issue. Thanks!

However, while testing this with the single_room domain, it has not managed to achieve a reliable success percentage above 5%, even after 400k steps. [success-rate plot omitted] This uses the unchanged master branch code. I ran two versions, one with prioritized replay and success replay:

    python main.py qr_dqn single_room --number_of_atoms 200 --number_of_agents 1 --malmo_ports 10000 --retain_rgb --save_name qr-dqn-test_extra-replay --replay_memory_size 5000 --success_replay_memory --prioritized_experience_replay

and one without:

    python main.py qr_dqn single_room --number_of_atoms 200 --number_of_agents 1 --malmo_ports 10000 --retain_rgb --save_name qr-dqn-test --replay_memory_size 5000

Both ran for over 460k steps, and both achieved results similar to the plot above.

However, when I tested the single_room domain with the unmodified codebase and the parameters from the readme a while back, it performed much better, reaching a near 100% success rate after only 150k steps. [success-rate plot omitted] I am wondering what kind of reliable performance you get out of the single_room domain, and with what parameters.

tesslerc commented 6 years ago

From what commit did the test run properly? I'll try and find what I broke in the process.

Phantomb commented 6 years ago

Unfortunately I am not sure of the exact commit, but it was at least shortly before 050800edaace4b2a6b3303fa4bbf08a13773bb74

I'm running a single-room test with the parameters from the readme now, on commit 90e2eaeb0ee878785a2005a7fc7c8ecaf1ba01ad, to check the performance there (currently at 18k steps).

tesslerc commented 6 years ago

OK, thanks. I'll take a look at this on Sunday. If you have any updates by then regarding your current test, please share :)

Phantomb commented 6 years ago

Thanks, will do! Out of curiosity, have you run the single-room domain (or any other domain I could test as well) recently? What kind of performance do you get with it?

tesslerc commented 6 years ago

I think the issue is the success replay memory probability decay. The default parameter is decay steps = 0, and when decay steps is 0 it falls back to the final probability (which defaults to 0%).

Once I've made sure this is the issue, I'll push a quick fix.
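A hypothetical sketch of the failure mode being described; the names and default values below are illustrative, not the repo's actual code:

```python
# A sampling probability that anneals from `initial` to `final` over
# `decay_steps`, falling back to `final` when decay_steps == 0.
def success_sample_probability(step, initial=0.5, final=0.0, decay_steps=0):
    if decay_steps <= 0:
        # With the default decay_steps=0 this always returns `final`, i.e. 0%,
        # so the success replay memory is effectively never sampled.
        return final
    fraction = min(step / decay_steps, 1.0)
    return initial + fraction * (final - initial)

print(success_sample_probability(step=1000))                     # 0.0
print(success_sample_probability(step=1000, decay_steps=20000))  # 0.475
```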

tesslerc commented 6 years ago

Pushed a fix. I see on my end that it solves the task within 50k steps. The command line I used:

    python3.6 main.py qr_dqn single_room --number_of_agents 1 --malmo_ports 10000 --ms_per_tick 75 --save_name qr_single_room --number_of_atoms 200 --double_dqn --epsilon_decay 20000 --success_replay_memory --replay_memory_size 10000

Phantomb commented 6 years ago

Oh, that sounds like great performance! I am still running on commit 90e2eae, and it's only just now starting to pick up in success percentage after >350k steps... [success-rate plot omitted]

I'll give it a try.

Phantomb commented 6 years ago

> Pushed a fix. I see on my end that it solves the task within 50k steps. The command line I used: python3.6 main.py qr_dqn single_room --number_of_agents 1 --malmo_ports 10000 --ms_per_tick 75 --save_name qr_single_room --number_of_atoms 200 --double_dqn --epsilon_decay 20000 --success_replay_memory --replay_memory_size 10000

Interestingly enough, using the exact same command line on the latest commit, mine still hadn't solved it after 210k steps... [success-rate plot omitted]

I am going to run it some more times, but I'm wondering what might cause this discrepancy with your results.

Phantomb commented 6 years ago

I ran it two more times, for over 150k steps each, on two different machines, and got results very similar to my first try. [success-rate plot omitted]

@tesslerc What do you think might cause this difference in performance between you and me?

tesslerc commented 6 years ago

This is really weird behavior... Can you also try with the flag '--normalize_reward'? While in theory this shouldn't change anything, it might help convergence since the network learns values on a smaller scale.
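For context, one common form of reward normalization looks roughly like the sketch below; this is illustrative only and not necessarily how --normalize_reward is implemented in this repo:

```python
# Illustrative reward normalization: divide rewards by a fixed scale so the
# values the network regresses to stay in a small range.
REWARD_SCALE = 100.0  # e.g. the magnitude of the success reward (assumed)

def normalize_reward(reward, scale=REWARD_SCALE):
    return reward / scale

print(normalize_reward(100.0))  # 1.0   (success)
print(normalize_reward(-1.0))   # -0.01 (per-step penalty)
```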

Phantomb commented 6 years ago

I'll try that now and will report back.

Phantomb commented 6 years ago

I was away for the weekend, so can only report back now.

[success-rate plot omitted]

As you can see, it definitely performs better with --normalize_reward (the first ~0.8 success spike comes at 80k steps), but it still has never reached 100% success, even after 600k steps... I ran this on two different machines again, with similar results.

I'm frankly stumped as to what might cause this difference between you and me. Any ideas? Are you running it on Windows or on Linux?

tesslerc commented 6 years ago

This does seem weird... It should converge, as it is a simple problem. I'm running on Ubuntu (Linux).

Try changing supported_actions in https://github.com/tesslerc/malmo_rl/blob/master/agents/single_room.py so that it does not include 'move -1'. This reduces the complexity of the problem. It isn't a proper fix, but it might help.
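For illustration, the change amounts to something like this; the exact action strings used in single_room.py may differ, these are just standard Malmo discrete commands:

```python
# Illustrative only: drop the backwards action from the discrete action set.
full_action_set = ['move 1', 'move -1', 'turn 1', 'turn -1']
supported_actions = [a for a in full_action_set if a != 'move -1']
print(supported_actions)  # ['move 1', 'turn 1', 'turn -1']
```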

Phantomb commented 6 years ago

I will do a run that way on one machine, and try to do a regular run on Ubuntu on another machine. I'll let you know my findings.

Phantomb commented 6 years ago

A tad late, but reporting my findings from this nonetheless: the regular run on Ubuntu was not much more successful than the others (after nearly 200k steps it averaged around 0.6). I also ran it without 'move -1' on Ubuntu, and there it managed to stabilize at virtually 1.0 (i.e. 100%) success after 100k steps. So that's something at least. Right now I'm running some experiments of my own again (on domains like the subskill rooms), but I'm not getting great results there either...

I'm not sure how to tackle this issue except for continuing to toy around with parameters and policies. If you have some enlightening insight, that would still be very welcome.

tesslerc commented 6 years ago

This is really puzzling... I assume the issue is with 'move -1' being a "confusing" action. I think what might happen is that the agent walks backwards and touches the block (finishing the mission) yet without seeing it. This might be causing instability.

I'm glad that you were able to find some configuration which worked well.

I suggest dropping this action and only allowing actions correlated with what the agent is seeing (i.e. move forward, interact with objects it can see, turn right/left).

Phantomb commented 6 years ago

> The agent walks backwards and touches the block without seeing it. This might be causing instability.

Yeah, since the state is only related to what the agent sees, without a very adequate short-term state memory it would be difficult for the agent to understand why a state would suddenly result in a reward.

However, in the subskill domains I'm training now (e.g. the nav2 and pickup tasks from your paper), I already check for that, but the results thus far are not encouraging either. For example, here is the success rate of nav2 (with qr-dqn), which I tested both with and without retain_rgb, with very similar results. [success-rate plot omitted] And the pickup training run (currently at 100k steps) is performing even worse.

So there are two main distinct things that are giving me pause here.

  1. You stated that you were able to solve the single_room domain with 'move -1' relatively quickly, whereas that didn't work nearly as smoothly for me. So maybe your 50k-step solve was a very lucky anomaly? I just don't know.
  2. When I'm training the subskill domains, the main difference from the simulation in your paper (as far as I can tell) is that the agent turns in 90° increments instead of 30°, which might make it skip over visual cues in its peripheral view. To emulate that, I might try to implement a 45° turn instead and handle the movement manually, which would make the agent a bit more cumbersome (30° would be a mess for discrete movements, while with 45° I can somewhat cheat the movement and still put the agent in the center of the next diagonal square). (Addendum: but this seems like a secondary problem as long as I can't replicate your single_room results.)

If you'd like, you can check out my agents to see if I coded anything fishy. For completeness, I ran it with these parameters:

    python3.6 main.py qr_dqn subskill_nav2 --number_of_agents 1 --malmo_ports 10000 --ms_per_tick 75 --save_name qr_subskill_nav2_test2_c5b1f3 --number_of_atoms 200 --double_dqn --epsilon_decay 20000 --success_replay_memory --replay_memory_size 10000 --normalize_reward --retain_rgb

How are your experiments holding up? Are they learning just fine? Do you have more visual cues maybe?

tesslerc commented 6 years ago

I am somewhat unsure, as I tested again with move -1 and it also didn't work well. It could be that my previous run was without move -1.

But I will take a look at your agents. I believe pickup should learn easily and the navigation2 domain might be a tad harder as it has some partial observability issues (pillars block the line of sight).

** We were able to turn 30 degrees, as we didn't use Malmo in the paper. I really hope this isn't the issue :\

Phantomb commented 6 years ago

> I am somewhat unsure, as I tested again with move -1 and it also didn't work well. It could be that my previous run was without move -1.

Ah that might explain a lot.

> But I will take a look at your agents.

Thanks!

tesslerc commented 6 years ago

OK, I'm not sure this is the issue, but it might be.

For example, look at subskill_pickup.py. In order to define success, you perform the following check:

    # Check if the agent was already touching the block
    if self.touching_block:
        # Check if the agent is facing the block
        if (((grid[10] == u'gold_block' and yaw == 'north') or
             (grid[14] == u'gold_block' and yaw == 'east') or
             (grid[16] == u'gold_block' and yaw == 'south') or
             (grid[12] == u'gold_block' and yaw == 'west')) and
                action_command == 'move 1'):
            # If the agent executed dummy action 'move 1', the agent succeeded
            return self.reward_from_success, True, state, False, True

I had some issues with this before, and I moved to using the built-in behavior from Malmo, as you can see here: https://github.com/tesslerc/malmo_rl/blob/master/agents/domains/single_room.xml

Basically, you tell Malmo to provide a large positive reward for success, and you check whether the reward > 0. If so, the agent has reached the goal.
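A hedged sketch of that pattern is below. The reward-reading calls are standard Malmo Python API (getWorldState, world_state.rewards, getValue); the surrounding structure is illustrative and not the repo's actual agent loop:

```python
def check_success(agent_host):
    """Return (step_reward, success) for the latest world state.

    Assumes the mission XML grants a large positive reward on reaching the
    goal, as in agents/domains/single_room.xml.
    """
    world_state = agent_host.getWorldState()
    step_reward = sum(r.getValue() for r in world_state.rewards)
    return step_reward, step_reward > 0
```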

Additionally, I tend to prefer a zero reward upon success and a negative reward per step. This ensures the Q-value is strictly negative for every state-action pair, which prevents positive feedback loops that can occur due to estimation issues. Empirically, I saw better results with this configuration.
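To spell out the reasoning: with a per-step reward r_t <= 0 and a terminal reward of 0, every discounted return is non-positive, so

    Q(s, a) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t} r_{t}\Big] \le 0 \quad \text{for all } (s, a),

and any positive Q estimate produced by bootstrapping is immediately recognizable as overestimation rather than a value the targets can keep reinforcing. With a positive terminal reward, an overestimated positive value looks plausible and can propagate through the updates.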

Phantomb commented 6 years ago

Hey, thanks for the follow up.

For the pickup task, I was looking for a way to give control to Malmo, but I'm afraid none of the available QuitFrom commands provide a way to do exactly what the original research did (i.e. require a specific action to pick the item up). I could use AgentQuitFromCollectingItem with just the spawned item in the middle of the room instead of the block. That stays more true to the Minecraft world, but it misses the separate pickup action, and the object might be relatively small in the peripheral vision / the success state might be visually confusing for the network, since the frame doesn't show why it suddenly was a success (the item despawns immediately on pickup). What are your thoughts on this?

Also, I have been training subskill_nav2 -- which does have the mission success checks controlled by Malmo -- for 500k steps with qr-dqn, and it doesn't show any learning progress either. The success rate just randomly fluctuates between 0.0 and 0.2. So I'm also not sure how that should be remedied.

I will change the reward structure to be zero upon success.

tesslerc commented 6 years ago

Maybe you can combine AgentQuitFromCollectingItem with breaking the block? If breaking the block causes the agent to pick it up, this could fix your problem.

Though an alternative is to ignore the whole "pickup action" and just require that the agent touch the block (similar to the simple room in this git).

I'll try to take a look at subskill_nav2; hopefully I'll have some insights.

Phantomb commented 6 years ago

Yes I was thinking about that as well.

And for the breaking (since I don't want the agent to be breaking everything with the discrete commands), is there an easy way to check that I'm targeting the correct type of block before attacking? I could override perform_action from the agent base class (and copy all the method's code) to put an ObservationFromRay check there (using self.agent_host.peekWorldState() to get the current observation before performing the command, without influencing anything), right?

tesslerc commented 6 years ago

That sounds correct. Though you can remove the assert (action_command in self.supported_actions or action_command == 'new game') from the perform_action function, then overload perform_action in your new agent class, checking whether to allow the 'attack 1' command. If not allowed, change the command to jump or some other no-op. Finally, use super to call the original perform_action with the new command.

This is if you prefer to do less copy-pasting.
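A hedged sketch of what that override could look like. The import path and base class name are assumptions (not the repo's actual layout); perform_action comes from the discussion above, and peekWorldState plus the LineOfSight observation (from ObservationFromRay) are standard Malmo API:

```python
import json

from agents.agent import Agent  # assumed import path for the repo's base agent class


class SubskillPickupAgent(Agent):  # hypothetical subclass for the pickup domain
    def perform_action(self, action_command):
        # Only allow 'attack 1' when the agent is actually looking at the
        # gold block; otherwise swap in the jump command as a harmless no-op.
        if action_command == 'attack 1' and not self._facing_gold_block():
            action_command = 'jump 1'
        return super().perform_action(action_command)

    def _facing_gold_block(self):
        # Peek at the latest observation without consuming it.
        world_state = self.agent_host.peekWorldState()
        if not world_state.observations:
            return False
        obs = json.loads(world_state.observations[-1].text)
        los = obs.get('LineOfSight', {})  # requires ObservationFromRay in the mission XML
        return los.get('type') == 'gold_block'
```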

Phantomb commented 6 years ago

Ah yes, that would be better; I'll implement it like that. It might be a logical addition anyway to remove that check from the base agent and always give new agents a perform_action override like you described, for more control over the environment.