Phantomb opened this issue 6 years ago
This could be due to the replay memory, as this is the only section in which we store data (which isn't thrown away). The following params could be the cause:
it could be this commit: ffe029ec7e9438cc0df528f4d6663cafbd6d0fa9
Hmm, I'll give learning with a smaller replay memory size a try. The increase from 5K to 100K seems pretty significant. Do you reckon setting it back to something like 5K and enabling both the success replay memory and the prioritized experience replay might be a worthwhile setting?
Thanks for the quick answer!
I think most problems in Minecraft are rather "simple" meaning short trajectories in a small domain - so this should be fine. Just remember that you need a large enough replay memory so the samples will be as diverse as possible.
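For reference, the replay memory being discussed can be pictured as a fixed-capacity FIFO buffer. This is a simplified stand-in, not the repo's actual implementation:

```python
import random
from collections import deque

# Minimal fixed-capacity replay memory (simplified stand-in, not the
# repo's implementation). The oldest transitions are evicted FIFO, so a
# larger capacity keeps more diverse samples but uses more RAM.
class ReplayMemory:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)  # evicts the oldest item when full

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

This makes the RAM/diversity trade-off concrete: `--replay_memory_size 100000` keeps 20x more (and more diverse) transitions in memory than 5000, at 20x the memory cost.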
Alright, I did some testing, and by playing with the parameters (most importantly keeping the memory smaller) the memory is no longer an issue. Thanks!
However, as I was testing this with the single_room domain, it has not managed to achieve any reliable success percentage above 5%, even after 400k steps. Image:
This uses the unchanged master branch code. I ran two versions, one with prioritized replay and success replay:
python main.py qr_dqn single_room --number_of_atoms 200 --number_of_agents 1 --malmo_ports 10000 --retain_rgb --save_name qr-dqn-test_extra-replay --replay_memory_size 5000 --success_replay_memory --prioritized_experience_replay
And one without:
python main.py qr_dqn single_room --number_of_atoms 200 --number_of_agents 1 --malmo_ports 10000 --retain_rgb --save_name qr-dqn-test --replay_memory_size 5000
Both ran for over 460k steps, and both achieved results similar to the image shown.
However, when I tested the single_room domain with the unmodified codebase and the parameters from the readme a while back, it performed much better, reaching a near 100% success rate after only 150k steps. Image:
I am wondering what kind of reliable performance you get out of the single_room domain, and with what kind of parameters.
From what commit did the test run properly? I'll try and find what I broke in the process.
Unfortunately I am not sure of the exact commit, but it was at least shortly before 050800edaace4b2a6b3303fa4bbf08a13773bb74
I'm running a single_room test with the parameters from the readme now on commit 90e2eaeb0ee878785a2005a7fc7c8ecaf1ba01ad to check the performance there (currently at 18k steps).
OK, thanks. I'll take a look at this on Sunday. If you have any updates by then regarding your current test, please share :)
Thanks, will do! Out of curiosity, have you run the single_room domain (or any other domain I can test as well) any time lately? What kind of performance do you get with it?
I think the issue is due to the success replay memory probability decay. The default parameter is decay steps = 0, and when decay steps is 0, it falls back to the final probability (which defaults to 0%).
Once I've made sure this is the issue, I'll push a quick fix.
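A sketch of the kind of schedule that would produce this behaviour (function and parameter names are assumed for illustration, not taken from the repo):

```python
# Hypothetical sketch of the decay bug described above: with
# decay_steps == 0, the schedule immediately falls through to the final
# probability, which defaults to 0, so the success replay memory is
# never actually sampled.
def success_memory_probability(step, initial=1.0, final=0.0, decay_steps=0):
    if decay_steps <= 0 or step >= decay_steps:
        return final  # falls straight through to the final probability
    # Linear decay from initial to final over decay_steps.
    return initial + (final - initial) * (step / decay_steps)
```

With the default `decay_steps=0`, every call returns `final` (0%), which matches the symptom of `--success_replay_memory` having no effect.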
Pushed a fix. I see on my end that it solves within 50k steps.
The commandline I used:
python3.6 main.py qr_dqn single_room --number_of_agents 1 --malmo_ports 10000 --ms_per_tick 75 --save_name qr_single_room --number_of_atoms 200 --double_dqn --epsilon_decay 20000 --success_replay_memory --replay_memory_size 10000
Oh, that sounds like great performance! I am still running on commit 90e2eae, and it's only just now starting to pick up success percentage after >350k steps..
I'll give it a try.
> Pushed a fix. I see on my end that it solves within 50k steps. The commandline I used:
> python3.6 main.py qr_dqn single_room --number_of_agents 1 --malmo_ports 10000 --ms_per_tick 75 --save_name qr_single_room --number_of_atoms 200 --double_dqn --epsilon_decay 20000 --success_replay_memory --replay_memory_size 10000
Interestingly enough, using the exact same command line on the latest commit, mine still hadn't solved it after 210k steps:
I am going to run it a few more times, but I'm wondering what might cause this discrepancy with your results.
I ran it two more times for over 150k steps on two different machines, and I get very similar results to my first try.
@tesslerc What do you think might cause this difference in performance between you and me?
This is really weird behavior... Can you also try with the flag '--normalize_reward'? While in theory this shouldn't change anything, it might help convergence due to the network learning values on a lower scale.
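One common form of reward normalization, shown here as a sketch (this is not necessarily what the repo's --normalize_reward flag does), rescales rewards by the largest magnitude seen so far so that learning targets stay in a unit range:

```python
# Hypothetical sketch of reward normalization: divide by the largest
# absolute reward observed so far, keeping rewards in [-1, 1] so the
# network learns values on a lower scale. The repo's --normalize_reward
# flag may implement a different scheme.
class RewardNormalizer:
    def __init__(self):
        self.max_abs = 1.0  # avoid division by zero before any reward

    def normalize(self, reward):
        self.max_abs = max(self.max_abs, abs(reward))
        return reward / self.max_abs
```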
I'll try that now and will report back.
I was away for the weekend, so can only report back now.
As you can see, it definitely performs better with --normalize_reward (the first ~0.8 success spike being at 80k steps), but it still has never attained 100% success, even after 600k steps... I ran this on two different machines again, with similar results.
I'm frankly stumped as to what might cause this difference between you and me. Any ideas? Are you running it on Windows or on Linux?
This does seem weird... It should converge, as it is a simple problem. I'm running on Ubuntu (Linux).
Try, under 'https://github.com/tesslerc/malmo_rl/blob/master/agents/single_room.py', changing the supported_actions so as to not include 'move -1'. This reduces the complexity of the problem. It isn't a proper fix, but it might help.
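The suggested change amounts to dropping the backwards-move entry from the action list, along these lines (the exact action strings are assumed for illustration; check single_room.py for the real list):

```python
# Before: the agent can also walk backwards, which it cannot see.
supported_actions = ['move 1', 'move -1', 'turn 1', 'turn -1']

# After: drop 'move -1' to shrink the action space and remove the
# "invisible" backwards move that can end the mission off-screen.
supported_actions = ['move 1', 'turn 1', 'turn -1']
```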
I will do a run that way on one machine, and try to do a regular run on Ubuntu on another machine. I'll let you know my findings.
A tad late, but reporting my findings from this nonetheless: the regular run on Ubuntu was not much more successful than the others (after nearly 200k steps, averaging around 0.6). I also ran it without move -1 on Ubuntu, and there, after 100k steps, it managed to stabilize at virtually 1.0 (so 100%) success. So that's something at least. Right now I'm running some experiments of my own again (on domains like the subskill rooms), but not too great results there either...
I'm not sure how to tackle this issue except for continuing to toy around with parameters and policies. If you have some enlightening insight, that would still be very welcome.
This is really puzzling... I assume the issue is with 'move -1' being a "confusing" action. I think what might happen is that the agent walks backwards and touches the block (finishing the mission) yet without seeing it. This might be causing instability.
I'm glad that you were able to find some configuration which worked well.
I suggest dropping these actions and only allowing actions correlated to what the agent is seeing (i.e. move forward, interact with objects it can see, turn right/left).
> The agent walks backwards and touches the block without seeing it. This might be causing instability.
Yeah, since the state is only related to what it sees, without a very adequate short-term state memory it would be difficult for an agent to understand why a state would suddenly result in a reward.
However, in the subskill domains I'm training now (e.g. the nav2 and pickup domains from your paper), I already check for that, but the results thus far are not encouraging either. For example, here is the success rate of nav2 (with qr-dqn), which I tested both with and without retain_rgb, with very similar results.
And the pickup-training run (currently at 100k) is performing even worse.
So there are two main distinct things giving me pause here. First, my own subskill domains don't seem to learn at all. Second, you managed to solve the single_room domain with move -1 relatively quickly, whereas that didn't at all work as smoothly for me. So maybe your 50k solve was a very lucky anomaly? I just don't know. If you'd like, you can check out my agents to see if I coded anything fishy. For completeness, I ran it with these parameters: python3.6 main.py qr_dqn subskill_nav2 --number_of_agents 1 --malmo_ports 10000 --ms_per_tick 75 --save_name qr_subskill_nav2_test2_c5b1f3 --number_of_atoms 200 --double_dqn --epsilon_decay 20000 --success_replay_memory --replay_memory_size 10000 --normalize_reward --retain_rgb
How are your experiments holding up? Are they learning just fine? Do you have more visual cues maybe?
I am somewhat unsure, as I tested again with move -1 and it also didn't work well. It could be that my previous run was without move -1.
But I will take a look at your agents. I believe pickup should learn easily and the navigation2 domain might be a tad harder as it has some partial observability issues (pillars block the line of sight).
** We were able to turn 30 degrees, as we didn't use Malmo in the paper. I really hope this isn't the issue :\
> I am somewhat unsure, as I tested again with move -1 and it also didn't work well. It could be that my previous run was without move -1.

Ah, that might explain a lot.

> But I will take a look at your agents.

Thanks!
OK, I'm not sure this is the issue, but it might be.
For example, look at subskill_pickup.py. In order to define success, you perform the following check:
```python
# Check if the agent was already touching the block
if self.touching_block:
    # Check if the agent is facing the block
    if (((grid[10] == u'gold_block' and yaw == 'north') or
         (grid[14] == u'gold_block' and yaw == 'east') or
         (grid[16] == u'gold_block' and yaw == 'south') or
         (grid[12] == u'gold_block' and yaw == 'west')) and
            action_command == 'move 1'):
        # If the agent executed dummy action 'move 1', the agent succeeded
        return self.reward_from_success, True, state, False, True
```
I had some issues with this before, and I moved to using the built-in behavior from Malmo as you can see here: https://github.com/tesslerc/malmo_rl/blob/master/agents/domains/single_room.xml
Basically, you tell Malmo to provide a large positive reward for success, and you check whether reward > 0. If so, the agent has reached the goal.
Additionally, I tend to prefer a zero reward upon success and a negative reward per step. This ensures Q is strictly negative for every state-action pair, which prevents the positive feedback loops that can occur due to estimation issues. Empirically, I saw better results in this configuration.
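The reward structure described here can be sketched as follows (the constant values are illustrative, not the repo's actual values):

```python
# Sketch of the reward scheme described above (constants are
# illustrative): a small negative reward per step and zero on success
# keeps returns strictly negative, avoiding positive feedback loops
# from value overestimation.
STEP_REWARD = -0.01   # per-step penalty (hypothetical value)
SUCCESS_REWARD = 0.0  # zero reward on success

def shaped_reward(raw_reward):
    # Malmo is configured to emit a large positive reward on success;
    # map that to 0, and every other step to the small negative penalty.
    return SUCCESS_REWARD if raw_reward > 0 else STEP_REWARD
```

Under this shaping, the discounted return of any trajectory is at most 0, so Q-estimates that drift positive are immediately identifiable as overestimation.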
Hey, thanks for the follow up.
For the pickup one, I was looking for a way to give control to Malmo, but none of the available QuitFrom commands provide a way to do exactly what the original research did (i.e. requiring a specific action to pick up), I'm afraid.
I could use AgentQuitFromCollectingItem with just the spawned item in the middle of the room instead of the block. Then I stay truer to the Minecraft world, but it misses the separate action to pick the item up. Also, the object might be relatively small in the peripheral vision, and the visual success state might be confusing for the network, since it doesn't show why the episode suddenly was a success (the item despawns immediately on pickup).
What are your thoughts on this?
Also, I have been training subskill_nav2 -- which does have its mission success checks controlled by Malmo -- for 500k steps with qr-dqn, and it doesn't show any learning progress either. The success rate just randomly fluctuates between 0.0 and 0.2.
So I'm also not sure how that should be remedied.
I will change the reward structure to be zero upon success.
Maybe you can combine AgentQuitFromCollectingItem with breaking the block?
If breaking the block causes the agent to pick it up, this could fix your problem.
Though an alternative is to ignore the whole "pickup action" and just require that the agent touch the block (similar to the simple room in this git).
I'll try to take a look at subskill_nav2; hopefully I'll have some insights.
Yes I was thinking about that as well.
And for the breaking (since I don't want to be breaking everything with the discretecommands), is there an easy way to check that I'm targeting the correct type of block before attacking? I could override perform_action from the agent base class (and copy all the method's code) to put an ObservationFromRay check there (using self.agent_host.peekWorldState() to get the current observation before performing the command, without influencing anything), right?
That sounds correct. Though you can remove the assert (action_command in self.supported_actions or action_command == 'new game') from the perform_action function. Then overload perform_action in your new agent class, checking whether to allow the attack 1 command or not. If not, change the command to jump or some other no-op. Finally, use super to call the original perform_action with the new command. This is if you prefer to do less copy-pasting.
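The overload-and-delegate approach just described might look something like this sketch. The base class here is a stand-in for the repo's agent base class, and the method signature, the `facing_target` flag (which would come from the ObservationFromRay check), and the no-op command string are all assumed for illustration:

```python
# Stand-in for the repo's agent base class; the real one sends the
# command to Malmo via agent_host.
class BaseAgent:
    def perform_action(self, action_command):
        return action_command  # pretend to execute and echo the command

class PickupAgent(BaseAgent):
    def __init__(self, facing_target):
        # In the real agent, this would be derived from an
        # ObservationFromRay check via agent_host.peekWorldState().
        self.facing_target = facing_target

    def perform_action(self, action_command):
        # Gate the attack: only allow 'attack 1' when the ray check says
        # we are looking at the correct block type; otherwise substitute
        # a harmless no-op, then delegate to the base implementation.
        if action_command == 'attack 1' and not self.facing_target:
            action_command = 'jump 1'  # assumed no-op substitute
        return super().perform_action(action_command)
```

This keeps the base class untouched apart from removing the assert, and concentrates the environment-specific gating in the subclass.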
Ah yes, that would be better; I'll implement it like that. It might be a logical addition anyway to remove that check from the base agent and always have a perform_action method functioning like you said in new agents, for more control over the environment.
Tested as well with the unaltered codebase and the training example given in the readme. Over time, memory consumption grows to use up all available RAM (nearly 16GB), at which point either:
- Python throws an error and stops the execution, or
- Malmo becomes unreachable, throwing a specific error message every step (I can't recall which one, will edit when it occurs again).
I believe this wasn't the case for earlier versions of the codebase as I was previously able to run for days on end without issue.
I'm curious if you recognise this behaviour, and what part of the code might be responsible.