tambetm / gym-minecraft

Minecraft environment for Open AI Gym, based on Microsoft's Malmo.

[Question] Agent misses frames very frequently #3

Closed pavitrakumar78 closed 7 years ago

pavitrakumar78 commented 7 years ago

Hi,

I have been using gym-minecraft in some A3C code that I am testing. It requires multiple clients running in parallel (I've used a Docker image for that). I use 4 agents to train the model and I frequently get the "Agent missed 13 observation(s)." message from all of the agents. I tried reducing the fps to 60 (in the options.txt file in the docker folder); the number of frames it misses has decreased, but the agents still miss frames very frequently.

Is there anything else I should try modifying in the options or the xml?

tambetm commented 7 years ago

Hey @pavitrakumar78! Glad to hear somebody is using it! I'm also launching Minecraft instances using Docker and running my A3C-like agent against them.

To decrease the missed observations you would mainly need to change the <MsPerTick> parameter in the mission XML files. These are located in the gym_minecraft/assets folder within the gym-minecraft installation. If you don't feel like changing installed files, you can also load your own mission XML using the load_mission_file() method.
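As a rough illustration of the second option, the sketch below rewrites <MsPerTick> in a copy of a mission XML instead of touching the installed files. The helper name set_ms_per_tick and the trimmed SAMPLE fragment are hypothetical and are not part of gym-minecraft; real mission files are much larger and declare an XML namespace, which is why the code compares local tag names only.

```python
import xml.etree.ElementTree as ET


def set_ms_per_tick(mission_xml: str, ms_per_tick: int) -> str:
    """Return a copy of the mission XML with every <MsPerTick> set to ms_per_tick."""
    root = ET.fromstring(mission_xml)
    found = False
    for elem in root.iter():
        # Compare local names so this works with or without an XML namespace.
        if elem.tag.rsplit("}", 1)[-1] == "MsPerTick":
            elem.text = str(ms_per_tick)
            found = True
    if not found:
        raise ValueError("mission XML has no <MsPerTick> element")
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed mission fragment used only for illustration.
SAMPLE = "<Mission><ModSettings><MsPerTick>50</MsPerTick></ModSettings></Mission>"

patched = set_ms_per_tick(SAMPLE, 5)
print(patched)
```

The patched string could then be saved to a file of your own and passed to load_mission_file(). Note that ElementTree may rewrite namespace prefixes when serializing a real mission file, so it is worth checking that the result still loads.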

But I guess you've already done that. Other than that you could reduce the resolution; I'm using 40x30 and it seems fine. Also make sure that your computer has enough actual CPU cores to run 4 (or more) processes in parallel. In particular, AWS uses the notion of vCPUs, where 8 vCPUs actually means 4 physical CPU cores. Finally, make sure that your agent's code can actually keep up with Minecraft, i.e. if you perform some costly sync operations after a certain number of frames, this might be the cause.

I'm running it with 5ms ticks and getting 1-2 missed observations occasionally. I've chosen to just ignore it, training seems to progress regardless.

pavitrakumar78 commented 7 years ago

Thanks for your reply @tambetm !

I am running it on a p2.xlarge instance (4 vCPUs). I've had no trouble running 4 agents (Atari, Doom) even though the number of physical cores is closer to 2. Sync operations are done at large intervals (batch-a3c) of about 20-30k steps, so I don't think that is the problem here. I am using a resolution of 40x40, which isn't that different from what you're using. I have been playing around with <MsPerTick> in the XML files, and that seems to have improved my situation. My settings are maxfps: 60 and MsPerTick: 200, but understandably this drastically slows down the training. So I am thinking maxfps: 60 and MsPerTick 125-150 might be the right balance. That will fetch me 6-8 frames per sec, right? So missing 1 to 3 frames here and there shouldn't affect the training that much (or at least I hope :P). Your maxfps setting is 200, I presume? The same as the one in your docker file.

Edit: I've run the batch-a3c agent for about 10 hours on a p2.xlarge instance and I only see performance deteriorate as the number of training steps increases! :( I am testing MinecraftBasic-v0 using this code. I've also been running a normal DQN agent which uses the Nature network, and it's the same case there. I still have not tried your example dual-network agent. I will try it and let you know!

tambetm commented 7 years ago

If you set <MsPerTick> to 125 then you might as well set maxfps to 8 and save some rendering time. One thing you could try is to run the Minecraft instances and the training agent on different AWS instances. As long as they are in the same region, the network latencies should be acceptable. 125ms ticks seem unbearably slow; the default is 50ms.
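The relation between tick length and frame rate used above is just 1000 / MsPerTick. A minimal sketch, assuming at most one observation is produced per tick (the function name is illustrative only):

```python
def ticks_per_second(ms_per_tick: float) -> float:
    """Convert a tick length in milliseconds to ticks per second."""
    if ms_per_tick <= 0:
        raise ValueError("ms_per_tick must be positive")
    return 1000.0 / ms_per_tick


# Tick lengths mentioned in this thread.
for ms in (5, 50, 125, 150, 200):
    print(f"{ms:>3} ms/tick -> {ticks_per_second(ms):.1f} ticks/s")
```

Under that assumption, 125 ms ticks give 8 ticks per second, so rendering faster than maxfps 8 would produce frames the agent never receives.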

Atari and VizDoom pause execution in between steps, but Minecraft keeps going. That means if your CPUs cannot handle all the processes, Atari will run slower but will not lose frames. With Minecraft you will be losing a lot of frames, and that may hinder learning.
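The difference can be pictured with a toy model. This is only a sketch under simplifying assumptions (fixed tick length, fixed agent step time, the agent reads at most one observation per tick) and does not reflect how Malmo actually schedules or counts observations; count_missed and its parameters are made up for illustration.

```python
def count_missed(ms_per_tick: int, agent_step_ms: int, total_ms: int, pauses: bool) -> int:
    """Count observations the agent never sees in a simplified model."""
    if pauses:
        # Atari/VizDoom style: the emulator advances only when the agent steps,
        # so a slow agent makes everything slower but skips nothing.
        return 0
    # Minecraft style: the world ticks on wall-clock time regardless of the agent.
    produced = total_ms // ms_per_tick
    # The agent cannot consume observations faster than they are produced.
    consumed = total_ms // max(agent_step_ms, ms_per_tick)
    return produced - consumed


# One simulated second, agent needs 150 ms per step, environment ticks every 50 ms.
print("pausing env missed:", count_missed(50, 150, 1000, pauses=True))
print("free-running env missed:", count_missed(50, 150, 1000, pauses=False))
```

In this model, lengthening the tick until it is at least as long as the agent's step time brings the missed count to zero, at the cost of slower training.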

I tried the duel example a long time ago and I don't really remember how well it was doing. Right now I have a different codebase that will be public one day.

pavitrakumar78 commented 7 years ago

Ahh.. I did not know that about Atari and VizDoom, thank you for explaining it! :) I will try to host MC clients on separate instances and see if there is any improvement!

pavitrakumar78 commented 7 years ago

@tambetm Hi! I am currently trying to run a Malmo Docker container on an AWS instance and connect to it from my PC or another AWS server in the same region, but it seems to be a one-way connection only.

On the AWS instance I ran the Docker container as: docker run --net=host quay.io/tambet/malmo:0.18 and added the following inbound rule: 10000 tcp 0.0.0.0/0, ::/0

I am able to connect to the Malmo instance running on the server using:

import gym
import gym_minecraft
import minecraft_py

env = gym.make('MinecraftBasic-v0')
env.init(client_pool=[("54.164.105.100", 10000)], allowDiscreteMovement=["move", "turn"])

Where 54.164.105.100 is the AWS server IP.

I see the server status changing (initializing the Minecraft env) on the AWS console when I call env.reset() from my local PC, but the prompt freezes after env.reset(). It keeps waiting for an observation to be returned, but it never arrives. Should I add any more security rules to my servers? (I have not done this before, sorry if this is a very obvious question :( )

tambetm commented 7 years ago

Probably your local PC is behind NAT. It can be challenging to get it working in that case. You might have a better chance with another AWS server, but you need to open ports on both servers.
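One way to check whether a port is reachable in a given direction is a plain TCP connection test. The sketch below uses only the Python standard library; can_connect is a hypothetical helper, and the host and port in the demo are placeholders to be replaced with the machine and port being tested. Which ports the agent side needs open is not stated in this thread, so that remains an assumption to verify.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Placeholder target: replace with the server address and the port to test.
print("port 10000 reachable:", can_connect("127.0.0.1", 10000, timeout=1.0))
```

Running the check from the agent machine towards the Minecraft host, and again from the Minecraft host towards the agent machine, shows whether traffic can flow both ways. A machine behind NAT will typically fail the second check.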