muupan / async-rl

Replicating "Asynchronous Methods for Deep Reinforcement Learning" (http://arxiv.org/abs/1602.01783)

Running on GPU #10

Closed c3msu closed 8 years ago

c3msu commented 8 years ago

Hi, this is awesome. Of the implementations I've found so far, it comes closest to matching the original scores in the paper. Thanks for sharing!

Just one question: I saw a GPU setting in your code. Have you ever tested it on a GPU? I'm curious whether it would be even faster than an AWS c4.8xlarge.

Thanks.

muupan commented 8 years ago

No, I haven't tested it. I think supporting GPU computation is possible and worth trying, but I don't know whether it would actually make training faster. Since multiprocessing.RawArray is used to store the globally shared parameters, using GPUs would require frequent parameter copies between host and device.
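
Roughly what I mean, as a toy sketch (sizes and names made up for illustration, not the repo's actual code):

```python
import multiprocessing as mp
import numpy as np

# Illustration only: the globally shared parameters live in a RawArray,
# i.e. plain CPU memory shared by all worker processes.
N_PARAMS = 1000  # made-up size
shared_raw = mp.RawArray('f', N_PARAMS)

def worker(raw):
    # Every process views the same CPU buffer as a numpy array, without copying.
    global_params = np.frombuffer(raw, dtype=np.float32)
    local_params = global_params.copy()
    # ... run the model and update local_params ...
    # If the model lived on a GPU, local_params would have to be copied
    # host -> device here, and copied back device -> host before this write:
    global_params[:] = local_params

if __name__ == '__main__':
    procs = [mp.Process(target=worker, args=(shared_raw,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```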

c3msu commented 8 years ago

By "require frequent parameter copy between host and device", do you mean using multiple GPUs?

muupan commented 8 years ago

No matter how many GPUs you use, you need to synchronize the thread- or process-specific parameters with the global parameters every I_{update} steps. Since the global parameters reside in CPU memory, this requires copies between CPU and GPU memory.

This would not be the case if the global parameters resided in GPU memory, but I don't know whether that is possible with Python's multiprocessing.
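
To make the copy pattern concrete, here is a numpy-only sketch of the sync every I_{update} steps; the cupy calls in the comments show where the host/device copies would go (names and numbers are just placeholders):

```python
import numpy as np

I_UPDATE = 5  # local steps between syncs; value chosen only for illustration
global_params = np.zeros(1000, dtype=np.float32)  # would be a RawArray view on the CPU

def sync(local_params):
    # device -> host: push the worker's updates into shared CPU memory
    # (with a real GPU this would involve e.g. cupy.asnumpy(local_params));
    # simplified to a plain overwrite instead of applying accumulated gradients.
    global_params[:] = local_params
    # host -> device: pull the latest global parameters back to the GPU
    # (with a real GPU, e.g. cupy.asarray(global_params))
    return global_params.copy()

local_params = global_params.copy()
for t in range(1, 21):
    local_params += 0.01  # stand-in for the worker's local updates
    if t % I_UPDATE == 0:
        local_params = sync(local_params)
```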

miyosuda commented 8 years ago

I'm using multi-threading and storing the global network parameters in GPU memory. (I'm implementing A3C with TensorFlow.) Even though multi-threading has the Global Interpreter Lock problem, the GPU version was much faster than the CPU version in my environment.

All of the computation graph is executed within the C++ layer, so the GIL might not matter much in my TensorFlow environment. (I haven't compared multi-process vs. multi-thread yet, so I'm just guessing.)
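
Roughly, the idea looks like this (a simplified TensorFlow 1.x-style sketch, not the code in my repo; shapes and names are made up):

```python
import threading
import tensorflow as tf

# The global network's variables are placed on the GPU and shared by all
# worker threads, so syncing the local network is a device-side copy.
with tf.device('/gpu:0'):
    global_w = tf.Variable(tf.zeros([256, 4]), name='global_w')
    local_w = tf.Variable(tf.zeros([256, 4]), name='local_w')
    sync_op = tf.assign(local_w, global_w)  # GPU -> GPU, no host copy

sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
sess.run(tf.global_variables_initializer())

def worker():
    for _ in range(100):
        # sess.run releases the GIL while the C++ runtime executes the ops,
        # so several Python threads can keep the GPU busy despite the GIL.
        sess.run(sync_op)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```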

c3msu commented 8 years ago

Thanks @miyosuda, very impressive. I'm trying to build your code but ran into some issues; I'll spend more time checking. In the meantime, do you have a graph or stats showing how long it took for training scores to reach the same level as in the original paper? For example, 400 on Breakout.

I will close this issue. Thanks to both of you.

miyosuda commented 8 years ago

@c3msu I'll prepare graphs comparing CPU and GPU in my repo later. I just compared running 8 parallel game environments on a CPU (Core i7-6700) and a GPU (GTX 980 Ti). Even with the GPU, the learning speed of my implementation is slower than the paper's, so it doesn't mean a GPU is faster than a c4.8xlarge.

When I switched from CPU to GPU, learning was about 2 times faster in my implementation. @muupan's implementation is so sophisticated and fast that I was just curious how much a GPU would accelerate it.

miyosuda commented 8 years ago

@c3msu @muupan Sorry, I have to correct what I said. I measured the global steps per second when running on GPU and CPU in my TensorFlow environment (8 parallel threads, Core i7-6700 and GTX 980 Ti).

They are:

GPU: 364 steps/sec
CPU: 286 steps/sec

So the GPU is not twice as fast as I wrote; I wonder why I thought the speed had doubled. (When I switched to the GPU, I also changed the TensorFlow version, the loss function, etc., but I still need to apologize for my mistake.) Anyway, I'll post a learning graph later.

c3msu commented 8 years ago

@miyosuda Thanks for clarifying. The GPU is still quicker. :-)