Hi @pirobot
I am going to run a few experiments. It is the holiday time, so my responses may take longer than usual. This item is important to us and to me. Thank you for the detailed report and code snippet. If I need more help reproducing, I will ping you. Thank you again.
Many thanks @tfboyd and no hurry as I realize we can all use a break during the holidays. :)
Just pinging to see if there has been any progress on this issue? Thanks!
@tfboyd Do you think this issue will get addressed eventually? It would be great if someone could at least reproduce the performance issue I detailed above. Unfortunately our lab will have to abandon tf-agents if this issue cannot be resolved since it will take a month to do experiments that can otherwise be done in a day with stable-baselines. I'm still hoping I have just missed a critical parameter that you will discover after running the above code. We have a definite preference to stick with tf-agents but this performance issue is preventing us from using it in any kind of serious research. Thanks!
Sorry about the delay. I will confirm with the team tomorrow. I wanted to dig into this but I think we will be focusing on other priorities in Q1 with our existing resources. I will ping back tomorrow and put it on the agenda for our weekly team meeting.
I understand your frustration and need to find a platform that works for your situation.
@tfboyd No worries and thanks for the response. We'll use stable-baselines for now and wait to hear back eventually if someone can confirm the performance differences.
@tfboyd quick ping on this one.
@kuanghuei may take a look, as the train time seems really long. Thank you for being reasonable and kind as we work to make improvements.
I noticed the same problem; any update on it? Is there also any comparison with stable-baselines other than speed?
I tried to run some experiments on GPU. It seems like the slowdown only happens with LunarLanderContinuous-v2 and not on HalfCheetah. Even with the slowdown, I was able to run ~80 steps/sec on a Titan X, which works out to roughly 1.7 hours for 500,000 steps.
Did you observe the same slowdown on other envs, or only on LunarLander?
By the way, GPU has higher overhead and is likely to be slower than CPU for small networks. On HalfCheetah, my CPU experiments get ~210-230 steps/sec, but my GPU experiments only get ~125 steps/sec on my desktop.
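For anyone who wants to check CPU vs. GPU throughput on their own setup, a minimal sketch is below. Hiding the GPU via CUDA_VISIBLE_DEVICES is a standard trick that works for both TF 1.x and 2.x; the timing wrapper is only illustrative.

```python
# Minimal sketch: force a CPU-only run of the same training script so CPU vs. GPU
# throughput can be compared directly. The GPU must be hidden before TensorFlow
# initializes, i.e. before the first `import tensorflow`.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import time
import tensorflow as tf

# Expect an empty list here, confirming TensorFlow cannot see the GPU.
print("GPUs visible:", tf.config.experimental.list_physical_devices("GPU"))

start = time.time()
# ... run the unchanged train_eval() / model.learn() call here ...
print("wall-clock seconds:", time.time() - start)
```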
Any updates on this? I am not using the environments mentioned, but the possibility of internal issues in the framework is concerning.
I tried with a custom environment, and it seems the GPU performs slower than the CPU with the latest tf-agents version.
I have not done the exact testing all of you would want, but I do have some data that I think is useful. I want to stress this is not remotely apples-to-apples. I have been doing perf testing with TF since before 1.x, and I dislike sharing data that does not answer a question exactly; but I have found some data can be better than no data. I have also learned direct comparisons are really hard, and using common envs (like MuJoCo HalfCheetah in the case of many RL scenarios) as the starting point is the best approach.
For the new SAC example we run nightly internal tests, and I have published the results with full event logs. The runs were on CPU. As I said, not the GPU numbers you want.
I did a test with stable-baselines, but keep in mind I am not an expert on stable-baselines. On CPU I saw a 10-20% (1.1x to 1.2x) performance difference, with stable-baselines being slightly faster. It was a sloppy test; aligning how often each library evals, and making sure eval happens inline in both, leaves room for error.
I also noticed stable-baselines seems to be getting ~14K for HalfCheetah, which is confirmed in one of their GitHub issues and my own test run. But I copied the PyBullet env info for HalfCheetah while I was using MuJoCo, and I have no idea how much that impacts the results. I am not throwing shade; we (anyone doing RL) are all in this together. My results below also show 14K, but I found an error in that we were not using the GreedyPolicy for eval, and the results now match the paper at 15K; I just have not had time to publish all the numbers again. It is possible stable-baselines has a similar mistake; it is not a big deal and I did not have time to look into it.
Again, not the GPU numbers you want, but in this example it took ~2 hours to reach 500K steps, and we eval inline every 10K steps with 30 episodes per eval (yes, 30 is not what the paper uses). My rough napkin estimate is 50 evals at ~1 minute each, or ~50 minutes spent on eval. So let's say the run took ~1 hour to 500K steps of HalfCheetah on 16x 2.2GHz vCPUs (8 physical cores, 8 logical) without eval, and ~2 hours when evaluating every 10K steps with 30 episodes averaged (again, not per the paper; 1 episode is correct). A quick check of the other results: HalfCheetah, Hopper, and Walker2d are similar, with Ant taking much longer at ~3 hours minus eval time. None of these are LunarLanderContinuous-v2, and given how much longer Ant takes, the env clearly matters.
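To make that napkin math explicit, the sketch below just writes out the same estimate; all inputs are the rough figures quoted above, not new measurements.

```python
# Rough reconstruction of the back-of-the-envelope estimate above. Every input is
# the approximate figure quoted in the comment, not new benchmark data.
total_env_steps = 500000      # 500K training steps
eval_interval = 10000         # eval runs inline every 10K steps
minutes_per_eval = 1.0        # ~1 minute per 30-episode eval
train_only_hours = 1.0        # ~1 hour of pure training on 16 vCPUs

num_evals = total_env_steps // eval_interval        # 50 evals
eval_hours = num_evals * minutes_per_eval / 60.0    # ~0.83 hours spent on eval
total = train_only_hours + eval_hours
print("estimated total:", total, "hours")           # ~1.8, close to the ~2 hours observed
```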
Informally, I tested tf-agents on GPU for HalfCheetah on a GTX 1080. I got a modest performance increase on my workstation using the same batch size as on CPU. I suspect larger batch sizes are needed to go faster.
I hope to make time to do some GPU testing of stable-baselines; if there is a big difference, we need to look into tf-agents, since both libraries use TensorFlow (v1 vs. v2, I think). I am sorry I have not been able to do this exact test. When I test next time I will try to toss in LunarLanderContinuous-v2, but it is not one of the main envs we test against.
Hello, are there any solutions or workarounds for the speed issue with tf-agents?
Help is much appreciated.
On CPU I did not see a huge performance difference testing on the same hardware. I have not been able to go back and test GPU apples-to-apples. We are forming a plan to test the top agents (likely SAC and PPO) against some other leading libraries in Q1 and address any gaps. If you have a specific use case, that is the most useful starting point. Doing perf work is always difficult, and I prefer to stick with standard agents so there is a decent chance of apples-to-apples comparisons. I was the perf owner for TensorFlow, and the hardest part was getting apples-to-apples before narrowing down the difference. Even something as seemingly obvious as ResNet took ages before people were testing the exact same network. SAC + HalfCheetah or another env from the paper is the best scenario. I/we could do others, but it is more difficult and less universally valuable.
This matters to us; I have not found a straightforward use case yet and plan to dig deeper in Q1.
This is something we care about. Not to distract you, but we just released full testing of the PPO agent on MuJoCo. I am a few weeks away from numbers for the distributed setup (focused on distributed collect).
I know this puts a lot of work on you, but if you have a clear example of X vs. Y with TF-Agents much slower, that is actionable. @yasser-h-khalil
Have we resolved @pirobot's original concern? If so, we should close this ticket and ask folks to create new ones with their specific code.
I am going to close this as it is misleading. If there is an exact use case, let's get on it. We currently do not have proof of a performance gap, but we are going to use our time in early 2021 to look for one to be sure. As stated above, the process is really hard because apples-to-apples is not as obvious as it seems. If anyone seeing this has a specific use case, post it and mention @tfboyd. Perf is hard, and it needs to be reproducible for both TF-Agents and the tool that is stated to be faster. Having to dig into another tool to get an exact reproduction is a big time sink. I wish it were easier.
IMO the benchmarks Toby linked to show that there is no longer a 10x performance drop; but again, please submit any new regressions as a new bug with repro instructions.
I am running a simple test of SAC using the LunarLanderContinuous-v2 environment. Training is for 500,000 steps with a replay buffer of size 50,000 (see code below). tf-agents takes over 10 hours to complete training, whereas the stable-baselines implementation of SAC using the same hyperparameters takes only 39 minutes. I've checked and double-checked my versions of CUDA, tensorflow-gpu, tf-agents, etc., and cannot speed things up.
Here are the details to reproduce:
Ubuntu 16.04, tf-agents==0.3.0, tensorflow-gpu==1.15.0, gym==0.15.4, CUDA==10.0, cudnn==7.6.5, stable-baselines==2.9.0a0, GPU: Quadro M4000 (8 GB), CPU: i7, RAM: 64 GB
My tf-agents test script is simply the v2 train_eval.py script from the sac/examples after substituting the LunarLanderContinuous-v2 environment for HalfCheetah and changing the hyperparameters as you can see below:
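The edited script itself is not reproduced here. Purely as a sketch of the kind of override described, and assuming train_eval() in a local copy of the 0.3.0 sac/examples/v2 script exposes env_name, num_iterations, and replay_buffer_capacity as keyword arguments (root_dir is a placeholder), the change amounts to something like:

```python
# Illustrative sketch only -- not the original script. Assumes a local copy of
# tf_agents/agents/sac/examples/v2/train_eval.py whose train_eval() accepts
# env_name, num_iterations, and replay_buffer_capacity keyword arguments.
import train_eval  # the copied/edited example script

train_eval.train_eval(
    root_dir="/tmp/sac_lunarlander",       # placeholder output directory
    env_name="LunarLanderContinuous-v2",   # substituted for HalfCheetah
    num_iterations=500000,                 # 500,000 training steps
    replay_buffer_capacity=50000,          # replay buffer of size 50,000
)
```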
My stable-baselines script looks like this:
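The original script is likewise not included here; a minimal stable-baselines 2.x equivalent, reflecting only the environment, buffer size, and step count mentioned above (all other hyperparameters left at library defaults, which may differ from the actual script), would look roughly like:

```python
# Minimal sketch of an equivalent stable-baselines (2.x) SAC run. Only the
# environment, replay buffer size, and step count from the report are set;
# remaining hyperparameters stay at library defaults.
import gym
from stable_baselines import SAC
from stable_baselines.sac.policies import MlpPolicy

env = gym.make("LunarLanderContinuous-v2")
model = SAC(MlpPolicy, env, buffer_size=50000, verbose=1)
model.learn(total_timesteps=500000)
```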
Finally, here is the output when I run the tf-agents script to show that the GPU is being detected and used:
And the output from nvidia-smi while running the script: