rlworkgroup / garage

A toolkit for reproducible reinforcement learning research.
MIT License

PyTorch on CPU is slower than TF #1020

Open ryanjulian opened 4 years ago

ryanjulian commented 4 years ago

See https://github.com/pytorch/pytorch/issues/975 for more info

PyTorch TRPO appears to be about 50% slower than TF TRPO. Not sure about PPO, but I expect a similar wall-clock gap there.

To fix this issue, make PyTorch perform at least as well as TF, or confirm that we've done the best we can on CPU with PyTorch.

naeioi commented 4 years ago

The issue you linked from pytorch was fixed quite a while ago. If garage's PyTorch code is slower than TF, it most likely has something to do with our implementation.

@lywong92 Since you added the PyTorch DDPG and PPO, did you observe any performance difference against TF? That would tell us whether this is a PyTorch problem in general or specific to TRPO.

lywong92 commented 4 years ago

> The issue you linked from pytorch was fixed quite a while ago. I think if garage's pytorch is slower than TF then most likely it has something to do with our implementation.
>
> @lywong92 Since you added pytorch DDPG and PPO, I want to know if you have any observation on performance against TF? So we can know if this has something to do with pytorch in general or just TRPO itself.

I didn't pay much attention to the actual time it took to run DDPG in torch vs. tf. Are we comparing the total wall-clock time to run the algorithms with the same parameters here?
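For an apples-to-apples comparison, a minimal timing harness might look like the sketch below. The `run_torch_trpo` / `run_tf_trpo` names are hypothetical placeholders for full training runs configured identically (same env, batch size, epochs, and seed):

```python
import time

def time_it(fn, repeats=3):
    """Run fn() several times and return the best wall-clock time in seconds.

    Taking the minimum over repeats reduces noise from other processes.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # e.g. one full training run with fixed seed and hyperparameters
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical usage (both runners configured with identical hyperparameters):
# torch_s = time_it(run_torch_trpo)
# tf_s = time_it(run_tf_trpo)
# print(f"torch/tf wall-clock ratio: {torch_s / tf_s:.2f}")
```

Reporting the ratio of best-of-N times makes the comparison robust to one-off slowdowns from the OS or other processes.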

alex-petrenko commented 4 years ago

This might be unrelated, but we found big performance differences between llvm-openmp and intel-openmp. Weirdly enough, this is observed even when we use the GPU for both the forward and backward passes.

A strange conda dependency issue is causing this (e.g., the newest libgcc runtime depends on the `_openmp_mutex` package, which pulls in the LLVM OpenMP runtime instead of the Intel one). It's worth checking which OpenMP implementation you're actually using.
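One way to see which parallel backend PyTorch was built against (assuming a reasonably recent PyTorch install) is `torch.__config__.parallel_info()`:

```python
import torch

# Prints the parallel-backend summary: intra-/inter-op thread counts and
# the OpenMP / MKL versions PyTorch links against.
print(torch.__config__.parallel_info())
```

Note this reports what PyTorch was *built* with; which OpenMP shared library is actually loaded at runtime still depends on your environment (e.g. the conda packages mentioned above).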

jamesborg46 commented 3 years ago

I've encountered slow PyTorch runs on CPU before that were fixed by artificially limiting the number of threads with a call such as `torch.set_num_threads(4)`. I'm not sure exactly why, but it seems PyTorch sometimes picks a poor default thread count.
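For reference, the workaround above in full (the thread count `4` is just an example; tune it per machine):

```python
import torch

# Cap intra-op parallelism. PyTorch often defaults to one thread per core,
# which can oversubscribe the CPU and slow down the many small ops typical
# of RL workloads.
torch.set_num_threads(4)
print(torch.get_num_threads())  # -> 4
```

Setting the `OMP_NUM_THREADS` environment variable before launching the process has a similar effect, and also constrains other OpenMP users in the same process.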