talmo / leap

LEAP is now deprecated -- check out its successor SLEAP!
https://sleap.ai
Apache License 2.0

Not sure if I am running on GPU #9

Closed kbakhurin closed 5 years ago

kbakhurin commented 5 years ago

Hello,

I am encountering very long training and prediction times (usually requiring an overnight run). From reading your paper and tutorial, it seems like I might not be using the GPU, but I am not sure. The computer becomes unusable during these times (the mouse does not respond), so I imagine that it is taking up GPU processing power. I noticed that you used a machine with 128GB of RAM; mine only has 16GB. Could that be a limiting factor?

Thank you! Konstantin Bakhurin

talmo commented 5 years ago

Hi Konstantin,

It should definitely not take that long to train and RAM shouldn't be the limitation either.

Figuring out if you're running Keras/Tensorflow on the GPU is a common problem though!

Try typing this out in MATLAB:

>> !python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"

You'll get a bunch of stuff printed out, including a line about which device is being used. The last statement printed out should say "True" if Tensorflow is using the GPU.

Here's what it looks like on mine:

>> !python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"
2018-10-22 14:56:54.362417: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-10-22 14:56:54.805914: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 9.10GiB
2018-10-22 14:56:54.806668: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1484] Adding visible gpu devices: 0
2018-10-22 14:56:55.686152: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-22 14:56:55.686351: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971]      0 
2018-10-22 14:56:55.686478: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 0:   N 
2018-10-22 14:56:55.686744: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/device:GPU:0 with 8795 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
True 
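
If the one-liner is awkward to read, the same check can be written as a short script. This is only a sketch: `tensorflow_gpu_status` is a made-up helper name, not part of LEAP or TensorFlow, and it assumes that newer TensorFlow releases expose `tf.config.list_physical_devices` while older ones (like the 1.x versions discussed here) only have `tf.test.is_gpu_available`.

```python
import importlib


def tensorflow_gpu_status():
    """Return 'gpu', 'cpu-only', or 'tensorflow-missing' (hypothetical helper)."""
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError:
        # TensorFlow is not importable from this interpreter at all.
        return "tensorflow-missing"

    # Assumption: newer releases provide tf.config.list_physical_devices.
    config = getattr(tf, "config", None)
    if config is not None and hasattr(config, "list_physical_devices"):
        return "gpu" if config.list_physical_devices("GPU") else "cpu-only"

    # Older releases: same call as the one-liner above.
    return "gpu" if tf.test.is_gpu_available() else "cpu-only"


if __name__ == "__main__":
    print(tensorflow_gpu_status())
```

Saving this to a file and running it with `!python` from MATLAB should report on the same interpreter that LEAP calls.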

If you're getting something different, it's possible you installed the CPU version or Tensorflow can't use your GPU for some reason. If that's the case, try uninstalling and reinstalling the GPU version from the console:

pip uninstall tensorflow
pip install tensorflow-gpu==1.10.0

You can change the version number (1.10.0 in the command above) to match the CUDA version you have installed, but just give it a go and let me know if you're still having issues.
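
Before reinstalling, it may help to see which of these packages pip has actually installed, since ending up with the CPU-only `tensorflow` package instead of `tensorflow-gpu` is one way to land on the CPU. A rough sketch, assuming Python 3.8 or newer for `importlib.metadata` (older interpreters would need a different lookup); `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages=("tensorflow", "tensorflow-gpu", "h5py")):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed for this interpreter.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```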

Talmo

kbakhurin commented 5 years ago

Hi Talmo,

Thank you for your help. It seems like I wasn't using the GPU and have switched to a new machine where I was able to (I think) successfully stumble through a fresh installation of tensorflow-gpu.

In MATLAB, when I navigate to the LEAP directory and run install_leap, the Python output seems to go smoothly until it gets to importing the leap package itself:

[Screenshot: capture]

I think there is something I need to do with the pyversion function, but I'm not exactly sure.

Do you know what the issue could be?

kbakhurin commented 5 years ago

Hi again,

I went through the training procedure anyway (label_joints) just to see what would happen. Matlab is interrupted at the point where it calls python:

[Screenshot: capture]

I'm not exactly sure what it is asking for, but the h5py directory in my Anaconda3\lib\site-packages\ folder does not contain any of the pxd/pyx files that it is looking for.

Thanks for your help, Konstantin

talmo commented 5 years ago

Hi Konstantin,

Sorry for the delay! I've seen this error once or twice but I've had a hard time figuring out which version of which package exactly causes it. Try updating h5py through MATLAB and let's see if that solves it:

!pip install --upgrade h5py==2.8.0

Let me know if that works for you!

Talmo

kbakhurin commented 5 years ago

Hi Talmo,

Thanks for the reply. I fixed it on my own. In case someone else has these issues: I made a fresh install of Anaconda and then of tensorflow 1.6.0. I also needed to make sure that the PATH variables pointed to Python. Altogether, this seemed to help with the h5py problem. I needed to downgrade to Python 3.6.6 because matplotlib 3.0.0 requires that version of Python.
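
For anyone checking the same PATH issue, a quick way to see which interpreter is actually being launched is to have Python report on itself. The snippet below is a sketch, with `interpreter_report` as a hypothetical helper name; saved as a script and run with `!python` from MATLAB, it shows the interpreter that call resolves to.

```python
import shutil
import sys


def interpreter_report():
    """Describe the running interpreter and the `python` found on PATH."""
    return {
        # Full path of the interpreter executing this script.
        "executable": sys.executable,
        # Major.minor.micro version, e.g. to confirm a 3.6.x install.
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        # First `python` on PATH, or None if PATH has no such entry.
        "python_on_path": shutil.which("python"),
    }


if __name__ == "__main__":
    for key, value in interpreter_report().items():
        print(f"{key}: {value}")
```

If `executable` and `python_on_path` point to different installations, packages installed with one may not be visible to the other.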

Everything is good. Lightning fast training. I actually yelled in amazement when I saw it go!

I do have one more question. What is the difference between the Initialize with Trained Model and Fast Train Network in the label_joints program?

Konstantin

talmo commented 5 years ago

That's great! Sorry -- Python package versions can be a bit of a pain.

The Initialize with Trained Model option lets you run a trained model (e.g., from another dataset or movie) on the current file. This is handy when you want to start off from initial predictions from a previous iteration or set of frames. In contrast, Fast Train Network will train a new network from scratch on the current labels.

Closing this one but feel free to re-open or create another issue if you're having any other problems :). Thanks for sharing the troubleshooting tips!