talmo / leap

LEAP is now deprecated -- check out its successor SLEAP!
https://sleap.ai
Apache License 2.0

running out of GPU memory #10

Closed: kbakhurin closed this issue 5 years ago

kbakhurin commented 5 years ago

Hi Talmo,

I am trying to set up LEAP on a smaller workstation and am running into memory issues.

Here is just a small portion of the output I get after trying to train the network:

totalMemory: 4.00GiB freeMemory: 3.29GiB
2018-11-05 18:28:56.449298: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1312] Adding visible gpu devices: 0
2018-11-05 18:28:59.085527: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3027 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-11-05 18:29:01.817627: W C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.55GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

Is the card just not usable? I will try downsampling the video to see if that helps.
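In case it's useful to anyone else hitting this, downsampling the video itself is straightforward with OpenCV. A minimal sketch (the file names and the factor of 2 are placeholders, nothing LEAP-specific):

```python
# Sketch: halve every frame of a video with OpenCV before training.
# File names and the downsampling factor are placeholders.
import cv2

cap = cv2.VideoCapture("session.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH) // 2)
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT) // 2)

out = cv2.VideoWriter("session_half.mp4",
                      cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # INTER_AREA is the usual interpolation choice when shrinking.
    out.write(cv2.resize(frame, (w, h), interpolation=cv2.INTER_AREA))

cap.release()
out.release()
```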

Thanks, Konstantin

talmo commented 5 years ago

Hi Konstantin,

You can definitely downsample to get around it but you can also try reducing the batch size in the training parameters. Try 8 or 16.
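To make that concrete, here's a rough sketch of the two knobs that usually matter on a small card: the batch size passed to Keras, and TensorFlow's up-front memory grab. This is generic Keras/TF 1.x usage with a stand-in model, not LEAP's actual training code:

```python
# Sketch only: generic Keras/TF 1.x settings for a 4 GB GPU.
# The tiny model and zero-filled data are stand-ins for LEAP's network.
import numpy as np
import tensorflow as tf
from keras import backend as K
from keras.models import Sequential
from keras.layers import Conv2D

# Let TensorFlow allocate GPU memory on demand instead of all at once.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))

model = Sequential([Conv2D(8, 3, padding="same", input_shape=(208, 336, 1))])
model.compile(optimizer="adam", loss="mse")
X = np.zeros((64, 208, 336, 1), dtype="float32")
Y = np.zeros((64, 208, 336, 1), dtype="float32")

# The main lever: 32 full-size frames per batch may not fit in 4 GB,
# but 8 or 16 usually will.
model.fit(X, Y, batch_size=8, epochs=1)
```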

How big are your frames now?

kbakhurin commented 5 years ago

Hi Talmo,

The sizes were Height: 209 and Width: 338.

Downsampling by half got the training to run. Which parameters do you mean exactly? Is that the number of clusters?

Konstantin

talmo commented 5 years ago

My apologies Konstantin! I thought I had already replied to this. Did you manage to get the issue solved?

If not, this is what I was going to suggest: [attached screenshot]

I'll also mention that you might have issues with the size of 338x209. The dimensions of the images need to be divisible by 4 (336x208 would be ok).
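For example, cropping down to the nearest multiple of 4 is only a couple of lines of NumPy (a sketch, assuming frames are height x width arrays):

```python
import numpy as np

def crop_to_multiple_of_4(frame):
    """Trim a frame so both spatial dimensions are divisible by 4."""
    h, w = frame.shape[:2]
    return frame[:h - (h % 4), :w - (w % 4)]

frame = np.zeros((209, 338))                # a 209x338 frame
print(crop_to_multiple_of_4(frame).shape)   # -> (208, 336)
```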

kbakhurin commented 5 years ago

OK, I will try that. I'd prefer not to downsample since it reduces the resolution of the position data.

Is it a problem that the graphics card on one of my computers has 4GB of memory while the card on another has 11GB?

Would you suggest performing all analysis on one computer for consistency? Or, if I adjust the parameters, is it possible to use different machines?

Also, would you consider writing up a quick cheat-sheet for the parameters and their meaning?

Love the program by the way.

talmo commented 5 years ago

Hi Konstantin,

I know what you mean about the downsampling, but you'd be surprised at how accurate you can still be at lower resolutions. Check this out: [attached figure]

In any case, I was just suggesting cropping/padding the edges by 1-2 pixels so hopefully it won't make too much of a difference.

GPUs: yes, you can definitely use it on both computers, and no changes should be needed. The only caveat is that you might need to decrease the batch size during training on the smaller GPU, and there may be some effect on the training procedure itself. I haven't noticed any issues when training with a smaller batch size, but it always depends on the data, so it's hard to say. Let me know if you notice any big differences!
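If you want to sanity-check what each machine has before training, TensorFlow can report the memory it sees on each GPU. A quick sketch using TF 1.x's device listing (the batch-size cutoff is just a rule of thumb, not a LEAP default):

```python
# Sketch: list the GPUs TensorFlow sees and their usable memory.
from tensorflow.python.client import device_lib

for d in device_lib.list_local_devices():
    if d.device_type == "GPU":
        gb = d.memory_limit / 1024 ** 3
        print("%s: %.1f GB" % (d.physical_device_desc, gb))
        # Rule of thumb only: smaller cards want smaller batches.
        print("suggested batch size:", 32 if gb > 8 else 8)
```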

Parameters: yes, I had a short description in the tutorial, but I need to update it. For what it's worth, the rest of the params don't have much of an effect on final performance, and the defaults work for nearly all of the datasets I've experimented with. Will get on that documentation soon though!

Let me know if you have any issues training/running stuff (especially across GPUs) when you get a chance!

Cheers,

Talmo