hammerlab / flowdec

TensorFlow Deconvolution for Microscopy Data
Apache License 2.0

Impractical GPU memory requirements #43

Open SebastienTs opened 1 year ago

SebastienTs commented 1 year ago

While indeed extremely fast, the GPU memory requirement is impractical on my setup: about 8 GB for a 1024x1024x19 image (16-bit) and a tiny 32x32x16 PSF. For images slightly larger than 1024x1024 (same number of Z slices), I can only run the code on an RTX 3090 (24 GB)!

The problem seems to stem from the FFT CUDA kernel. The error reported is:

```
tensorflow/stream_executor/cuda/cuda_fft.cc:253] failed to allocate work area.
tensorflow/stream_executor/cuda/cuda_fft.cc:430] Initialize Params: rank: 3 elem_count: 32 input_embed: 32 input_stride: 1 input_distance: 536870912 output_embed: 32 output_stride: 1 output_distance: 536870912 batch_count: 1
tensorflow/stream_executor/cuda/cuda_fft.cc:439] failed to initialize batched cufft plan with customized allocator:
```
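For scale: if the `input_distance` in that log is counted in complex64 elements (8 bytes each, as TensorFlow FFTs use), then a single buffer at that stride is already enormous. This reading is an assumption, not something the log states explicitly:

```python
# input_distance from the log above; assuming it is counted in elements
# and the FFT runs in complex64 (8 bytes per element), a single buffer
# at that stride works out to exactly 4 GiB.
elements = 536_870_912                  # 2**29, from the cuFFT log
work_area_gib = elements * 8 / 2**30
print(f"{work_area_gib:.1f} GiB")       # 4.0 GiB
```

A couple of buffers of that size would be consistent with the ~8 GB observed.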

Something is probably not right in the code... does anybody know of a workaround?

eric-czech commented 1 year ago

Hey @SebastienTs, there are a number of reasons the memory usage is often way more than you might expect:

  1. The PSF is padded to the size of the image, so a smaller PSF does not reduce memory usage
  2. TensorFlow FFT operations don't support sub-32-bit types (or at least they didn't when this was written)
  3. The image array is copied in intermediate states
  4. Other tensorflow memory overhead (e.g. as observed in https://github.com/hammerlab/flowdec/issues/32)
  5. Often most importantly, the default padding mode pushes all dimensions up to the next power of 2 (so in your case 1024x1024x19 becomes 1024x1024x32 for both the image and the PSF).
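Points 2 and 5 alone already blow the footprint up considerably. A quick back-of-envelope for this case (a sketch of those two effects only, not flowdec's exact allocation pattern):

```python
# Back-of-envelope for the 1024x1024x19 (16-bit) example.
raw_input = 1024 * 1024 * 19 * 2        # 16-bit input: ~38 MiB on disk/RAM
padded = 1024 * 1024 * 32               # point 5: z padded from 19 up to 32
per_array = padded * 8                  # point 2: complex64 = 8 bytes/voxel
print(raw_input / 2**20, per_array / 2**20)   # 38.0 (MiB in), 256.0 (MiB per FFT array)
```

Richardson-Lucy then keeps several such buffers live at once (image, PSF/OTF, current estimate, intermediates), plus cuFFT work areas, so multi-GiB totals are plausible even for a ~40 MiB input.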

I would suggest you try pad_mode='2357', a more memory-efficient (though sometimes less computationally efficient) method added in https://github.com/hammerlab/flowdec/pull/18.
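To illustrate what 2357-smooth padding buys over power-of-2 padding (an illustrative sketch; the helper below is not flowdec's implementation, and flowdec also adds PSF-sized padding before rounding up):

```python
def next_smooth_2357(n):
    """Smallest integer >= n whose prime factors are all in {2, 3, 5, 7}.

    cuFFT/TF FFTs stay fast on such sizes, which is the idea behind
    pad_mode='2357'. Illustrative only, not flowdec's actual code.
    """
    def is_smooth(m):
        for p in (2, 3, 5, 7):
            while m % p == 0:
                m //= p
        return m == 1
    while not is_smooth(n):
        n += 1
    return n

def next_pow2(n):
    return 1 << (n - 1).bit_length()

# For a z-depth of 19 slices, power-of-2 padding jumps to 32 while
# 2357-smooth padding only needs 20 -- a ~40% saving on that axis.
print(next_pow2(19), next_smooth_2357(19))   # 32 20
```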

Apart from that, the only other practical option is to chunk the arrays as in Tile-by-tile deconvolution using dask.ipynb.
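The tiling idea can be sketched without dask too: split the image into overlapping tiles, deconvolve each independently, and keep only each tile's interior when reassembling (the overlap absorbs edge artifacts from padding each tile on its own). A minimal 2D sketch, with a placeholder `fn` standing in for the per-tile deconvolution call:

```python
import numpy as np

def deconvolve_tiled(img, tile=(512, 512), overlap=32, fn=lambda t: t):
    """Process a 2D image in overlapping tiles and reassemble the result.

    `fn` is a stand-in for the per-tile deconvolution; by default it is
    the identity, so the round-trip can be checked exactly.
    """
    out = np.zeros(img.shape, dtype=np.float32)
    for y in range(0, img.shape[0], tile[0]):
        for x in range(0, img.shape[1], tile[1]):
            # expand the tile by `overlap` on every side (clipped at borders)
            y0, x0 = max(y - overlap, 0), max(x - overlap, 0)
            y1 = min(y + tile[0] + overlap, img.shape[0])
            x1 = min(x + tile[1] + overlap, img.shape[1])
            res = fn(img[y0:y1, x0:x1].astype(np.float32))
            # keep only the interior (non-overlap) region of each tile
            out[y:y + tile[0], x:x + tile[1]] = res[y - y0:(y - y0) + tile[0],
                                                    x - x0:(x - x0) + tile[1]]
    return out
```

The same pattern extends to 3D by tiling only in x/y and keeping the full z-stack per tile, which is usually what limits GPU memory here.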

SebastienTs commented 1 year ago

Thanks a lot for your reply! I had 1, 2 and 5 in mind but even then do you really believe that 3 and 4 could explain the remaining 30x memory overhead (from 270 MB to 8 GB)?

If that is the case I can sleep peacefully, but it sounds like an awful lot to me and I want to make sure that nothing is misconfigured or extremely suboptimal for the TensorFlow version I am using...

I have not seen any noticeable reduction in memory usage by using pad_mode='2357' when invoking fd_restoration.RichardsonLucyDeconvolver.

I would happily consider the cuCIM alternative that is recommended, but unfortunately my code needs to run on a Windows box.

eric-czech commented 1 year ago

Hm, well, 10x wouldn't surprise me too much, but 30x does seem extreme. As for potential TF issues, I really have no idea.

You should take a look at this too if you haven't seen it: https://github.com/hammerlab/flowdec/issues/42#issuecomment-1113233702. Some of those alternatives to this library may be Windows friendly.

joaomamede commented 1 year ago

Have a look in my repo:

https://github.com/joaomamede/mamedelab_scripts/blob/main/notebooks/Google2021_Deconvolve_Live_gui.ipynb

I basically use dask to split the images into tiles and reassemble them again when the GPU memory is not enough.

This is the bioformats version (older, so it might need some tweaks): https://github.com/joaomamede/mamedelab_scripts/blob/main/notebooks/2021deconvolve_live_bioformats.ipynb

They should also run on Google Colaboratory if you'd like to tweak them.

You also need the libraries at: https://github.com/joaomamede/mamedelab_scripts/blob/main/notebooks/libraries/deco_libraries.py

Hope it helps.

I can do 2048x1024 times two on my 6 GB laptop; 2048x2048 usually needs 12 GB of VRAM.

The other option is to enable the "RAM option" that shares system RAM with VRAM; it's still a lot faster than running on normal RAM only.