hammerlab / flowdec

TensorFlow Deconvolution for Microscopy Data
Apache License 2.0

Running out of memory in 4GB Quadro T2000 #33

Closed joaomamede closed 3 years ago

joaomamede commented 4 years ago

Hi, this same code works on my 6 GB and 8 GB NVIDIA RTX cards (personal laptop and microscope computer).

My work laptop has an NVIDIA Quadro T2000 with only 4 GB of VRAM, and there I run into out-of-memory errors (my arrays are 11Z × 2044Y × 2048X).
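For scale, a quick back-of-the-envelope estimate of the buffer sizes involved (pure-Python arithmetic; the complex64 figure matches the 351.31 MiB cuFFT allocation in the log further down):

```python
from math import prod

# One 11 x 2044 x 2048 volume
shape = (11, 2044, 2048)
voxels = prod(shape)                          # 46,047,232 voxels

mib_float32 = voxels * 4 / 2**20              # real-valued input/output buffer
mib_complex64 = voxels * 8 / 2**20            # complex FFT intermediate

print(f"float32:   {mib_float32:.2f} MiB")    # ~175.66 MiB
print(f"complex64: {mib_complex64:.2f} MiB")  # ~351.31 MiB
```

Richardson-Lucy keeps several buffers of this size alive at once (data, kernel FFT, current estimate, intermediates), so a few GiB can disappear quickly on a 4 GB card.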

from flowdec import data as fd_data
from flowdec import restoration as fd_restoration

def observer(img, i, *args):
    # Print progress every 10 iterations
    if i % 10 == 0:
        print('Observing iteration = {} (dtype = {}, max = {:.3f})'.format(
            i, img.dtype, img.max()))

# Also tried: config = tf.ConfigProto(device_count={'GPU': 1}) passed as session_config
algo = fd_restoration.RichardsonLucyDeconvolver(
    n_dims=psfgfp.ndim,
    pad_mode='none',        # also tried pad_min=[1, 1, 1] and np.ones(ndim)
    observer_fn=observer,
).initialize()

I have tried without any padding arguments and with pad_min=[1, 1, 1] as well.

Then I run through my Nd2 files channels and call the algo with:

res0 = algo.run(
    fd_data.Acquisition(data=frames[0], kernel=psfdapi),
    niter=15)

This is the error that the Jupyter notebook prints to my terminal:

2020-06-27 16:36:02.786780: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-27 16:36:02.786992: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-27 16:36:02.787183: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2341 MB memory) -> physical GPU (device: 0, name: Quadro T2000, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-06-27 16:36:03.444695: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-27 16:36:03.672501: W tensorflow/core/common_runtime/bfc_allocator.cc:245] Allocator (GPU_0_bfc) ran out of memory trying to allocate 351.31MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-06-27 16:36:03.672528: E tensorflow/stream_executor/cuda/cuda_fft.cc:249] failed to allocate work area.
2020-06-27 16:36:03.672535: E tensorflow/stream_executor/cuda/cuda_fft.cc:426] Initialize Params: rank: 3 elem_count: 11 input_embed: 11 input_stride: 1 input_distance: 46047232 output_embed: 11 output_stride: 1 output_distance: 46047232 batch_count: 1
2020-06-27 16:36:03.672540: F tensorflow/stream_executor/cuda/cuda_fft.cc:435] failed to initialize batched cufft plan with customized allocator: 
[I 16:36:10.592 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
kernel 0a7ff11c-97f8-41f9-b738-8ca7b0f86ab5 restarted

Anything I can do?

Thank you for flowdec!

joaomamede commented 4 years ago

I managed to analyze the data in 1022x1022 chunks following your guide at: https://github.com/hammerlab/flowdec/blob/master/python/examples/notebooks/Tile-by-tile%20deconvolution%20using%20dask.ipynb

I tried to pass tf.ConfigProto options to make TensorFlow use my non-GPU RAM, but it did not work. If you think chunked processing is the more appropriate path, I'll do that, but is there any other solution? I'd like the code to be identical on all my computers, to avoid inconsistencies in the analysis.
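For reference, the option I was passing looks roughly like this (the TF 1.x class is tf.ConfigProto, reachable as tf.compat.v1.ConfigProto on newer installs; note that allow_growth only makes GPU allocation incremental — it does not let TensorFlow spill into host RAM):

```python
import tensorflow as tf

# TF 1.x-style session config; flowdec accepts it via session_config=...
tf1 = tf.compat.v1 if hasattr(tf.compat, "v1") else tf
config = tf1.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory incrementally

# algo = fd_restoration.RichardsonLucyDeconvolver(
#     n_dims=3, session_config=config).initialize()
```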

eric-czech commented 4 years ago

Hi @joaomamede,

A few thoughts:

I would be surprised if a float32 11x2044x2048 image can't be deconvolved in 4G with pad_mode='none', but doing it without any padding is probably not a good idea if there is a lot of information at the boundaries. You're probably better off going the chunking route so you have more control over it, unfortunately.
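The chunking route can also be sketched without dask as a plain tile loop over Y and X with a few pixels of overlap; `tile_fn` below is a hypothetical stand-in for a per-tile `algo.run(...)` call, not a flowdec API:

```python
import numpy as np

def deconv_tiled(data, tile_fn, tile=1022, pad=6):
    """Run tile_fn on overlapping YX tiles of a ZYX stack and stitch the
    central (unpadded) region of each result back together."""
    z, h, w = data.shape
    out = np.empty((z, h, w), dtype=np.float32)
    for y0 in range(0, h, tile):
        for x0 in range(0, w, tile):
            y1, x1 = min(y0 + tile, h), min(x0 + tile, w)
            # Expand the tile by `pad` pixels of context, clipped to the image
            ya, xa = max(y0 - pad, 0), max(x0 - pad, 0)
            yb, xb = min(y1 + pad, h), min(x1 + pad, w)
            res = tile_fn(data[:, ya:yb, xa:xb])
            # Keep only the central (non-overlap) region of the padded result
            out[:, y0:y1, x0:x1] = res[:, y0 - ya:y0 - ya + (y1 - y0),
                                          x0 - xa:x0 - xa + (x1 - x0)]
    return out

# Sanity check with a trivial tile_fn (doubling) in place of deconvolution
stack = np.random.rand(3, 10, 10).astype(np.float32)
result = deconv_tiled(stack, lambda t: t * 2, tile=4, pad=1)
print(np.allclose(result, stack * 2))  # True
```

Smaller tiles trade GPU memory for more overlap overhead, so `tile` should be tuned to whatever the card can hold with padding included.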

joaomamede commented 4 years ago

I was forced onto RHEL by my institution and it's a library nightmare. Flowdec works on this computer only with TensorFlow 2.0 (1.14 and 1.15 can't find the correct CUDA libs).

I pulled out my other (Ubuntu) laptop's SSD and will run it in this 4 GB machine with the libraries that I know work, to see whether that might be the source of the problem.

Just to be sure, I should use TF 1.14 with Cuda 10.1 correct?

Thanks for your help.

eric-czech commented 4 years ago

Yep I think 10.1 is the right cuda version for TF 1.14 (and I would use that). If it's an option on your laptop, I'd also suggest installing docker and doing a docker pull tensorflow/tensorflow:1.14.0-gpu-py3 to get a container that would have everything in order for you already (as far as CUDA toolkit installation goes). That's also a much saner way to manage multiple cuda versions if you need them, e.g. if you want to try different TF versions.

joaomamede commented 4 years ago

conda's tensorflow-gpu==1.14 worked with my already-installed CUDA 10.1 libraries (the problem was the build from pip), without any padding options (only flowdec's defaults).

It runs with 512x512 chunks, but if I use 1022x1022 (with (0, 6, 6) overlaps) with:

res1 = arr.map_overlap(
    deconv, depth=(0, 6, 6),
    boundary='reflect',
    dtype='float32').compute(num_workers=1)

sometimes it fails and sometimes it doesn't (I guess this RHEL 8's old GNOME version really is GPU-memory intensive). Is there any way to use shared memory with system RAM, and would that be slower than 512x512 chunks? YacuDecu, for example, had Device, Stream, and Host modes. Could TensorFlow be made to handle this differently by passing an argument?
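One workaround I could try is hiding the GPU entirely, so TensorFlow falls back to the CPU path and uses system RAM — much slower, but not limited to 4 GB. A sketch:

```python
import os

# Hide all CUDA devices BEFORE tensorflow/flowdec are imported; TF then
# runs the deconvolution on the CPU, using system RAM instead of VRAM.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# import tensorflow as tf   # must come after the env var is set
# ... build and run the deconvolver as usual
print(os.environ["CUDA_VISIBLE_DEVICES"])  # -1
```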

Thanks for any help. In the end I will run it on my lab's Quadro RTX with 8 GB of VRAM, but I'd like the flexibility to test things on my laptop with the same code.

Here's the whole output from jupyter notebook https://pastebin.com/XQUAkha3

and from the terminal running jupyter https://pastebin.com/41uKpmN8

Note: I hope to make a GUI for FlowDec if I have a bit of time.