hammerlab / flowdec

TensorFlow Deconvolution for Microscopy Data
Apache License 2.0

Longer time for first deconvolution in a process #29

Closed chrisroat closed 4 years ago

chrisroat commented 4 years ago

I am running via dask, where a small batch of images (roughly 2-30) is passed to a worker. The worker creates a process for each batch and uses the gpu_options setting to avoid reserving all GPU memory up front. I've noticed that the first deconvolution in a process takes a lot longer than subsequent ones -- several seconds vs. a fraction of a second. Is there any trick to cut back on that overhead?
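
For reference, the per-batch worker function looks roughly like this (a simplified sketch, not the actual pipeline code; process_batch and the shapes are made-up names, and the .data access on the result follows the usual flowdec examples):

import numpy as np
import tensorflow as tf
from flowdec import data as fd_data
from flowdec import restoration as fd_restoration

def process_batch(images, kernel):
    """Runs in a fresh process per batch; the first algo.run call is the slow one."""
    session_config = tf.compat.v1.ConfigProto()
    # Avoid reserving all GPU memory up front.
    session_config.gpu_options.allow_growth = True
    algo = fd_restoration.RichardsonLucyDeconvolver(n_dims=3).initialize()
    return [
        algo.run(fd_data.Acquisition(data=img, kernel=kernel),
                 niter=20, session_config=session_config).data
        for img in images
    ]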

eric-czech commented 4 years ago

You could likely avoid much of that by deconvolving a tiny image once on initialization, and probably eliminate most of the remainder by setting allow_growth to False. I'm fairly certain there is no cleaner way to do it purely through a different configuration, though -- something like the warm-up sketch below.
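
A minimal warm-up sketch of that idea (assuming the standard flowdec API; the tiny dummy shapes and niter=1 are arbitrary):

import numpy as np
import tensorflow as tf
from flowdec import data as fd_data
from flowdec import restoration as fd_restoration

session_config = tf.compat.v1.ConfigProto()
# Pre-allocate rather than growing GPU memory lazily, per the suggestion above.
session_config.gpu_options.allow_growth = False

algo = fd_restoration.RichardsonLucyDeconvolver(n_dims=3).initialize()

# Throwaway deconvolution on a tiny array so the first real call
# doesn't pay the TF/CUDA start-up cost.
warmup = fd_data.Acquisition(data=np.ones((8, 16, 16), dtype=np.float32),
                             kernel=np.ones((3, 3, 3), dtype=np.float32))
algo.run(warmup, niter=1, session_config=session_config)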

I think there are a ton of initialization operations in TF resulting in initial cache misses, one-time CPU-to-GPU transfers, memory allocations (particularly with allow_growth=True), CUDA start-up optimizations (e.g. cuDNN autotune), etc., and I don't think many of those behaviors are meant to be manipulated.

People have logged issues about it for a long time, but I doubt anyone has worked to fix it. It might be worth some googling to see if things have changed, but my guess is that the huge majority of users are still using TF in a batch setting and don't really care about the initial lag. That includes me too, as part of the image processing pipelines that use this project -- can I ask what you're building? Is it some kind of server for an interactive client?

chrisroat commented 4 years ago

The application is an image processing pipeline working on large-scale (terapixel) confocal imaging. There is an interactive way to work with dask, but I don't expect that to be dominant except during debugging.

The overhead is not a show-stopper, as I can just load bigger batches onto each machine and break them up manually in a loop. Other options include dedicating some GPUs to flowdec only, so I don't have to worry about the memory reservations interfering with other GPU-based processing.

Most of the processing is local within an image, so dask is very efficient at breaking it into small chunks and operating on them independently. Making bigger batches in the deconvolution step breaks that paradigm slightly by bottlenecking the data back to a smaller number of machines.

There are a lot of moving parts to optimize here -- it may well be worthwhile to accept that bottleneck in exchange for faster deconvolution. It would mean larger chunks throughout the pipeline, which slows down other steps, but that is still an overall win if the deconvolution is wicked fast.

eric-czech commented 4 years ago

Ah, I see. One other thing that might be worth keeping in mind is that the log2 padding method can be excessive depending on the input chunk dimensions. You may want to try the more optimal padding added as part of https://github.com/hammerlab/flowdec/pull/18 instead if you're still using 65x448x448 chunks. This would likely save a good bit of memory, since log2 padding out to 128x512x512 adds a ton of zeros relative to the original array size. It may not necessarily be faster, though, because FFTs on arrays with power-of-two dimensions use a different, faster algorithm in CUDA. I've seen that trade-off swing both ways, but given that the z dimension is barely clearing 64, it would be worth considering defining the padding manually or using the 2357 padding.

Either can be set with RichardsonLucyDeconvolver(pad_mode='2357') or RichardsonLucyDeconvolver(pad_mode='none', pad_min=[16, 32, 32]) (or whatever padding you want to use per dimension).
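
A quick sketch of both options against the 65x448x448 chunk size mentioned above (if I understand the 2357 mode correctly, it pads to the next size whose prime factors are limited to 2, 3, 5, and 7; the pad_min values and niter are just illustrative):

import numpy as np
from flowdec import data as fd_data
from flowdec import restoration as fd_restoration

# Option 1: pad each dimension up to the next 2/3/5/7-factor size
# instead of the next power of two.
algo_2357 = fd_restoration.RichardsonLucyDeconvolver(n_dims=3, pad_mode='2357').initialize()

# Option 2: turn off automatic padding and give a minimum padding per dimension.
algo_manual = fd_restoration.RichardsonLucyDeconvolver(
    n_dims=3, pad_mode='none', pad_min=[16, 32, 32]).initialize()

acq = fd_data.Acquisition(data=np.ones((65, 448, 448), dtype=np.float32),
                          kernel=np.ones((64, 64, 64), dtype=np.float32))
res = algo_manual.run(acq, niter=20).data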

chrisroat commented 4 years ago

Thanks for pointing out the 2357 padding mode. In my test below with the z-dim at 65, I found it to be a little faster on the start-up deconvolution, but it didn't change the memory usage (as reported by nvidia-smi; I made sure each test started fresh). Is it expected to use the same memory?

from flowdec import restoration as fd_restoration
from flowdec import data as fd_data
import numpy as np
import tensorflow as tf
import time

session_config = tf.compat.v1.ConfigProto()
session_config.gpu_options.allow_growth = True
session_config.gpu_options.per_process_gpu_memory_fraction = 1.0

# GPU memory reported by nvidia-smi for each pad_mode:
# log2 = 8.6GB
# 2357 = 8.6GB
algo = fd_restoration.RichardsonLucyDeconvolver(n_dims=3, pad_mode='2357').initialize()

# Time the first deconvolution in the process.
start = time.time()
acq = fd_data.Acquisition(data=np.zeros((65, 1024, 1024)), kernel=np.zeros((64, 64, 64)))
res = algo.run(acq, niter=20, session_config=session_config)
end = time.time()
print(end - start)

eric-czech commented 4 years ago

I don't know what it would pad 65 out to, but I'm sure it's a lot less than 128 (that kind of change has definitely led to big drops in GPU memory usage when I've tried it in the past). I'd recommend the following:

If you can figure out how much memory is really needed, then it may be wise to set allow_growth=False and cap per_process_gpu_memory_fraction instead. TF may very well be pre-allocating unnecessary space even with allow_growth on, but I'm not familiar enough with 2.x to say for sure.
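
For instance, something along these lines (a sketch only; the 0.5 fraction is a placeholder to be tuned to the measured footprint of the deconvolution):

import tensorflow as tf

session_config = tf.compat.v1.ConfigProto()
# Pre-allocate a fixed slice of GPU memory instead of growing lazily.
session_config.gpu_options.allow_growth = False
# Placeholder cap; tune to what the deconvolution actually needs.
session_config.gpu_options.per_process_gpu_memory_fraction = 0.5

# Then pass it to each run as before, e.g.:
# res = algo.run(acq, niter=20, session_config=session_config)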