hammerlab / flowdec

TensorFlow Deconvolution for Microscopy Data
Apache License 2.0

Flowdec video memory usage #19

Closed JulianPitney closed 5 years ago

JulianPitney commented 5 years ago

Hi,

Not sure if it's alright to post this here, but I looked for an email and couldn't find one.

I'm wondering what the input stack dimension limits are. We have a GTX Titan with 11GB of memory but still need to downsample our stacks quite a bit. Is there a way to deconvolve a stack of dimensions (1440, 1080, 1000) or is that unfeasible with flowdec?

dmilkie commented 5 years ago

I'd recommend splitting the volume into smaller (and slightly overlapping) subarrays and deconvolving each one serially. Then stitch/fuse the results back together.

-Dan


eric-czech commented 5 years ago

Hey @JulianPitney, to follow up on @dmilkie's suggestion, it's also worth keeping the padding behavior in mind if you attempt that method. Assuming there is useful information near the external faces of the volume, padding is necessary there to avoid boundary artifacts.

As a simple example, if you were to split the volume into 4 chunks of roughly (720, 540, 500) and use the default padding behavior, flowdec would attempt to deconvolve (1024, 1024, 512) volumes, a size that may again be too large for GPU memory.

If you were to split the volume into 16 chunks instead, you shouldn't have any issues regardless of the padding method. But to keep the overall runtime as low as possible, you might want to start with something closer to 4 large chunks and use the padding feature recently added by @VolkerH in https://github.com/hammerlab/flowdec/pull/18. His optimization may make it possible to use much larger chunks and still avoid padding-related issues.
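To make that concrete, here is a minimal sketch of how the chunk shape interacts with the pad_mode option (constructor usage per the flowdec README; the '2357' mode string is my reading of what #18 added, so treat it as an assumption):

```python
import numpy as np
from flowdec import data as fd_data
from flowdec import restoration as fd_restoration

chunk = np.random.rand(720, 540, 500).astype(np.float32)  # one of 4 chunks
psf = np.random.rand(64, 64, 64).astype(np.float32)       # illustrative PSF

# Default pad_mode='log2' rounds each axis up to the next power of two,
# so this chunk becomes a (1024, 1024, 512) problem on the GPU.
algo = fd_restoration.RichardsonLucyDeconvolver(3, pad_mode='log2').initialize()

# The mode from #18 pads each axis only up to the next size whose prime
# factors are all small, which stays much closer to the raw chunk shape.
algo = fd_restoration.RichardsonLucyDeconvolver(3, pad_mode='2357').initialize()

res = algo.run(fd_data.Acquisition(data=chunk, kernel=psf), niter=25).data
```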

VolkerH commented 5 years ago

I see that the issue is closed, so this may no longer be relevant for @JulianPitney.
I had a look at it regardless, as I will run into VRAM limitations with some of my datasets as well. I prepared a simple example here that uses dask to process the volume in chunks:

https://github.com/VolkerH/flowdec/blob/tiling_example/python/examples/notebooks/Process%20in%20tiles%20using%20dask.ipynb

Note, however, that while each tile is padded in this example, the tiles do not overlap, and the tile size must be chosen so that an integer number of tiles makes up the whole array.

VolkerH commented 5 years ago

I've been told that dask also supports overlapping computation; I'll have to give that a try: http://docs.dask.org/en/latest/array-overlap.html. My colleagues @jni and @GenevieveBuckley gave some pointers to this and want to be kept in the loop.

VolkerH commented 5 years ago

OK, that was easy... swapping out map_blocks for map_overlap now gives overlapping tiles. I've updated the notebook linked above.
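For readers skimming the thread, a condensed sketch of that pattern (chunk sizes, overlap depth, and iteration count are illustrative, not tuned values):

```python
import numpy as np
import dask.array as da
from flowdec import data as fd_data
from flowdec import restoration as fd_restoration

stack = np.random.rand(1440, 1080, 1000).astype(np.float32)  # full volume
psf = np.random.rand(64, 64, 64).astype(np.float32)

# One graph/algo instance, reused for every tile.
algo = fd_restoration.RichardsonLucyDeconvolver(3).initialize()

def decon_tile(tile):
    # Each tile arrives with its overlap margin attached; edge artifacts
    # land in the margin, which map_overlap trims off afterwards.
    return algo.run(fd_data.Acquisition(data=tile, kernel=psf), niter=25).data

arr = da.from_array(stack, chunks=(720, 540, 500))
# 'synchronous' keeps the tiles strictly serial on the single GPU.
result = (arr.map_overlap(decon_tile, depth=32, boundary='reflect')
             .compute(scheduler='synchronous'))
```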

@eric-czech If you want to include the tiling example in the main repo, let me know and I'll create a pull request.

eric-czech commented 5 years ago

Certainly @VolkerH -- that's an excellent example. I'd appreciate the PR.

JulianPitney commented 5 years ago

Hi guys,

I started on my own chunked deconvolution implementation yesterday, but after reading the example provided by @VolkerH I think I'll be copy/pasting that and modifying it for our needs 👍 Thanks all for responding; very helpful info.

chrisroat commented 4 years ago

Hi there!

This example is great! I have a question about the GPU memory. I notice that you create the algo outside the task function's scope. Is there any way to release memory, say if a dask worker's next task needs the GPU for a different reason?

Even with a solution similar to this one, I found that if I deconvolved hundreds of images, something was eating up my GPU memory and it was eventually exhausted. Should that have happened? (It's always possible a notebook or something else was trying to grab memory.)

Thanks in advance!

eric-czech commented 4 years ago

Hey @chrisroat,

Unless things have changed with 2.x, reliably freeing GPU memory is still probably not possible (cf. #1578). You shouldn't have issues deconvolving hundreds of image volumes, though, as long as you set allow_growth=True for the session, create as many dask workers as there are GPUs, point each worker at one specific GPU only, use the processes dask scheduler (not threading), and create only one "algo" instance per worker.

Here is an example configuration from a related project where I did this: cytokit/op.py#L34 (deconvolving tens of thousands of images over the course of many hours/days was possible like this). That shows at least one way to point workers at a specific GPU in case you're running with multiple.
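For anyone wiring this up with dask.distributed, a sketch of that worker layout (two GPUs assumed; pin_worker_to_gpu is a hypothetical helper, not part of flowdec or cytokit):

```python
import os
from dask.distributed import Client, LocalCluster

# One single-threaded worker process per GPU; process isolation keeps each
# worker's TF session (and its GPU memory) separate from the others.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=True)
client = Client(cluster)

def pin_worker_to_gpu(gpu_index):
    # Must run before TF initializes inside the worker process.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_index)

# Assign each worker its own device index.
for i, addr in enumerate(client.scheduler_info()['workers']):
    client.run(pin_worker_to_gpu, i, workers=[addr])
```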

Passing these configurations to the RichardsonLucyDeconvolver algo instance is shown here. It's important not to create more than one algo instance if you're going to use it repeatedly, because it is basically just a wrapper around the TF graph, which should only be defined once for a given set of decon parameters.

Lastly, I think there is an issue floating around in this repo about parallel deconvolution on the same GPU, but I would definitely recommend against trying that (in case you are considering more dask workers than GPUs). It likely doesn't make sense unless your images are tiny relative to the memory available on the GPU. If you need multiple workers to do other things in parallel but want to lock access to a single algo instance, then this post shows a convenient way to do so.
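Since the post isn't linked here, one generic way to do that kind of locking (assuming a module-level algo shared by threaded tasks):

```python
import threading

_algo_lock = threading.Lock()

def deconvolve_locked(acq, niter=25):
    # Only one thread at a time may run the shared TF graph on the GPU;
    # other tasks on the worker proceed in parallel until they need it.
    with _algo_lock:
        return algo.run(acq, niter=niter).data
```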

chrisroat commented 4 years ago

Thanks @eric-czech! It sounds like it should work out if I follow the example of creating a new process. Nice! Out of curiosity, what is the allow_growth option doing? It seems counter-intuitive that I should set it to True if I'm trying to save memory...

To be clear, would I create the algo between creating the session and closing it? Would flowdec then be tied to that session?

eric-czech commented 4 years ago

With allow_growth enabled, TF allocates GPU memory as needed rather than pre-allocating all of it (by default, without allow_growth, it immediately grabs 100% of GPU memory because per_process_gpu_memory_fraction defaults to 1). This means that without changing allow_growth and/or per_process_gpu_memory_fraction, it isn't possible to run more than one TF process on the same GPU.
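In TF 1.x terms, those two knobs look like this:

```python
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
# ...or cap how much of the GPU a single process may claim up front:
config.gpu_options.per_process_gpu_memory_fraction = 0.5
```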

You should create an algo instance (which initializes the tf.Graph) once per python process (or once per dask worker using the "processes" scheduler) and then simply call .run on that instance (which opens and closes a tf.Session internally). You shouldn't create the sessions yourself.
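A sketch of that lifecycle (the shapes mirror the ones discussed below and are stand-ins; passing the session config through to the deconvolver is shown in the cytokit code linked above):

```python
import numpy as np
from flowdec import data as fd_data
from flowdec import restoration as fd_restoration

psf = np.random.rand(64, 64, 64).astype(np.float32)
images = [np.random.rand(65, 448, 448).astype(np.float32) for _ in range(3)]

# Build the TF graph once per process...
algo = fd_restoration.RichardsonLucyDeconvolver(3).initialize()

# ...then reuse it; each run() opens and closes its own tf.Session.
results = [
    algo.run(fd_data.Acquisition(data=img, kernel=psf), niter=25).data
    for img in images
]
```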

chrisroat commented 4 years ago

So far, I've got this working and am pretty happy. I wanted to run some numbers by you to see if they make sense. In a process, I'm instantiating one algo with the gpu_options set as you indicated. The process deconvolves two 65x448x448 images (uint16) with a 64x64x64 PSF (float) -- which pads each out to 128x512x512, I believe.

The process takes about 7 seconds and consumes 4.6 GB of memory on a Titan RTX with 24 GB of memory. Does the memory usage seem in the right ballpark? For ~34-megavoxel (128x512x512) volumes, the usage seems high to me.

VolkerH commented 4 years ago

Just came across this. From my experience (not from a back-of-the-envelope calculation), video memory usage of ~4 GB for the 65x448x448 volumes seems plausible, depending on the exact padding options. 7 seconds seems long to me, but it depends on the number of iterations (which you didn't mention).
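For what it's worth, a rough back-of-the-envelope that lands in the same range (the buffer count is a guess about Richardson-Lucy intermediates, not a measurement of flowdec internals):

```python
# One padded 128x512x512 volume:
voxels = 128 * 512 * 512              # ~33.6M voxels
real_gib = voxels * 4 / 2**30         # float32 buffer:   ~0.125 GiB
complex_gib = voxels * 8 / 2**30      # complex64 buffer: ~0.25 GiB

# Richardson-Lucy keeps several intermediates alive at once (estimate, OTF,
# forward projection, ratio, correction) plus FFT workspace, so on the order
# of 10-20 such buffers gives roughly:
print(f"~{10 * complex_gib:.1f} to {20 * complex_gib:.1f} GiB")  # ~2.5 to 5.0
```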