hammerlab / cytokit

Microscopy Image Cytometry Toolkit

Using more than two GPUs #23

Closed mkeays closed 4 years ago

mkeays commented 4 years ago

Hi @eric-czech ,

I'm now trying to use Cytokit to process a CODEX dataset with 20 cycles, and I'm running into the following issue when using two GPUs:

```
2020-01-29 06:12:55,608:INFO:43129:cytokit.exec.pipeline: Loaded tile 33 for region 1 [shape = (20, 11, 4, 1440, 1920)]
2020-01-29 06:12:55,609:INFO:43129:cytokit.ops.drift_compensation: Calculating drift translations
2020-01-29 06:12:56,185:INFO:43125:cytokit.exec.pipeline: Loaded tile 1 for region 1 [shape = (20, 11, 4, 1440, 1920)]
2020-01-29 06:12:56,186:INFO:43125:cytokit.ops.drift_compensation: Calculating drift translations
2020-01-29 06:13:30,956:INFO:43125:cytokit.ops.drift_compensation: Applying drift translations
2020-01-29 06:13:30,968:INFO:43129:cytokit.ops.drift_compensation: Applying drift translations
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 33.74 GB -- Worker memory limit: 48.00 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 33.98 GB -- Worker memory limit: 48.00 GB
```

I tried processing a 10-cycle subset of the data, and this worked as expected.

I then tried to process the full 20-cycle dataset on a node with more GPUs in case that would help (setting gpus: [0, 1, 2, 3, 4, 5, 6, 7]). However, for some reason, although 8 workers were set up, Cytokit still appeared to use only two of them (note "Loaded tile 1" and "Loaded tile 33", out of 63 tiles total):

```
2020-01-29 06:11:41,623:INFO:42970:root: Execution arguments and environment saved to "output/processor/execution/202001291111.json"
2020-01-29 06:11:50,772:INFO:42970:cytokit.exec.pipeline: Starting Pre-processing pipeline for 8 tasks (8 workers)
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
2020-01-29 06:12:55,608:INFO:43129:cytokit.exec.pipeline: Loaded tile 33 for region 1 [shape = (20, 11, 4, 1440, 1920)]
2020-01-29 06:12:55,609:INFO:43129:cytokit.ops.drift_compensation: Calculating drift translations
2020-01-29 06:12:56,185:INFO:43125:cytokit.exec.pipeline: Loaded tile 1 for region 1 [shape = (20, 11, 4, 1440, 1920)]
2020-01-29 06:12:56,186:INFO:43125:cytokit.ops.drift_compensation: Calculating drift translations
2020-01-29 06:13:30,956:INFO:43125:cytokit.ops.drift_compensation: Applying drift translations
2020-01-29 06:13:30,968:INFO:43129:cytokit.ops.drift_compensation: Applying drift translations
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 33.74 GB -- Worker memory limit: 48.00 GB
```

How can I get it to use all of the GPUs? Do you think it would help with the memory issue even if it does use them?

Thanks, Maria

eric-czech commented 4 years ago

Hi @mkeays ,

You should set the memory_limit argument in the config, at the same level as gpus (i.e. under processor.args), to something higher. I don't know how much memory the machine has, but "64G" would be a good start.
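For example, a minimal sketch of the relevant config section (assuming a YAML experiment config; the surrounding keys will differ per experiment):

```yaml
processor:
  args:
    gpus: [0, 1]        # one dask worker process is started per GPU
    memory_limit: 64G   # per-worker memory limit for the dask backend
```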

A separate process (via the dask backend) is spun up for every GPU to process individual tiles, so adding more GPUs doesn't actually decrease the memory usage that the limit applies to. The limit should be set a little below the total system memory divided by the number of GPUs used for processing. Assuming 64G per worker is enough, the system would need at least 128G total for two GPUs, or you may have to fall back to a single GPU.
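To make that arithmetic concrete, here is a hypothetical helper (not part of Cytokit) that derives a per-worker limit from total RAM and GPU count:

```python
import math

def per_worker_limit(total_ram_gb: int, n_gpus: int, headroom: float = 0.9) -> str:
    """Suggest a per-worker memory_limit string, leaving some headroom for the OS."""
    return f"{math.floor(total_ram_gb * headroom / n_gpus)}G"

# Two GPUs on a 128G node: each worker can get roughly 57G.
print(per_worker_limit(128, 2))  # -> "57G"
# Eight GPUs at 48G each would need roughly 384G of total RAM.
print(per_worker_limit(384, 8))  # -> "43G"
```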

I think it's unlikely that the 8 GPUs weren't all being used in that scenario, since each GPU is tied to its own worker and 8 workers started. It's not impossible there's some kind of bug there, but my guess is that you don't have 8 × 48G = 384G of RAM on the system, and you only saw messages from worker 1 and worker 5 before everything started running out of RAM to work with.

mkeays commented 4 years ago

Hi @eric-czech, thanks very much for the advice. I'll try setting the memory limit and see what happens.

mkeays commented 4 years ago

Just to update: increasing memory_limit did fix this -- I still got the "Memory use is high..." warning, but the processor was not killed and managed to continue. Thanks!

eric-czech commented 4 years ago

Hey, awesome! You could keep increasing it just to hide the warnings, which shouldn't be unsafe now that you know everything fits, but there is also a way to keep dask from emitting warnings when a threshold is reached, or at least to change that threshold (though I don't recall what it is off the top of my head). It may be an environment variable or something else that saves you from modifying the code, but I'm not 100% sure.
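For reference, a sketch of how those thresholds can be tuned in dask.distributed without touching any code, via ~/.config/dask/distributed.yaml (the fractions shown are the library defaults; exact keys may vary by dask version):

```yaml
# ~/.config/dask/distributed.yaml
distributed:
  worker:
    memory:
      target: 0.60     # fraction of the limit at which data starts spilling to disk
      spill: 0.70      # fraction at which spilling becomes aggressive; the
                       # "Memory use is high..." warning comes from this check
                       # when there is nothing left to spill
      pause: 0.80      # fraction at which worker threads are paused
      terminate: 0.95  # fraction at which the worker is restarted
```

The same settings can also be passed as environment variables, e.g. DASK_DISTRIBUTED__WORKER__MEMORY__SPILL=0.9, which matches the environment-variable route guessed at above.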