labsyspharm / quantification

Quantification module for mcmicro
https://github.com/labsyspharm/mcmicro

Memory fix #8

Closed · JoshuaHess12 closed this 4 years ago

JoshuaHess12 commented 4 years ago

Commits for exporting single-channel CSVs and concatenating them to address memory issues. Command-line arguments for the non-parallel implementation remain the same (still use CommandSingleCellExtraction). An additional argument '--n_jobs' was added for ParallelCommandSingleCellExtraction (use an integer value).
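
For anyone following along, here is a minimal sketch of the export-and-concatenate pattern described above; the function name, argument layout, and temporary-folder handling are illustrative assumptions, not the actual module code:

```python
import os
import tempfile
import pandas as pd

def quantify_by_channel(channel_planes, morphology, out_dir):
    """Hypothetical sketch: write one CSV per channel, then concatenate.

    `channel_planes` is assumed to be an iterable of (channel_name, per-cell
    intensity pandas Series); `morphology` a DataFrame of per-cell mask
    measurements sharing the same cell-ID index.
    """
    tmp_dir = tempfile.mkdtemp(dir=out_dir)  # temporary folder for single-plane CSVs
    csv_paths = []
    for channel_name, intensities in channel_planes:
        path = os.path.join(tmp_dir, f"{channel_name}.csv")
        intensities.rename(channel_name).to_csv(path)  # one small CSV per channel
        csv_paths.append(path)
        del intensities  # drop the per-channel data before loading the next plane

    # Combine the single-channel tables column-wise with the mask measurements.
    per_channel = [pd.read_csv(p, index_col=0) for p in csv_paths]
    return pd.concat([morphology] + per_channel, axis=1)
```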

DenisSch commented 4 years ago

@JoshuaHess12 Awesome! Can you "quantify" how much memory is needed now vs before?

JoshuaHess12 commented 4 years ago

Testing on a small IMC image on my laptop:

Before (single plane): Current memory usage is 1.71296MB; Peak was 10.898458MB
After (single plane): Current memory usage is 1.744194MB; Peak was 10.943114MB
After (parallel single plane, 8 cores): Current memory usage is 1.757832MB; Peak was 10.192733MB

It is strange that memory usage increased slightly after changing the non-parallel implementation. What are your thoughts on this? Exporting single planes and concatenating the results did not seem to benefit memory usage on this small image.
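
Readings in this current/peak format match what Python's built-in tracemalloc module reports; assuming that is how these numbers were collected, a minimal sketch:

```python
import tracemalloc

tracemalloc.start()

# ... run the single-cell extraction here ...

current, peak = tracemalloc.get_traced_memory()  # both values are in bytes
print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
tracemalloc.stop()
```

Note that tracemalloc only counts allocations made through Python's allocator, so memory held by compiled extensions or memory-mapped image data may not show up in these numbers.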

DenisSch commented 4 years ago

I will test on exemplar before merging.

DenisSch commented 4 years ago

New:

Current memory usage is 19.501476MB; Peak was 160.488012MB
Finished 0
Current memory usage is 19.501901MB; Peak was 160.488012MB
Finished 1
Current memory usage is 19.502542MB; Peak was 160.488012MB
Finished 2
Current memory usage is 19.503343MB; Peak was 160.488012MB
Finished 3
Current memory usage is 19.504152MB; Peak was 160.488012MB
Finished 4
Current memory usage is 19.504793MB; Peak was 160.488012MB
Finished 5
Current memory usage is 19.505434MB; Peak was 160.488012MB
Finished 6
Current memory usage is 19.506139MB; Peak was 160.488012MB
Finished 7
Current memory usage is 19.50706MB; Peak was 160.488012MB
Finished 8
Current memory usage is 19.507703MB; Peak was 160.488012MB
Finished 9
Current memory usage is 19.508346MB; Peak was 160.488012MB
Finished 10
Current memory usage is 19.508989MB; Peak was 160.488012MB
Finished 11
Current memory usage is 3.814633MB; Peak was 160.488012MB
Finished exemplar-001

Old:

Current memory usage is 46.337429MB; Peak was 160.486573MB
Finished 0
Current memory usage is 46.831053MB; Peak was 160.486573MB
Finished 1
Current memory usage is 47.324621MB; Peak was 160.486573MB
Finished 2
Current memory usage is 47.818189MB; Peak was 160.486573MB
Finished 3
Current memory usage is 48.311789MB; Peak was 160.486573MB
Finished 4
Current memory usage is 48.805357MB; Peak was 160.486573MB
Finished 5
Current memory usage is 49.298925MB; Peak was 160.486573MB
Finished 6
Current memory usage is 49.792493MB; Peak was 160.486573MB
Finished 7
Current memory usage is 50.286125MB; Peak was 160.486573MB
Finished 8
Current memory usage is 50.779693MB; Peak was 160.486573MB
Finished 9
Current memory usage is 51.273261MB; Peak was 160.486573MB
Finished 10
Current memory usage is 51.766829MB; Peak was 160.486573MB
Finished 11
Current memory usage is 3.790871MB; Peak was 160.486573MB
Finished exemplar-001

DenisSch commented 4 years ago

The usage seems to stay lower and does not increase as rapidly.

JoshuaHess12 commented 4 years ago

Awesome. I think clearing variables had that effect. How does the parallel implementation look on exemplar data?
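
For illustration only, this is the general shape of "clearing variables" between channels; the arrays, measurements, and file names below are toy stand-ins rather than the module's actual code:

```python
import gc
import numpy as np
import pandas as pd

mask = np.random.randint(0, 100, size=(512, 512))  # toy label mask

for idx in range(12):  # one pass per channel
    plane = np.random.rand(512, 512)  # stand-in for reading a single plane from disk
    means = pd.Series(plane.ravel()).groupby(mask.ravel()).mean()  # toy per-cell measurement
    means.rename(f"channel_{idx}").to_csv(f"channel_{idx}.csv")
    del plane, means  # drop references to the large intermediates
    gc.collect()      # so they can be reclaimed before the next channel is loaded
```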

ArtemSokolov commented 4 years ago

Do we need to make any changes on the Nextflow side? (i.e. did the command-line interface change?)

DenisSch commented 4 years ago

@ArtemSokolov Command line stays the same. I am testing the branch now before moving it into master

JoshuaHess12 commented 4 years ago

Command-line arguments are the same, except there is an additional argument in the parallel implementation to indicate the number of jobs to use. Temporary folders are also created to store single-plane CSV files before combining them.

ArtemSokolov commented 4 years ago

Does the parallel implementation spawn multiple processes by calling the other Python script, or is it a single multi-threaded process?

JoshuaHess12 commented 4 years ago

@ArtemSokolov I am almost certain that it is spawning multiple processes instead of running a single multi-threaded process. The current implementation is using the joblib library with the default parameters (i.e. "loky" backend - https://joblib.readthedocs.io/en/latest/parallel.html). Would you recommend that we use a single multi-threaded process instead?
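
For context, a minimal sketch of that joblib pattern with the default process-based "loky" backend; the worker function here is a toy stand-in, not the module's extraction routine:

```python
from joblib import Parallel, delayed

def quantify_channel(idx):
    """Toy stand-in for the per-channel extraction step."""
    return idx * idx

# n_jobs controls how many workers joblib uses; with the default "loky"
# backend each worker runs in a separate process rather than a thread.
results = Parallel(n_jobs=8)(delayed(quantify_channel)(i) for i in range(12))
print(results)
```

Passing backend="threading" to Parallel would turn this into a single multi-threaded process instead, but that mainly helps when the work releases the GIL.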

ArtemSokolov commented 4 years ago

I'm just trying to get a better sense of how it's parallelizing.

I think for local runs, it probably doesn't matter which way you do it. The difference really comes into play on the HMS compute cluster, where you have to specify -c to request multiple cores for a multi-threaded job, or -n to request multiple CPUs for a multi-process one.

Personally, I think multiple processes are more robust, so I wouldn't change it.