pnlbwh / CNN-Diffusion-MRIBrain-Segmentation

CNN based brain masking

Fixed GPU Memory use issue and uncontrolled spawning of cpus and processes. #40

Closed RyanZurrin closed 3 months ago

RyanZurrin commented 5 months ago

I fixed this issue by setting TensorFlow (TF) to use dynamic memory allocation instead of its default, which is to allocate all of the GPU memory up front. TF does that by default to reduce memory fragmentation. With this change, I successfully ran two jobs in parallel on a GPU with only 11 GB of memory, which was not possible before.
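For readers unfamiliar with the setting, here is a minimal sketch of enabling dynamic GPU memory allocation with the TensorFlow 2 API. It is not the code from this PR: the helper name `enable_memory_growth` is made up for illustration, and the import is guarded so the snippet also runs on a machine without TensorFlow.

```python
def enable_memory_growth():
    """Ask TensorFlow to grow GPU memory on demand instead of reserving it all.

    Returns the number of GPUs that were configured
    (0 if TensorFlow is not installed or no GPU is visible).
    """
    try:
        import tensorflow as tf
    except ImportError:
        return 0

    configured = 0
    # list_physical_devices is available in TensorFlow 2.1 and later.
    for gpu in tf.config.list_physical_devices("GPU"):
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
            configured += 1
        except RuntimeError:
            # Memory growth must be set before the GPU is initialized.
            pass
    return configured


if __name__ == "__main__":
    print("GPUs set to grow memory on demand:", enable_memory_growth())
```

The call has to happen before any model is built or any tensor is placed on the GPU. Setting the environment variable `TF_FORCE_GPU_ALLOW_GROWTH=true` before launching the script is an alternative that needs no code change.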

tashrifbillah commented 4 months ago

Tashrif has committed directly to Ryan's master branch to resolve the above comments.

tashrifbillah commented 4 months ago

Whitespace changes aside, the above are the two substantial blocks I could find in this PR. Are there any other blocks I should review? @RyanZurrin

RyanZurrin commented 4 months ago

The only other changes are at the beginning of the script, where it checks the TF version and then imports accordingly. That is part of the dynamic allocation change, so you have checked everything important.
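A version check of this kind could look roughly like the sketch below. The function names `memory_growth_api` and `configure_dynamic_allocation` are hypothetical and the actual code in dwi_masking.py may be organized differently; the TensorFlow calls themselves are the standard ones for each major version.

```python
def memory_growth_api(tf_version):
    """Return which API family a TensorFlow version string belongs to."""
    major = int(tf_version.split(".")[0])
    return "tf2" if major >= 2 else "tf1"


def configure_dynamic_allocation(tf):
    """Enable on-demand GPU memory using the API matching tf.__version__.

    `tf` is the already imported tensorflow module.
    Returns the session for TensorFlow 1.x, or None for TensorFlow 2.x.
    """
    if memory_growth_api(tf.__version__) == "tf2":
        # TensorFlow 2.1+: memory growth is a per-device setting.
        for gpu in tf.config.list_physical_devices("GPU"):
            tf.config.experimental.set_memory_growth(gpu, True)
        return None

    # TensorFlow 1.x: memory growth is an option on the session config.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    session = tf.Session(config=config)
    tf.keras.backend.set_session(session)
    return session
```

Usage would be `import tensorflow as tf` followed by `configure_dynamic_allocation(tf)` before any model is loaded.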

tashrifbillah commented 4 months ago

Ryan's latest env_build_commands.md did not use the GPU on the pnl-predict machine either.

RyanZurrin commented 4 months ago

It used the GPU when I was testing it. Maybe I can stop by, and we can go through the steps together.

RyanZurrin commented 4 months ago

As I already mentioned, pipeline_tests.sh requires the FSL env to be activated, and in my experience the script then ran in that env instead of the clean Python env I built.

I did install from a clean Python env, and my bashrc was removed.

RyanZurrin commented 4 months ago

I think the old pipeline_tests.sh requires dependencies that are not needed for the CNN Masking; maybe we can make a cleaner pipeline_tests.sh that does not require so many unneeded dependencies.

tashrifbillah commented 4 months ago

pipeline_test.sh does not use a default environment. You, the user, need to source dcm2niix, ANTs, and FSL before it can run. So pipeline_test.sh is not the issue.

RyanZurrin commented 4 months ago

Yes. When I sourced FSL, it took precedence even over my already activated conda env, so pipeline_test.sh used the Python bundled with FSL and not the one from conda.
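One way to check which interpreter wins after sourcing FSL is a small diagnostic like the sketch below. The helper names `describe_python` and `is_under` are made up for illustration. `CONDA_PREFIX` is set by `conda activate` and `FSLDIR` by the FSL setup; where FSL keeps its Python inside `FSLDIR` varies by FSL version, so the check only tests whether a path lies anywhere under a prefix.

```python
import os
import shutil
import sys


def is_under(path, prefix):
    """Return True if `path` is located inside the directory `prefix`."""
    if not path or not prefix:
        return False
    path = os.path.realpath(path)
    prefix = os.path.realpath(prefix)
    return os.path.commonpath([path, prefix]) == prefix


def describe_python():
    """Collect the interpreter paths and environment roots that matter here."""
    return {
        "running": sys.executable,                # interpreter running this script
        "first_on_path": shutil.which("python"),  # what a shell script would pick up
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
        "fsldir": os.environ.get("FSLDIR"),
    }


if __name__ == "__main__":
    info = describe_python()
    for key, value in info.items():
        print(f"{key}: {value}")
    if is_under(info["first_on_path"], info["fsldir"]):
        print("The python on PATH comes from FSL, not from the conda env.")
```

Running it once before and once after sourcing FSL shows whether `first_on_path` moves out of the conda env.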

tashrifbillah commented 4 months ago

Thank you for the hint. I shall double check soon.

tashrifbillah commented 4 months ago

Tashrif's issue was that he either had not set LD_LIBRARY_PATH or had a different CUDA 12 installation.

However, Tashrif and Ryan established that the new set of install instructions works on both CentOS 7 and Rocky 9 machines.
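A quick way to see whether LD_LIBRARY_PATH points at a CUDA runtime is sketched below. The function name `dirs_with_cuda_runtime` is hypothetical, and looking for `libcudart.so*` is an assumption about what to check: TensorFlow loads several CUDA libraries, and an empty result does not prove CUDA is missing, since libraries can also be found through the system linker cache or pip-installed packages.

```python
import glob
import os


def dirs_with_cuda_runtime(ld_library_path=None):
    """List LD_LIBRARY_PATH entries that contain a CUDA runtime library."""
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")

    hits = []
    for directory in ld_library_path.split(os.pathsep):
        # Skip empty entries produced by leading or trailing separators.
        if directory and glob.glob(os.path.join(directory, "libcudart.so*")):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    found = dirs_with_cuda_runtime()
    print("CUDA runtime found in:", found or "no LD_LIBRARY_PATH entry")
```

If more than one directory is listed, the first one takes precedence, which is one way a different CUDA installation can end up being used.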

As one last try, Tashrif will try to environmentalize the install instructions.

tashrifbillah commented 3 months ago

Tashrif is doing one final review of dwi_masking.py before merging it.

tashrifbillah commented 3 months ago

Merging Ryan's work so I have better control over finalizing a few things.