neuronets / nobrainer

A framework for developing neural network models for 3D image processing.

Out of Memory Error When Using BrainGenerator for Extended Periods #346

Closed: spikedoanz closed this issue 4 months ago

spikedoanz commented 4 months ago

Description

When using the BrainGenerator from the nobrainer library, I'm encountering an Out of Memory (OOM) error. The error occurs during the generate_brain() method call after it has been running in a loop for over 30 minutes.

Environment

Script using BrainGenerator

from nobrainer.processing.brain_generator import BrainGenerator

DATA_FILES = ["example.nii.gz"]

training_seg = DATA_FILES[0]
brain_generator = BrainGenerator(
    training_seg,
    randomise_res=False,
)
print(f"Generator: SynthSeg is using {training_seg}")
while True:
    img, lab = brain_generator.generate_brain()

Error

  2024-07-19 18:08:13.517419: I external/local_tsl/tsl/framework/bfc_allocator.cc:1114] Stats:
Limit:                      9187426304
InUse:                      8966296064
MaxInUse:                   9054715648
NumAllocs:                      665672
MaxAllocSize:                275268352
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2024-07-19 18:08:13.517447: W external/local_tsl/tsl/framework/bfc_allocator.cc:497] ***************************************************************************************************x
2024-07-19 18:08:13.517479: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at gather_op.cc:158 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[150,256,256,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2024-07-19 18:08:13.517585: W external/local_tsl/tsl/framework/bfc_allocator.cc:485] Allocator (GPU_0_bfc) ran out of memory trying to allocate 112.50MiB (rounded to 117964800)requested by op model/random_spatial_deformation/resize_1/map/while/body/_71/model/random_spatial_deformation/resize_1/map/while/GatherV2_7
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
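
The last log line suggests trying a different allocator. As a hedged aside, that environment variable has to be set before TensorFlow initializes its GPU devices, so placing it before any TensorFlow import is the safe option; a minimal sketch, assuming the reproduction script above:

import os

# Assumption: this only helps if fragmentation (rather than a genuine leak)
# is the cause; it must be set before TensorFlow initializes the GPU.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

from nobrainer.processing.brain_generator import BrainGenerator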

hvgazula commented 4 months ago

@spikedoanz I was hoping this would be a SynthSeg issue 😁. Unfortunately, @sergeyplis mentioned that this is not an issue with the original SynthSeg but only with the version inside nobrainer. Let's see what we can do about it.

Clearly, the error states that the problem is with the random spatial deformation layer. But before we go that far, could you try rerunning your code with the following snippet at the top of your script? Also, do you have memory utilization curves for the GPU?

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all up front.
physical_devices = tf.config.list_physical_devices("GPU")
try:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
except:
    pass

hvgazula commented 4 months ago

Okay, I just noticed... the memory is indeed doubling every 100 examples or so. When I first started the code it was at ~4k, jumped to 8k, and stayed there for a bit before going up to 16k. And I am sure this will keep going. I tried running the standalone SynthSeg version but hit a CUDA/cuDNN mismatch that prevents the GPU from being detected, so it will take a bit to find the right config again.

[screenshots: GPU memory utilization]

hvgazula commented 4 months ago

I guess I spoke too soon earlier. I generated close to 1,200 samples with no growth in memory usage since the last snapshot (see above). So, is it fair to call this a non-issue now?

[screenshot: memory usage after ~1,200 samples]

spikedoanz commented 4 months ago

I'll test this again with the extra lines you added.

spikedoanz commented 4 months ago
[screenshot: memory usage]

Okay, it lasted longer this time (2,044 samples), but it still self-terminated at the end. Memory usage also jumped 5 GB -> 9 GB -> 12 GB -> 15 GB and so on throughout.

hvgazula commented 4 months ago

Is it possible for you to jot down the intervals (or sample indices) at which these jumps happened?

spikedoanz commented 4 months ago

Overall, CPU memory slowly climbs from 4 GB -> 18 GB, though with some dips here and there.

At 52 seconds (sample 36), GPU memory doubles from ~4 GB to ~8 GB (the first number in each log line is the time in seconds since the script started):

52.03 CPU: 4633.57 MB. GPU: 4405.00 MB
1/1 [==============================] - 1s 795ms/step
53.65 CPU: 5976.65 MB. GPU: 8503.00 MB
1/1 [==============================] - 1s 510ms/step


At 700 seconds (sample 536), GPU memory doubles again:

698.85 CPU: 10029.86 MB. GPU: 8503.00 MB
1/1 [==============================] - 1s 503ms/step
700.13 CPU: 10125.91 MB. GPU: 8503.00 MB
1/1 [==============================] - 0s 458ms/step
701.39 CPU: 10269.03 MB. GPU: 8503.00 MB
1/1 [==============================] - 1s 568ms/step
702.76 CPU: 10636.59 MB. GPU: 16695.00 MB
1/1 [==============================] - 1s 502ms/step
704.06 CPU: 10796.60 MB. GPU: 16695.00 MB
1/1 [==============================] - 1s 500ms/step
705.41 CPU: 9996.18 MB. GPU: 16695.00 MB


At 1,300 seconds, at exactly sample 1,000, the program self-terminates, despite not being out of memory:

1/1 [==============================] - 0s 494ms/step
1294.58 CPU: 17998.53 MB. GPU: 16695.00 MB
1/1 [==============================] - 0s 496ms/step
1295.84 CPU: 18143.18 MB. GPU: 16695.00 MB
1/1 [==============================] - 0s 494ms/step
1297.08 CPU: 18255.30 MB. GPU: 16695.00 MB


See full log for more details: nobrainer-synthseg.log

hvgazula commented 4 months ago

Thanks. Can you please point me to the snippet of code to get these numbers (from CPU and GPU), so I can run the same on my end as well?

spikedoanz commented 4 months ago
from time import time
import os

import psutil
import GPUtil

import tensorflow as tf
from nobrainer.processing.brain_generator import BrainGenerator

physical_devices = tf.config.list_physical_devices("GPU")
try:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
except:
    pass

def get_memory_usage():
    # Get CPU memory usage
    process = psutil.Process(os.getpid())
    cpu_mem = process.memory_info().rss / 1024 / 1024  # in MB

    # Get GPU memory usage
    gpus = GPUtil.getGPUs()
    gpu_mem = 0
    if gpus:
        gpu = gpus[0]  # Assuming you're using the first GPU
        gpu_mem = gpu.memoryUsed

    return cpu_mem, gpu_mem

# Get and print memory usage
cpu_usage, gpu_usage = get_memory_usage()
print(f"CPU Memory Usage: {cpu_usage:.2f} MB")
print(f"GPU Memory Usage: {gpu_usage:.2f} MB")

brain_generator = BrainGenerator(
    "example.nii.gz",
    randomise_res=False,
)
start = time()
while True:
    img, lab = brain_generator.generate_brain()
    # Get and print memory usage
    cpu_usage, gpu_usage = get_memory_usage()
    print(f"{(time()-start):.2f} CPU: {cpu_usage:.2f} MB. GPU: {gpu_usage:.2f} MB")
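
As an aside, recent TensorFlow releases can report their own allocator usage via tf.config.experimental.get_memory_info, which avoids the GPUtil dependency; a minimal sketch, noting that it only tracks TensorFlow's allocations rather than the whole process:

import tensorflow as tf

def get_tf_gpu_memory_mb(device="GPU:0"):
    # 'current' and 'peak' are reported in bytes by TensorFlow's allocator.
    info = tf.config.experimental.get_memory_info(device)
    return info["current"] / 1024 / 1024
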
hvgazula commented 4 months ago

Sorry to bother you again, but could you please generate a similar log for the standalone SynthSeg as well? I am having trouble detecting the GPUs on openmind with the original virtual environment. While I managed to do so on my machine at LCN, the generation is excruciatingly slow (stuck at the generation step). I reckon it may have to do with the updated drivers, but I could be wrong. 🤷‍♂️

hvgazula commented 4 months ago

PS: I am afraid I won't be of much help if I cannot set things up to reproduce the issue on my end. 😐

hvgazula commented 4 months ago

Okay, two things:

  1. I managed to detect a GPU with CUDA 10.1 and cuDNN 7.6.4 on openmind. I did this so I could generate samples in the original SynthSeg environment using TF 2.2.0, and, as I mentioned earlier, the generation gets stuck.
  2. I used the current nobrainer environment (TF 2.15 and the latest and greatest CUDA/cuDNN as provided by pip install tensorflow[and-cuda]) and observed similar memory growth with the standalone SynthSeg.

Point 2, coupled with how I integrated SynthSeg, tentatively points to TF (not nobrainer) as the source of the memory leak. Only a successful run of point 1 (the standalone SynthSeg on a GPU with the old drivers) can prove or disprove this with certainty.

Once again, are you 100% sure this leak wasn't observed with the older TF 2.2?

Thoughts from anyone reading this are appreciated. :)

hvgazula commented 4 months ago

I think I found the source of the leak. The keyword is "generator" :). Garbage collection takes a hit with generators for an obvious reason: state preservation. The predict method in tf/keras creates a new data generator with one data item each time it is called and does not release that memory. So, changing here

[image, labels] = self.labels_to_image_model.predict(model_inputs)
yield image, labels

to

[image, labels] = self.labels_to_image_model(model_inputs)
yield image.numpy(), labels.numpy()

will do the job.

See

[screenshot: memory usage]

and

[screenshot: memory usage]

You will notice that the GPU memory jumps after one sample, but otherwise it is consistent from there on. More importantly, there is no increase in CPU memory; it is pretty stable, unlike what was seen before.

For more info, please refer to https://stackoverflow.com/questions/64199384/tf-keras-model-predict-results-in-memory-leak
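
For anyone skimming, here is a minimal, self-contained sketch of the pattern being described, using a hypothetical toy model rather than the actual nobrainer code:

import numpy as np
import tensorflow as tf

# Hypothetical stand-in for the labels-to-image model.
inputs = tf.keras.Input(shape=(4,))
toy_model = tf.keras.Model(inputs, tf.keras.layers.Dense(4)(inputs))
x = np.random.rand(1, 4).astype("float32")

# Leak-prone in a long-running loop: predict() builds per-call machinery
# and, in the affected TF versions, does not release it between calls.
y = toy_model.predict(x)

# The pattern used in the fix: call the model directly and hand back plain
# numpy arrays, so the caller keeps no reference to eager tensors.
y = toy_model(x).numpy()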

PS: My experimentation is not extensive but please test this rigorously and close this issue when you are happy with the results. If you concur with my observations, I will reach out to Benjamin or Eugenio about this leak. I wonder what implications it has for training.

hvgazula commented 4 months ago

Update: Updated the previous comment to add yield image.numpy(), labels.numpy() as well.

hvgazula commented 4 months ago

Another update: I only tested the code with one input. I don't know where things might break if a bunch of label maps (self.batch_size > 1) are provided. Your use case probably entails providing a list of label maps. Once you test that, we can reach out to B and E.

hvgazula commented 4 months ago

[screenshot: memory usage]

hvgazula commented 4 months ago

The one drawback of the fix is "lost time": it is a bit slower than the original code because we are converting the eager tensors to numpy every time.

PS: Ignore the 01 in the y-axis tick labels. Read it as time from 00 to 4 hours of running the code.

[screenshot: timing plot over a 4-hour run]

spikedoanz commented 4 months ago

> Once again, are you 100% sure this leak wasn't observed with the older TF 2.2?

I'll have to find a way to replicate this test with OG SynthSeg to give you a conclusive answer, since I'm having issues installing SynthSeg in a fresh environment on my end as well.

But for what it's worth, during my benchmarks using SynthSeg for Wirehead, it successfully ran for > 24 hours regularly with no issues

hvgazula commented 4 months ago

> I'll have to find a way to replicate this test with OG SynthSeg to give you a conclusive answer, since I'm having issues installing SynthSeg in a fresh environment on my end as well.
>
> But for what it's worth, during my benchmarks using SynthSeg for Wirehead, it successfully ran for > 24 hours regularly with no issues

Indeed, I too am stuck with environment/GPU issues trying to get this running in TF 2.2 with the OG SynthSeg. So, unfortunately and respectfully, I will have to disregard your finding until I can reproduce and see the issue myself. :)

hvgazula commented 4 months ago

Updated plots with another version: explicit garbage collection (gc.collect()) after image, labels = next(brain_generator) (here).

[screenshots: updated memory plots]

If there are concerns about the green line (CPU memory for the gc version) causing an OOM, you may want to try another gc.collect() after the model.predict call and see how that performs.
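
On the consumer side, the same idea looks roughly like this; a sketch based on spikedoanz's earlier script (inside nobrainer itself, the collect would go at the linked location instead):

import gc

from nobrainer.processing.brain_generator import BrainGenerator

brain_generator = BrainGenerator("example.nii.gz", randomise_res=False)
while True:
    img, lab = brain_generator.generate_brain()
    gc.collect()  # explicit per-sample garbage collection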

spikedoanz commented 4 months ago
[screenshot: 8 generators running in parallel on one A40]

It works!!

Here are 8 generators running in parallel on one A40, using the upstreamed 'synthseg' branch of nobrainer plus a gc.collect() inserted into the generation loop.

CPU memory does leak, but it peaks at 35 GB before cleaning up after itself and going back down to 25 GB.

GPU memory is stable.

hvgazula commented 4 months ago
  1. 35 GB sounds like a lot, but after how many hours/samples was that? And how many times did you observe gc.collect() cleaning up after itself?
  2. Any news on this issue in the OG SynthSeg (TF 2.2)? That's the only thing I really care about now. 😁
hvgazula commented 4 months ago

side note: what timezone is this work being done in 🤔? I see 7:00 pm in the snapshot. :)

spikedoanz commented 4 months ago

> side note: what timezone is this work being done in 🤔? I see 7:00 pm in the snapshot. :)

I don't know, actually. This cluster is supposedly in the same time zone as me, but now I'm unsure lol

spikedoanz commented 4 months ago
>   1. 35 GB sounds like a lot, but after how many hours/samples was that? And how many times did you observe gc.collect() cleaning up after itself?
>   2. Any news on this issue in the OG SynthSeg (TF 2.2)? That's the only thing I really care about now. 😁

  1. Logs are betraying my senses. I'm supposedly getting only 2.6 samples per second with this setup.
  2. I'll get back to you on this once I can actually get it installed; a fresh install according to the repo instructions gives me this wall of errors: synthseg_error.txt
hvgazula commented 4 months ago

Looks like everyone is happy with where things stand with the fix. I am closing this for now.

hvgazula commented 3 months ago

@spikedoanz Benjamin and Eugenio are going to incorporate explicit garbage collection in their code. So, can you please confirm whether you added a second gc.collect(), or whether one in the generate_brain() function, as initially suggested, would suffice? If you added the second one, did you time it so we know whether there is an obvious benefit?

spikedoanz commented 3 months ago

@hvgazula I did some informal measurements with and without the second gc.collect(). "Time diff" here measures the end-to-end time between samples.

without: ~1.25 seconds / sample

Time diff: 1.2285664081573486
1/1 [==============================] - 1s 508ms/step
Time diff: 1.2833566665649414
1/1 [==============================] - 1s 501ms/step
Time diff: 1.2467491626739502
1/1 [==============================] - 1s 503ms/step
Time diff: 1.2522339820861816
1/1 [==============================] - 1s 501ms/step
Time diff: 1.2513880729675293
1/1 [==============================] - 0s 500ms/step

with: ~1.38 seconds / sample

1/1 [==============================] - 1s 508ms/step
Time diff: 1.3852481842041016
1/1 [==============================] - 0s 498ms/step
Time diff: 1.371424913406372
1/1 [==============================] - 1s 504ms/step
Time diff: 1.413076639175415
1/1 [==============================] - 0s 499ms/step
Time diff: 1.3753676414489746
1/1 [==============================] - 1s 504ms/step
Time diff: 1.402883529663086
1/1 [==============================] - 0s 497ms/step

TLDR: about a 10% decrease in throughput, for a bit of extra stability

The first gc.collect() stopped SynthSeg from OOMing overall, and the second gc.collect() decreases memory variance. Since I've been running 8-10 instances of SynthSeg on a single node recently, the extra stability is definitely helpful. (If I remove the second gc.collect(), about half of my SynthSeg instances sometimes die randomly due to sudden memory spikes.)
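
To make "two places" concrete, here is a hedged sketch with a toy generator standing in for nobrainer's (the real placements are inside the library's generator, right after the model call, and in the consumer loop):

import gc

import numpy as np
import tensorflow as tf

# Toy stand-in for the labels-to-image model (illustrative only).
inputs = tf.keras.Input(shape=(4,))
toy_model = tf.keras.Model(inputs, tf.keras.layers.Dense(4)(inputs))

def toy_generator(n_samples):
    for _ in range(n_samples):
        out = toy_model(np.random.rand(1, 4).astype("float32"))
        gc.collect()  # first collect: inside the generator, after the model call
        yield out.numpy()

for sample in toy_generator(5):
    gc.collect()  # second collect: in the consumer loop, once per sample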

hvgazula commented 3 months ago

Okay, I will go ahead with your recommendation then: gc.collect() at two different places.