Possible Memory Leak or Excessive RAM Usage Error in batch model of colabfold_batch Locally

lshilab commented 8 months ago

Dear Developers,

I am experiencing an issue with the colabfold_batch tool when processing a batch of protein sequences LOCALLY, each around 600 amino acids long. Despite having a system (256GB RAM and 24GB GPU memory), the tool appears to be continuously consuming memory and eventually fails due to what seems like a memory-related error.

Current Behavior

When executing colabfold_batch for a batch of protein sequences, the process starts but gradually accumulates memory usage until it crashes. It seems that the application is not efficiently releasing memory between individual runs or there might be a slow-compilation problem causing excessive memory consumption. I've noticed that the RAM (256GB RAM for me) usage continuously increases over time, even though the GPU memory appears stable. The sequences being processed are relatively moderate in size, which leads me to believe that it's the RAM that cause crash.

Error Log: The error messages received during execution point towards slow compilation processes:

E external/org_tensorflow/tensorflow/compiler/xla/service/slow_operation_alarm.cc:65] 
********************************
[Compiling module jit_apply_fn] Very slow compile?  If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
********************************
E external/org_tensorflow/tensorflow/compiler/xla/service/slow_operation_alarm.cc:133] The operation took 2m23.006026321s

********************************
[Compiling module jit_apply_fn] Very slow compile?  If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
********************************

These warnings suggest that the compiler is taking an unexpectedly long time, which may contribute to the excessive RAM consumption.

System Configuration & Observations:

My workstation has 256GB of RAM, ensuring ample capacity for handling multiple instances concurrently.
The GPU has 24GB of memory, which should be sufficient for typical protein folding tasks (600AA this time).
During the runs, it is clear that the growth in memory usage is attributed to the system (CPU) RAM rather than the GPU memory.

Request for Assistance: Is there any configuration setting or optimization method that can be applied to colabfold_batch to allow it to run indefinitely without exhausting system memory? Given the size of my hardware resources, I'm looking for guidance on how to prevent memory leaks or optimize the memory allocation strategy for these medium-sized predictions.

Thank you for considering this issue and providing support.

Best regards lshi

milot-mirdita commented 8 months ago

What GPU are you using? The warnings looks like they might come from a Volta or Kepler GPU.

Can you try the --disable-unified-memory flag? That might solve your issue.

You might also have an inconsistent set of cuda libraries installed. You can try installing Colabfold from bioconda to ensure it uses a consistent set of dependencies:

conda create -n colabfold -c conda-forge -c bioconda colabfold jaxlib=*=cuda* nvidia::cuda-nvcc

lshilab commented 8 months ago

What GPU are you using? The warnings looks like they might come from a Volta or Kepler GPU.

I would like to thank you for your prompt response and valuable suggestions. The GPU currently employed is a Tesla P40, which boasts 24GB of video memory and operates on the Pascal architecture. Despite successful runs, I encounter an issue where after processing approximately 3000 predictions, the program terminates due to exhausting system RAM.

Can you try the --disable-unified-memory flag? That might solve your issue.

I am unsure how to implement this setting. My online research did not yield a clear method, and I wonder if it should be set via an environment variable such as export XLA_FLAGS.

You might also have an inconsistent set of cuda libraries installed. You can try installing Colabfold from bioconda to ensure it uses a consistent set of dependencies:
conda create -n colabfold -c conda-forge -c bioconda colabfold jaxlib=*=cuda* nvidia::cuda-nvcc
I will give this installation a try. However, before proceeding, I am keen to explore the --disable-unified-memory configuration first. Could you kindly provide more specific guidance on how to set the --disable-unified-memory flag? I try to find out cuda libraries installed in my computer, is it as follows?
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

ldconfig -p | grep cuda libicudata.so.70 (libc6,x86-64) => /lib/x86_64-linux-gnu/libicudata.so.70 libcudadebugger.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudadebugger.so.1 libcuda.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so.1 libcuda.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so



I am concerned about not having articulated the issue adequately. Allow me to reiterate: Based on my observations, the GPU memory usage remains quite low despite running sequences of approximately 600 amino acids in length using colabfold_batch in a batch mode. I assume that there should be a mechanism in place to clear the GPU memory after each prediction, and indeed,  I have not seen any instances where the GPU memory is fully utilized or exhausted during these batch runs.

However, a contrasting pattern emerges with system memory consumption; it gradually increases as the batch processing continues, eventually leading to exhaustion. The reason for this persistent accumulation of system RAM usage is unclear to me, and seemingly, there seems to be no inherent mechanism in operation to release or reset the system memory following each prediction cycle.

Regarding the error message previously mentioned, it has only manifested once thus far, while other errors have not been logged in the error messages (when running via nohup). Please accept my apologies for any lack of technical depth, and I sincerely appreciate your provision of such an excellent software tool.

milot-mirdita commented 8 months ago

The --disable-unified-memory parameter is part of colabfold_batch

You can call:

colabfold_batch --disable-unified-memory other-parameters...

We haven't tested on GPUs older than Turing/Volta in a while. This might be a Pascal specific issue that we probably cannot solve.

dthorburn commented 4 months ago

I've noticed this as well on GCP VMs with colabfold 1.5.2 (356454672e20cdfa4d3692156dbc70cdb323fdbe). This is true for both Nvidia T4, L4s, and a local RTX 4090 (latter 2 are Ada Lovelace architechture).

I tested adding --disable-unified-memory to an L4 job, and it did complete 346 complexes with lengths ranging from 900-2200 AAs. Whilst the memory did perform a lot better in this job compared to others without the parameter, the free and available memory generally trended towards 0 throughout the ~70 hours it was running. Again, the parameter appears to have helped significantly, but there does still appear to be a memory leak with Ada Lovelace architecture.

I'll eventually get around to updating my VM images to a more recent version of colabfold and will test more instances and report back if there are different results.

sokrypton / ColabFold

Possible Memory Leak or Excessive RAM Usage Error in batch model of colabfold_batch Locally #584

Current Behavior