idmjky / EvolvePro

PLM-based active learning model for protein engineering

memory issues #6

Closed kenneditodd closed 2 months ago

kenneditodd commented 2 months ago

Hello,

I am trying to run this model on a cluster with no internet access. I cached the pretrained ESM2 model in advance and changed the model_location argument to point to the cached model. However, I am now having trouble running the model on our cluster: I keep getting the error message below when I run the Extract_15.sh script.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 400.00 MiB. GPU 0 has a total capacity of 31.73 GiB of which 176.69 MiB is free. Including non-PyTorch memory, this process has 31.56 GiB memory in use. Of the allocated memory 31.26 GiB is allocated by PyTorch, and 558.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
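For reference, a minimal sketch of the two generic mitigations the message hints at, setting the allocator flag and loading the weights in half precision, assuming the fair-esm loader and a placeholder checkpoint path; neither is guaranteed to be enough for a model of this size on a 32 GB card:

```python
# Sketch only: allocator flag plus fp16 loading. The checkpoint path is a placeholder.
import os

# Must be set before the first CUDA allocation to have any effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch
import esm

model, alphabet = esm.pretrained.load_model_and_alphabet_local("/path/to/cached/esm2.pt")
model.eval().half().cuda()   # fp16 roughly halves the weight footprint

with torch.no_grad():        # inference only, so no autograd buffers
    pass                     # ... run the embedding extraction loop here ...
```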

Our cluster uses Nvidia Tesla V100 GPUs. There are 4 GPUs per node, and each GPU has 32 GB of memory. This is the Slurm header I used for the Extract_15.sh script:

#!/bin/bash
#SBATCH --time=48:00:00 
#SBATCH --job-name=means
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:tesla_v100:4
#SBATCH --nodelist=mforgegpu1,mforgegpu2
#SBATCH --mem=200GB
#SBATCH --exclusive
#SBATCH --output logs/%x.%j.stdout
#SBATCH --error logs/%x.%j.stderr

Is the code in this repo able to run on multiple GPUs? Do you have any suggested fixes? I have worked with our IT team and we haven't been able to figure it out.

idmjky commented 2 months ago

Hi, the code in the repo is designed for a single GPU only. You will need to modify the embedding-extraction Python files to make them compatible with multi-GPU execution.
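One possible pattern, offered only as a sketch: shard the sequences and run one extraction worker per GPU. This assumes the fair-esm API, fp16 weights, and a placeholder checkpoint path, layer index, and sequence list; EvolvePro's own FASTA handling and output format would still need to be wired in.

```python
# Sketch: one extraction worker per GPU, each handling a shard of sequences.
# Checkpoint path, layer index, and the toy sequence list are placeholders.
import torch
import torch.multiprocessing as mp
import esm


def extract_shard(rank, shards, model_path, repr_layer):
    device = torch.device(f"cuda:{rank}")
    # Load the cached checkpoint inside each worker process.
    model, alphabet = esm.pretrained.load_model_and_alphabet_local(model_path)
    model.eval().half().to(device)
    batch_converter = alphabet.get_batch_converter()

    for name, seq in shards[rank]:
        _, _, tokens = batch_converter([(name, seq)])
        with torch.no_grad():
            out = model(tokens.to(device), repr_layers=[repr_layer])
        # Mean-pool over residue positions (skipping BOS/EOS) to get one vector per sequence.
        emb = out["representations"][repr_layer][0, 1 : len(seq) + 1].mean(dim=0).cpu()
        torch.save(emb, f"{name}_gpu{rank}.pt")


if __name__ == "__main__":
    records = [("wt", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]  # replace with real variants
    n_gpus = torch.cuda.device_count()
    shards = [records[i::n_gpus] for i in range(n_gpus)]      # round-robin split
    mp.spawn(extract_shard,
             args=(shards, "/path/to/cached/esm2.pt", 48),
             nprocs=n_gpus)
```

Loading the model separately in each spawned process avoids sharing CUDA state across workers; note that this still requires the checkpoint to fit on a single 32 GB card per process.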