idmjky / EvolvePro

PLM-based active learning model for protein engineering

memory issues #6

Closed kenneditodd closed 2 months ago

kenneditodd commented 2 months ago

Hello,

I am trying to run this model on a cluster with no internet access. I cached the pretrained ESM2 model in advance and changed the model_location argument to point to the cached model. However, I am now having trouble running the model on our cluster: I keep getting the error message below when I run the Extract_15.sh script.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 400.00 MiB. GPU 0 has a total capacity of 31.73 GiB of which 176.69 MiB is free. Including non-PyTorch memory, this process has 31.56 GiB memory in use. Of the allocated memory 31.26 GiB is allocated by PyTorch, and 558.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
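For reference, a minimal sketch of the two generic mitigations the message hints at, setting the allocator flag and loading the weights in half precision, assuming the fair-esm loader and a placeholder checkpoint path; neither is guaranteed to be enough for a model of this size on a 32 GB card:

```python
# Sketch only: allocator flag plus fp16 loading. The checkpoint path is a placeholder.
import os

# Must be set before the first CUDA allocation to have any effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch
import esm

model, alphabet = esm.pretrained.load_model_and_alphabet_local("/path/to/cached/esm2.pt")
model.eval().half().cuda()   # fp16 roughly halves the weight footprint

with torch.no_grad():        # inference only, so no autograd buffers
    pass                     # ... run the embedding extraction loop here ...
```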

Our cluster uses Nvidia Tesla V100 GPUs. There are 4 GPUs per node, and each GPU has 32 GB of memory. This is the Slurm header I used for the Extract_15.sh script:

#!/bin/bash
#SBATCH --time=48:00:00 
#SBATCH --job-name=means
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:tesla_v100:4
#SBATCH --nodelist=mforgegpu1,mforgegpu2
#SBATCH --mem=200GB
#SBATCH --exclusive
#SBATCH --output logs/%x.%j.stdout
#SBATCH --error logs/%x.%j.stderr

Is the code in this repo able to run on multiple GPUs? Do you have any suggested fixes? I have worked with our IT team and we haven't been able to figure it out.

idmjky commented 2 months ago

Hi, the code in the repo is designed for a single GPU only. You will need to modify the embedding-extraction Python files to make them compatible with multi-GPU execution.
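One possible pattern, offered only as a sketch: shard the sequences and run one extraction worker per GPU. This assumes the fair-esm API, fp16 weights, and a placeholder checkpoint path, layer index, and sequence list; EvolvePro's own FASTA handling and output format would still need to be wired in.

```python
# Sketch: one extraction worker per GPU, each handling a shard of sequences.
# Checkpoint path, layer index, and the toy sequence list are placeholders.
import torch
import torch.multiprocessing as mp
import esm


def extract_shard(rank, shards, model_path, repr_layer):
    device = torch.device(f"cuda:{rank}")
    # Load the cached checkpoint inside each worker process.
    model, alphabet = esm.pretrained.load_model_and_alphabet_local(model_path)
    model.eval().half().to(device)
    batch_converter = alphabet.get_batch_converter()

    for name, seq in shards[rank]:
        _, _, tokens = batch_converter([(name, seq)])
        with torch.no_grad():
            out = model(tokens.to(device), repr_layers=[repr_layer])
        # Mean-pool over residue positions (skipping BOS/EOS) to get one vector per sequence.
        emb = out["representations"][repr_layer][0, 1 : len(seq) + 1].mean(dim=0).cpu()
        torch.save(emb, f"{name}_gpu{rank}.pt")


if __name__ == "__main__":
    records = [("wt", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]  # replace with real variants
    n_gpus = torch.cuda.device_count()
    shards = [records[i::n_gpus] for i in range(n_gpus)]      # round-robin split
    mp.spawn(extract_shard,
             args=(shards, "/path/to/cached/esm2.pt", 48),
             nprocs=n_gpus)
```

Loading the model separately in each spawned process avoids sharing CUDA state across workers; note that this still requires the checkpoint to fit on a single 32 GB card per process.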