Hello,

I am trying to run this model on a cluster with no internet access. I cached the pretrained ESM2 model in advance and changed the model_location argument to point to the cached model. However, I am now having trouble running the model on our cluster. I keep getting the error message below when I run the Extract_15.sh script:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 400.00 MiB. GPU 0 has a total capacity of 31.73 GiB of which 176.69 MiB is free. Including non-PyTorch memory, this process has 31.56 GiB memory in use. Of the allocated memory 31.26 GiB is allocated by PyTorch, and 558.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
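For what it's worth, my understanding of the PYTORCH_CUDA_ALLOC_CONF suggestion in that message is that it would be exported in the job script before the extraction call, roughly like the sketch below (the python line is only a placeholder, not the actual contents of Extract_15.sh):

```bash
# Sketch only: set the allocator option suggested by the OOM message
# before launching the extraction step.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Placeholder for whatever Extract_15.sh actually invokes.
python extract.py ...
```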
Our cluster uses NVIDIA Tesla V100 GPUs, with 4 GPUs per node and 32 GB of memory per GPU. This was the Slurm header I used for the Extract_15.sh script.
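I have not pasted the exact header here, but it requested a single V100 along these lines (the partition name, CPU count, memory, and time limit below are placeholders rather than the exact values from my script):

```bash
#!/bin/bash
#SBATCH --job-name=extract_esm2
#SBATCH --partition=gpu          # placeholder partition name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4        # placeholder CPU count
#SBATCH --gres=gpu:v100:1        # one of the four 32 GB V100s on a node
#SBATCH --mem=64G                # placeholder host memory request
#SBATCH --time=24:00:00          # placeholder time limit
```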
Is the code in this repo compatible with multiple GPUs? Do you have any suggested fixes? I have worked with our IT team and we haven't been able to figure it out.