thammegowda / nllb-serve

Meta's "No Language Left Behind" models served as web app and REST API
http://rtg.isi.edu/nllb/

CUDA out of memory #5

Closed · saitanay closed this 1 year ago

saitanay commented 1 year ago

Running on a decently sized machine, using the REST API (not batch mode).

NVIDIA 2x T4 (32 GB GPU total), 24 cores, 100 GB RAM

Running into the error below once in a while. Initially it only happened with the larger 3.3B model, but lately it seems to happen with the 1.3B (undistilled) model as well.

  File "/usr/local/lib/python3.10/dist-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 229, in _shape
    return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 14.75 GiB total capacity; 13.96 GiB already allocated; 28.81 MiB free; 14.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
INFO:werkzeug:206.189.138.109 - - [19/May/2023 10:44:52] "POST /translate HTTP/1.1" 500 -

What have I tried?

Tried setting TORCH_MAX_SPLIT_SIZE_MB to lower values, but it did not make any difference:

  export TORCH_MAX_SPLIT_SIZE_MB=512
  export TORCH_MAX_SPLIT_SIZE_MB=256
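
(Possibly I was setting the wrong variable: the error message points at PYTORCH_CUDA_ALLOC_CONF, so presumably the option would be passed as something like `export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:256` before launching the server, but I haven't verified whether that helps here.)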

FWIW, the machine doesn't host anything else, just this. I run it with either

  nohup nllb-serve -p 80 -mi facebook/nllb-200-1.3B > nllb.log 2>&1 &

or

  nohup nllb-serve -p 80 -mi facebook/nllb-200-3.3B > nllb.log 2>&1 &

thammegowda commented 1 year ago

It looks like you have 2x T4 GPUs, each with 16 GB RAM, and 16 GB isn't enough for the 1.3B param model, especially for long sentences. The same goes for the 3.3B param model. You may get lucky with short sentences once in a while, but when the input is long you will see CUDA out of memory. The current code is NOT designed to split the model and input across two GPUs; it uses a single GPU's memory.
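
As a rough back-of-the-envelope check (assuming the weights are loaded in fp32, i.e. 4 bytes per parameter): 1.3B params is about 5.2 GB and 3.3B is about 13.2 GB for the weights alone, so on a 16 GB T4 the 3.3B model barely fits before any activations, and the 1.3B model leaves headroom only for modest batch and sequence lengths.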

So here are 4 options:

  1. Try the smaller model, i.e. the 600M-parameter one. It's already the default.
  2. Use one big GPU instead of 2x smaller GPUs. E.g. the V100 has 32 GB, the RTX 3090/4090 have 24 GB, the A6000 has 48 GB, and the A100 has 40 GB, to name a few.
  3. Lower the maximum source sentence length, e.g. to 100 tokens; currently the default is 256 (see the sketch after this list). https://github.com/thammegowda/nllb-serve/blob/024f703bb6e3f2ebe59f39cbe7f080e052ab0b80/nllb_serve/app.py#L166-L167
  4. Edit the code to make this work on multiple GPUs. (Sorry, this is not a priority for me and it can be more time consuming.) Pull requests are welcome. https://github.com/thammegowda/nllb-serve/blob/024f703bb6e3f2ebe59f39cbe7f080e052ab0b80/nllb_serve/app.py#L138-L140
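
For options 3 and 4, here is a rough sketch of what I mean. This is not the nllb-serve code: it goes through the plain transformers API, the model and language codes are only examples, and device_map="auto" assumes the accelerate package is installed.

```python
# Rough sketch, not the current nllb-serve code.
# Option 3: truncate the source at tokenization time.
# Option 4: let accelerate shard the model across the visible GPUs.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "facebook/nllb-200-1.3B"          # example model
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")

# device_map="auto" needs `pip install accelerate`; it spreads layers over the
# available GPUs, unlike the single-GPU placement the server does today.
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto")

text = "A very long input sentence ..."
# Cap the encoder input at ~100 tokens so activation memory stays bounded.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=100)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

out = model.generate(
    **inputs,
    # NLLB needs the target language token forced as the first decoder token.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_new_tokens=256,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```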
saitanay commented 1 year ago

Thanks a lot. Will try these out.