It looks like you have 2x T4 GPUs, each with 16 GB of RAM, and 16 GB isn't enough for the 1.3B-param model, especially for long sentences. The same goes for the 3.3B-param model. You may get lucky with short sentences once in a while, but when the input is long you will see CUDA out of memory. The current code is NOT designed to split the model and input across two GPUs; it uses one GPU's memory.
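For context only (this is a hedged sketch, not something nllb-serve does out of the box, and not one of the options below): with the Hugging Face transformers and accelerate libraries, the fp16 weights can be sharded across both 16 GB T4s via device_map="auto", roughly like this. The model id and language codes are illustrative.

```python
# Hedged sketch, assuming the Hugging Face transformers + accelerate libraries;
# this is NOT what nllb-serve does today. It shards the fp16 weights across
# both 16 GB T4s instead of keeping everything on a single GPU.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "facebook/nllb-200-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 halves the footprint vs fp32
    device_map="auto",          # accelerate spreads layers over both GPUs
)

# Inputs go to the first GPU; accelerate moves activations between devices.
inputs = tokenizer("A short test sentence.", return_tensors="pt").to("cuda:0")
out = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_new_tokens=64,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```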
So here are 4 options:
Thanks a lot. Will try out these.
Running on a decent-sized machine, using the REST API (not batch mode).
NVIDIA: 2x T4 GPUs (32 GB total), 24 cores, 100 GB RAM
Running into a CUDA out-of-memory error once in a while. Initially it only happened with the larger 3.3B model, but lately it seems to be happening with the 1.3B (undistilled) model as well.
What have I tried?
Tried setting TORCH_MAX_SPLIT_SIZE_MB to lower values, but it did not make any difference:
export TORCH_MAX_SPLIT_SIZE_MB=512
export TORCH_MAX_SPLIT_SIZE_MB=256
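(Side note: PyTorch's caching allocator documents this knob as max_split_size_mb inside the PYTORCH_CUDA_ALLOC_CONF environment variable rather than as a standalone variable, which may be why the exports above had no visible effect. A minimal sketch of setting it from Python before torch initialises CUDA, with 256 as an example value only:)

```python
# Hedged sketch: PyTorch's caching allocator reads max_split_size_mb from the
# PYTORCH_CUDA_ALLOC_CONF environment variable, and it must be set before the
# process initialises CUDA. The shell equivalent would be
#   export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:256
# before starting nllb-serve. The value 256 is only an example.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:256")

import torch  # import torch only after the allocator config is in place

print(torch.cuda.is_available())  # CUDA init now picks up the setting
```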
FWIW, the machine doesn't host anything else, just this. I run it with
nohup nllb-serve -p 80 -mi facebook/nllb-200-1.3B > nllb.log 2>&1 &
or
nohup nllb-serve -p 80 -mi facebook/nllb-200-3.3B > nllb.log 2>&1 &
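To see how much of each T4 is actually in use while a long request is being translated, a small diagnostic like the following can be run from a separate shell (just a sketch, not part of nllb-serve; watching nvidia-smi shows the same numbers):

```python
# Hedged diagnostic sketch (not part of nllb-serve): report used/total memory
# on every visible GPU. cudaMemGetInfo is device-wide, so running this in a
# separate process still reflects what the server process has allocated.
import torch

for idx in range(torch.cuda.device_count()):
    free_b, total_b = torch.cuda.mem_get_info(idx)
    used_gb = (total_b - free_b) / 1e9
    print(
        f"GPU {idx} ({torch.cuda.get_device_name(idx)}): "
        f"{used_gb:.1f} GB used / {total_b / 1e9:.1f} GB total"
    )
```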