pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

throughput increase non-linearly with number of workers #3338

Open vandesa003 opened 1 month ago

vandesa003 commented 1 month ago

🐛 Describe the bug

I am hosting a BERT-like model using the torchserve config below.

inference_address=http://localhost:8080
management_address=http://localhost:8081
metrics_address=http://localhost:8082
load_models=model_name=weights.mar
async_logging=true
job_queue_size=200

models={ "model_name": {  "1.0": { "minWorkers": 8 , "batchSize": 8 , "maxBatchDelay": 10  }  }  }

I have 8 GPUs, so this setting gives me 1 worker per GPU.

Then I ran load tests with both k6 and locust; the chart below shows the relationship between the number of workers (from 1 to 8) and throughput.

[chart: throughput vs. number of workers]

As the chart shows, GPU usage drops as the number of workers increases, so it feels like the load balancer in torchserve is introducing the inefficiency. Can anyone give me some clues on how to improve throughput further?
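One way to sanity-check whether the load generator can keep the server busy is Little's law (in-flight requests ≈ throughput × latency): with 8 workers and batchSize 8, every worker can only form full batches if at least 64 requests are in flight. A minimal back-of-envelope sketch (the 25 ms per-batch latency below is an illustrative assumption, not a number measured in this issue):

```python
# Back-of-envelope capacity check via Little's law:
# concurrency = throughput x latency.

def required_concurrency(num_workers: int, batch_size: int) -> int:
    """Minimum in-flight requests so every worker can fill a full batch."""
    return num_workers * batch_size

def ideal_throughput(num_workers: int, batch_size: int, batch_latency_s: float) -> float:
    """Requests/second if every worker always runs full batches back-to-back."""
    return num_workers * batch_size / batch_latency_s

# With this issue's config: 8 workers, batchSize 8.
print(required_concurrency(8, 8))      # 64 in-flight requests to fill all batches
# Illustrative 25 ms forward pass per batch (assumed, not measured):
print(ideal_throughput(8, 8, 0.025))   # 2560.0 req/s upper bound
```

If the measured throughput sits well below this kind of upper bound even at 64+ concurrent connections, the bottleneck is likely server-side rather than in the load generator.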

Error logs

throughput increase non-linearly with number of workers

Installation instructions

torchserve = "^0.10.0"

Model Packaging

torchserve = "0.10.0"

config.properties

inference_address=http://localhost:8080
management_address=http://localhost:8081
metrics_address=http://localhost:8082
load_models=model_name=weights.mar
async_logging=true
job_queue_size=200

models={ "model_name": { "1.0": { "minWorkers": 8 , "batchSize": 8 , "maxBatchDelay": 10 } } }

Versions

$ python serve/ts_scripts/print_env_info.py

Environment headers

Torchserve branch:

torchserve==0.10.0
torch-model-archiver==0.11.0

Python version: 3.11 (64-bit runtime)
Python executable: /home/me/.cache/pypoetry/virtualenvs/pre-deploy-j4GApv9r-py3.11/bin/python

Versions of relevant python libraries:
numpy==1.24.3
nvgpu==0.10.0
pillow==10.4.0
psutil==6.0.0
requests==2.32.3
torch==2.3.1+cu121
torch-model-archiver==0.11.0
torch_tensorrt==2.3.0+cu121
torchserve==0.10.0
torchvision==0.18.1
transformers==4.44.2
wheel==0.44.0
Warning: torchtext not present ..
Warning: torchaudio not present ..

Java Version:

OS: Debian GNU/Linux 12 (bookworm)
GCC version: (Debian 12.2.0-14) 12.2.0
Clang version: 14.0.6
CMake version: version 3.25.1

Is CUDA available: Yes
CUDA runtime version: N/A
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB
Nvidia driver version: 550.54.15
cuDNN version: None

Environment:
library_path (LD_/DYLD_):

Repro instructions

wget http://mar_file.mar
torch-model-archiver ...
torchserve --start

Possible Solution

No response

mreso commented 1 month ago

Hi @vandesa003, on the client (k6/locust) side, how many concurrent users/connections do you allow? It looks a bit like you're not sending enough requests to the server and the GPUs are just idling.

vandesa003 commented 1 month ago

hi @mreso, thanks for your reply! I had the same feeling at the beginning, so I tried different numbers of concurrent users during my experiments, from 40 to 200, and there was no difference in final throughput (with more requests, k6 just produces more failed responses). I also tuned the queue size, which did not help throughput either, only added latency.
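For what it's worth, a simplified model of batch filling (assuming in-flight requests spread evenly over workers, which is not necessarily TorchServe's actual dispatch behavior) suggests that 40 users should under-fill the batches while 200 should fill them, so if both levels yield the same throughput, batching is probably not the limiting factor:

```python
# Simplified batch-fill model: in-flight requests assumed spread evenly
# over workers (an assumption, not TorchServe's actual scheduler).

def effective_batch_size(concurrent_requests: int, num_workers: int, batch_size: int) -> float:
    """Average requests per dispatched batch under even spreading (capped at batchSize)."""
    per_worker = concurrent_requests / num_workers
    return min(per_worker, batch_size)

# 40 concurrent users over 8 workers with batchSize 8:
print(effective_batch_size(40, 8, 8))   # 5.0 -> batches run ~62% full
# 200 concurrent users saturate batching:
print(effective_batch_size(200, 8, 8))  # 8
```

Under this model, going from 40 to 200 users should have raised throughput noticeably if partial batches were the bottleneck, which points the investigation toward the frontend or handler instead.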