triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Low Throughput due to futex contention causing large Wait times #7041

Open nubol23 opened 5 months ago

nubol23 commented 5 months ago

Description

We have three Triton containers, each serving a different set of TensorFlow models grouped by use case; call them containers A (76 models), B (22 models), and C (24 models). Containers A and C deliver good, expected performance of ~7k queries per second (QPS), but container B has a lower-than-expected throughput of ~2-3k QPS.

We are running on a Kubernetes cluster of c7i.8xlarge instances, each with 32 CPUs and 64 GiB of memory, of which the container uses only ~7 CPUs.

We ran a benchmark using K6 with a Go client and profiled the containers with Nsight Systems (nsys). In the benchmark we round-robin requests across all models in the Triton pod, as sketched below.
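For illustration, here is a minimal sketch of such a round-robin client against Triton's HTTP/REST (KServe v2) inference endpoint. The model names, input name/shape/datatype, server address, and request count are placeholders that would have to match the actual model configurations; the real benchmark drives many concurrent workers through K6 and a Go client, while this single loop only shows the round-robin scheduling.

// round_robin_client.go: minimal sketch of a round-robin inference load loop.
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

// Hypothetical model names; the real pods serve the private model sets.
var models = []string{"model_a", "model_b", "model_c"}

// buildRequest builds a KServe v2 inference request body with a single FP32
// input. The input name, shape, and data are placeholders and must match the
// model's config.pbtxt.
func buildRequest() []byte {
	return []byte(`{"inputs":[{"name":"input_1","shape":[1,4],"datatype":"FP32","data":[0.1,0.2,0.3,0.4]}]}`)
}

func main() {
	client := &http.Client{}
	for i := 0; i < 1000; i++ {
		// Round-robin across all models hosted in the Triton pod.
		model := models[i%len(models)]
		url := fmt.Sprintf("http://localhost:8000/v2/models/%s/infer", model)
		resp, err := client.Post(url, "application/json", bytes.NewReader(buildRequest()))
		if err != nil {
			log.Fatal(err)
		}
		io.Copy(io.Discard, resp.Body) // drain the body so the connection is reused
		resp.Body.Close()
	}
}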

Exploring the thread timelines, we see a lot of blocking nsync lock/wait calls:

libc.so.6!syscall
libtensorflow_framework.so.2!nsync::nsync_mu_semaphore_p_with_deadline(...)
libtensorflow_framework.so.2!nsync::nsync_cv_wait_with_deadline_generic(...)

These calls are much more frequent and much longer in container B than in the other containers, on the order of 100-200 ms versus microseconds, and the wait times are visible as gaps in the thread timeline: [screenshot from 2024-03-25 17-43-13: thread timeline for container B]. (The high CPU utilization at the start is due to TensorFlow's lazy loading; after that, the average utilization drops as observed.)

We also noticed that the average CPU utilization is very low (~15%) on all containers: [screenshot from 2024-03-26 16-04-06: CPU utilization]

Note that even though container B hosts fewer models than the other containers, it performs worse, with lower QPS. We attribute this to the low CPU utilization and long wait times, but we don't know how to reduce the lock times and make sure the models use all the resources available in our Triton pods.

Here I also attach a screenshot of the timeline for container A (we see similar results for container C): [screenshot 2024-03-25 at 17 43 51: thread timeline for container A]. It also shows low CPU utilization, but it doesn't have those big gaps caused by nsync locks and, as described above, it performs much better.

Also, is there a recommended or maximum number of models per container?

Triton Information

To Reproduce

We can't provide a way to reproduce this because the models are private, but let us know what you would need and we can work on reproducible scripts with public models.

We use TensorFlow-based models with a very basic configuration:

# config.pbtxt used for each model
backend: "tensorflow"
platform: "tensorflow_savedmodel"
max_batch_size: 16
# Dynamic batching enabled with default settings (no queue delay configured).
dynamic_batching {
}
# Three CPU model instances per model.
instance_group {
  count: 3
  kind: KIND_CPU
}

We also ran with some variations of the TensorFlow intra-op and inter-op parallelism values using a simple grid search (see the sketch below), but throughput didn't improve at all.
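For reference, a minimal sketch of how such a grid could be generated, assuming the values are set through the TensorFlow backend's TF_NUM_INTRA_THREADS / TF_NUM_INTER_THREADS model-config parameters (please check the tensorflow_backend documentation for the exact keys and accepted values). The grid values are placeholders; redeploying each model and re-running the benchmark per trial happens outside this sketch.

// intra_inter_grid.go: emits the parameter blocks to append to config.pbtxt
// for each (intra, inter) combination of the grid search.
package main

import "fmt"

func main() {
	// Placeholder grid values.
	intraValues := []int{1, 2, 4, 8}
	interValues := []int{1, 2, 4}
	for _, intra := range intraValues {
		for _, inter := range interValues {
			fmt.Printf("# trial: intra=%d inter=%d\n", intra, inter)
			fmt.Printf("parameters { key: \"TF_NUM_INTRA_THREADS\" value: { string_value: \"%d\" } }\n", intra)
			fmt.Printf("parameters { key: \"TF_NUM_INTER_THREADS\" value: { string_value: \"%d\" } }\n\n", inter)
		}
	}
}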

Expected behavior

We want to maximize CPU utilization and reduce waiting times during inference. Even a way to debug this further would be very helpful.

pvijayakrish commented 5 months ago

@debermudez is this something the tools team can help with?

nubol23 commented 4 months ago

We computed some metrics for the time spent in different futex operations and found that a lot of time is spent in mu_locks and semaphores during TensorFlow deallocations (a sketch of one way to gather such numbers is below). Does this help shed some light on the problem, @pvijayakrish @debermudez?
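For context, a minimal sketch of one way to aggregate per-operation futex time, assuming a trace captured with strace -f -T -e trace=futex -p <triton_pid> (the -T flag appends each call's duration in seconds); it reads the trace from stdin and is only an illustration, not necessarily the exact tooling used. Note that strace -c gives a per-syscall summary but not the per-operation (FUTEX_WAIT vs. FUTEX_WAKE) split, which is why the raw -T output is parsed here.

// futex_times.go: sums strace-reported durations per futex operation.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// Matches e.g.: 1234 futex(0x7f0123abcd, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 <0.000123>
var futexLine = regexp.MustCompile(`futex\(\w+, (FUTEX_\w+)[^<]*<([0-9.]+)>`)

func main() {
	totals := map[string]float64{}
	counts := map[string]int{}
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		m := futexLine.FindStringSubmatch(sc.Text())
		if m == nil {
			continue // skip unfinished/resumed and non-futex lines
		}
		secs, _ := strconv.ParseFloat(m[2], 64)
		totals[m[1]] += secs
		counts[m[1]]++
	}
	for op, total := range totals {
		fmt.Printf("%-24s calls=%-8d total=%.3fs\n", op, counts[op], total)
	}
}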

debermudez commented 4 months ago

@nubol23 this is helpful, yes, but probably not in the tools team's domain. Let's loop in the server folks and see if we can get to the bottom of this. @Tabrizian do you know who the right contact for this is?

AshwinAmbal commented 4 months ago

cc: @tanmayv25