triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Route request to model instance running on specified GPU device id #5564

Open Hao-Yan opened 1 year ago

Hao-Yan commented 1 year ago

In our pipeline, a GPU decoder decodes video on the GPU, the decoded raw images stay in GPU memory, and we pass them to Triton via CUDA shared memory for inference. The issue is that on a machine with multiple GPUs, our model has one instance on each GPU. When a frame is decoded on GPU 0 but the request is dispatched to the model instance running on GPU 1, the data has to be moved back and forth between GPUs, which hurts performance.

NOTE: we have our own CUDA shared memory library and we use it in the Python backend to fetch the image data for inference. If we used the Triton API to register CUDA shared memory instead, would it work for this case?
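For context, the client-side registration we are asking about looks roughly like the sketch below (model/tensor names, shapes, and the device id are placeholders; note that registering the region on a given device does not by itself control which GPU's model instance serves the request, which is exactly the problem):

```python
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

MODEL_NAME = "my_model"     # placeholder
INPUT_NAME = "INPUT__0"     # placeholder
DEVICE_ID = 0               # GPU that holds the decoded frame
frame = np.zeros((1, 3, 720, 1280), dtype=np.float32)
byte_size = frame.nbytes

client = httpclient.InferenceServerClient("localhost:8000")

# Allocate a CUDA shared memory region on DEVICE_ID and register it with Triton.
shm_handle = cudashm.create_shared_memory_region("frame_shm", byte_size, DEVICE_ID)
cudashm.set_shared_memory_region(shm_handle, [frame])
client.register_cuda_shared_memory(
    "frame_shm", cudashm.get_raw_handle(shm_handle), DEVICE_ID, byte_size
)

# Point the inference input at the registered region instead of sending raw bytes.
infer_input = httpclient.InferInput(INPUT_NAME, list(frame.shape), "FP32")
infer_input.set_shared_memory("frame_shm", byte_size)
result = client.infer(MODEL_NAME, inputs=[infer_input])

client.unregister_cuda_shared_memory("frame_shm")
cudashm.destroy_shared_memory_region(shm_handle)
```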

dyastremsky commented 1 year ago

Not as far as I know. Depending on your needs, you could probably get away with loading the model multiple times with the GPU specified in the config for each model. Then, if you had a naming convention that allowed you to know which model name is on which GPU (e.g. model_0 on GPU 0), your client could route to the correct GPU. Since you'll need a copy on each GPU anyway, I don't believe this would affect memory usage much.
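As a rough sketch of that workaround (the model names and directory layout here are illustrative, not a Triton convention), each copy of the model pins its instances to one GPU in its config.pbtxt, e.g. for GPU 0:

```
# models/model_0/config.pbtxt (illustrative)
name: "model_0"
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```

and similarly "model_1" with gpus: [ 1 ]. The client then picks the model name that matches the GPU holding the decoded frame.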

CC: @Tabrizian @GuanLuo in case they have additional guidance on how to specify the specific GPU for a request or use CUDA shared memory for this use case in Python backend.

Tabrizian commented 1 year ago

This has been requested in another context too. Unfortunately, this is not possible right now. This enhancement makes sense to me but I don't think we will get to it any time soon.

philipp-schmidt commented 1 year ago

We have the same issue, as already discussed in #5687. I'm moving the discussion here because the underlying issue is the same and I would like to keep the attention in one place.

We see significant performance degradation on multi-GPU systems when using CUDA shared memory buffers. Using perf_client and maxing out requests, we see GPU 0 sitting at 100% utilization while all the others sit at 20-30%. Total throughput is significantly lower than with system shared memory. This is of course because perf_client allocates its test buffers on GPU 0, which highlights exactly the underlying issue. On single-GPU systems the performance of CUDA shared memory is superior to system shared memory.
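For reference, the comparison can be reproduced along these lines with perf_analyzer (model name, buffer size, and concurrency are placeholders):

```
# CUDA shared memory: perf_analyzer places its buffers on GPU 0
perf_analyzer -m my_model --shared-memory=cuda --output-shared-memory-size=4194304 --concurrency-range=16

# System shared memory, for comparison
perf_analyzer -m my_model --shared-memory=system --output-shared-memory-size=4194304 --concurrency-range=16
```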

On the other hand, we see very high CPU load when using system shared memory, due to the memory copy through the CPU. So this is not an option for us either.

We may try the suggested workaround in the meantime, but it's not really the solution we are looking for. There has to be a way to force the scheduler to respect the memory placement of the buffers. Without this, I don't consider Triton to be fully compatible with e.g. DeepStream, given the current degradation in performance that we are seeing.

I'm wondering if there is anything we can do to raise the priority of this issue. We have multiple customer deployments that would benefit from this improvement. @Tabrizian, any chance you could elaborate on what the main challenges to supporting this would be and how much effort it would take?

Tabrizian commented 1 year ago

Hi @philipp-schmidt, yes, this is an important feature and we're working on it. The main challenge is that the schedulers/rate limiter currently do not take into account where the model instances are located when dispatching the requests. We need to make some changes to support this.

We'll let you know as soon as we have an update about this.

philipp-schmidt commented 10 months ago

Hello @Tabrizian, is there any news on this feature? We have implemented the suggested workaround (one model per GPU, naming each model with a GPU suffix) but still see degraded performance. It seems there is a fundamental performance issue with CUDA shared memory on multi-GPU systems.

Tabrizian commented 10 months ago

@philipp-schmidt Could you please share the model repository and the system on which you conducted this experiment? I think creating separate models should provide the perf improvement that you're looking for.

We'll let you know as soon as there is more information regarding this feature.

Without this, I don't consider Triton to be fully compatible with e.g. DeepStream, given the current degradation in performance that we are seeing.

Could you elaborate more on this? How does this feature help with Triton+Deepstream integration?

monk-after-90s commented 7 months ago

How about now?

philipp-schmidt commented 6 months ago

Hello @Tabrizian, and happy new year! Sorry for the late reply; we were pretty busy in Q4. I'm hoping we can address this together this year, and we are happy to share our insights.

We tested this a while ago and came to the conclusion that there must be something off with the scheduling / memory handling when using CUDA shared memory. We implemented a test setup in which the CUDA buffers are placed on the correct GPU and the model instance on that same GPU is then asked to run inference on that buffer. This is the setup that was suggested in this thread. So even though there should be no memcpy involved, we measured 25% or worse performance degradation whenever the GPU ID is not 0 but any other ID. This leads us to believe that the Triton scheduler treats cuda buffers on GPU ID 0 differently from buffers on other GPUs.
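A rough sketch of that test setup (the per-GPU model naming, tensor names, and sizes are our own placeholders, not Triton conventions):

```python
import time
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

INPUT_NAME = "INPUT__0"     # placeholder
shape = (1, 3, 720, 1280)
data = np.random.rand(*shape).astype(np.float32)
byte_size = data.nbytes

client = httpclient.InferenceServerClient("localhost:8000")

for gpu_id in (0, 1):
    region = f"input_gpu{gpu_id}"
    # The buffer is allocated on the same GPU that hosts model_<gpu_id>,
    # so no cross-GPU copy should be needed.
    handle = cudashm.create_shared_memory_region(region, byte_size, gpu_id)
    cudashm.set_shared_memory_region(handle, [data])
    client.register_cuda_shared_memory(
        region, cudashm.get_raw_handle(handle), gpu_id, byte_size
    )

    inp = httpclient.InferInput(INPUT_NAME, list(shape), "FP32")
    inp.set_shared_memory(region, byte_size)

    start = time.perf_counter()
    for _ in range(100):
        client.infer(f"model_{gpu_id}", inputs=[inp])
    print(f"GPU {gpu_id}: {(time.perf_counter() - start) * 10:.2f} ms/infer")

    client.unregister_cuda_shared_memory(region)
    cudashm.destroy_shared_memory_region(handle)
```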

With this knowledge we have implemented another workaround, which currently solves our issues. We are happy to get rid of the additional complexity though.

We can provide a more detailed analysis, but to make it reproducible for others we would need the following features added to perf_analyzer:

What are the chances this could make it into https://github.com/triton-inference-server/client ?

Tabrizian commented 5 months ago

We investigated this feature and, after some analysis, realized that the perf impact could be lower than it initially seems. This optimization is most likely only useful if 1) you have large tensors, 2) concurrency is low, and 3) there are few ensemble steps.

If concurrency is high (i.e., all the model instances are busy), it might be better to proceed with the cross-GPU data copy rather than wait for the optimal model instance to become available, since data copies between GPUs are usually quite fast (especially if you're using NVLink).
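To put rough numbers on that (the bandwidth figures below are illustrative assumptions, not measurements):

```python
# Back-of-the-envelope copy-time estimate for a decoded frame.
frame_bytes = 1 * 3 * 1080 * 1920 * 4          # one FP32 1080p frame, ~24.9 MB

bandwidths_gb_s = {
    "PCIe (assumed ~12 GB/s effective)": 12,
    "NVLink (assumed ~50 GB/s effective)": 50,
}

for link, gb_s in bandwidths_gb_s.items():
    ms = frame_bytes / (gb_s * 1e9) * 1e3
    print(f"{link}: ~{ms:.2f} ms per frame copy")
# ~2.07 ms over the assumed PCIe link, ~0.50 ms over the assumed NVLink,
# which is often small compared to model execution time.
```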

Based on this information, we have deprioritized this feature. We'd be happy to re-evaluate if you're aware of use cases that would greatly benefit from this optimization.

@philipp-schmidt

This leads us to believe that the Triton scheduler treats cuda buffers on GPU ID 0 differently from buffers on other GPUs.

Off the top of my head I can't think of any logic specific to GPU 0. By any chance, do you have a small reproducer that we could use to look into this?

@matthewkotila @tgerdesnv Do you think these features are something that we could add to Perf Analyzer? I believe we already have a ticket for the second request.