triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Slow Inference using Triton Java Bindings #5593

Open Backalla opened 1 year ago

Backalla commented 1 year ago

Description
I am trying to run inference using the Java bindings for Triton and the inference is slow: 200+ ms per request. By contrast, running perf_analyzer against the same model served by the Triton container gives 16K+ requests/sec. The implementation is heavily inspired by the Simple.java example code provided with the javacpp-presets.

I have just been timing the inference requests using System.nanoTime(). Upon profiling, I see that the majority of the time is spent waiting for the inference to complete; preparing the inputs and the inference request is quite fast. [profiler screenshot]
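
The timing is essentially just this around the request (a rough sketch; inferOnce() is a made-up stand-in for the code that builds the request and waits for the response):

```java
// Rough timing sketch; inferOnce() is a made-up stand-in for the code that
// builds the inference request and blocks until the response arrives.
long start = System.nanoTime();
float[] prediction = inferOnce(features);
double elapsedMs = (System.nanoTime() - start) / 1_000_000.0;
System.out.println("inference took " + elapsedMs + " ms");
```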

Triton Information
What version of Triton are you using? Triton Server version 2.30.0.

Are you using the Triton container or did you build it yourself? Using the Triton Java Bindings.

To Reproduce
Steps to reproduce the behavior:

  1. Clone the following repository: https://github.com/Backalla/triton-playground.git
  2. Build the jar using the sbt assembly command.
  3. Run the Debug class by running java -cp triton-scala.jar org.booking.recommended.Debug lgbm

You can check the full logs here.

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
This is a simple LightGBM model used for testing Triton. Check the model and config files here. This model takes 32 FP32 features and produces 1 FP32 output.
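
For concreteness, the request payload is just 32 contiguous FP32 values; a rough sketch of packing them with plain java.nio (illustrative names, and it assumes the raw tensor bytes should be in the host's native byte order):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch (illustrative names): pack 32 FP32 features into one contiguous
// direct buffer, matching the single 32-value FP32 input described above.
// Assumes the raw tensor bytes are expected in the host's native byte order.
public class InputPacking {
    public static ByteBuffer packFeatures(float[] features) {
        ByteBuffer buf = ByteBuffer.allocateDirect(features.length * Float.BYTES)
                                   .order(ByteOrder.nativeOrder());
        for (float f : features) {
            buf.putFloat(f);
        }
        buf.flip(); // ready to hand to the inference request
        return buf;
    }
}
```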

Expected behavior
I expect the inference to be a lot faster than 200 ms. Running the same model using LightGBMLib directly takes <10 ms per prediction. I understand that the Simple.java code is just for illustration and not expected to run at production scale. I feel the memory management done at inference in my code is the culprit. I tried to replicate the inference code in http_server.cc but couldn't understand the memory management being done there. It would also help if some documentation about this could be added.
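
To illustrate what I mean by memory management, the direction I was thinking of is reusing one preallocated native buffer across requests instead of allocating and freeing per call. A rough sketch of the idea (not the actual code from my repo; class and method names are made up):

```java
import org.bytedeco.javacpp.FloatPointer;

// Sketch of the idea: allocate one native FP32 buffer up front and reuse it
// for every request instead of allocating and freeing per inference.
// Class and method names are made up for illustration.
public class ReusedInputBuffer {
    // 32 FP32 features, matching the model input described above
    private final FloatPointer input = new FloatPointer(32);

    public FloatPointer fill(float[] features) {
        input.position(0).put(features); // overwrite in place, no new allocation
        return input;
    }
}
```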

Thanks :)

rmccorm4 commented 1 year ago

Hi @Backalla,

Thanks for the detailed issue.

> I feel the memory management done at inference in my code is the culprit

Without having looked deeply at the code, I suspect something similar, given the gap between an example modified from Simple.java and what perf_analyzer achieves.

CC @jbkyang-nvi @dyastremsky who may have some insights on the Java/javacpp side of things.

dyastremsky commented 1 year ago

Have you tried giving the Java program more memory to reduce the frequency of garbage collection to see if that speeds up inference time? E.g. -Xmx256M -Xms256M.
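
For example, combined with the repro command from the issue:

```
java -Xmx256M -Xms256M -cp triton-scala.jar org.booking.recommended.Debug lgbm
```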

Backalla commented 1 year ago

Hello @rmccorm4, Thanks for looking into it. I am also not sure if the way I am testing it is correct.

> Without having looked deeply at the code, I suspect something similar, given the gap between an example modified from Simple.java and what perf_analyzer achieves.

I tried to understand how the memory management is done in http_server.cc but failed to replicate it in Java. It would also help if there were any kind of benchmark available for inference through the javacpp bindings.
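
For reference, my current numbers come from a simple loop along these lines (again a sketch; inferOnce() is a hypothetical stand-in for building the request and waiting for the response):

```java
// Measurement loop sketch; inferOnce() is a hypothetical stand-in.
int warmup = 10, iterations = 100;
for (int i = 0; i < warmup; i++) {
    inferOnce(features); // discarded: JIT warm-up and lazy initialization
}
long start = System.nanoTime();
for (int i = 0; i < iterations; i++) {
    inferOnce(features);
}
double avgMs = (System.nanoTime() - start) / 1_000_000.0 / iterations;
System.out.println("average latency: " + avgMs + " ms");
```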

Backalla commented 1 year ago

Hello @dyastremsky, thanks for the pointer. I tried with more memory and the inference speed did not change. Looking at the resource usage, memory does not seem to be the bottleneck.

Backalla commented 1 year ago

To add some more context: I tried increasing the number of CPUs and memory, and it helped slightly. Currently running on a pod with 8 CPUs and 2500Mi memory, the inference time came down to 80-100 ms, which is faster than the numbers mentioned in the issue (those were measured with 2 CPUs), but still slow.

Backalla commented 1 year ago

A small update on this. I tried with a TensorFlow model: a simple 3-layer model that adds 2 numbers. The inference was really fast, <1 ms per inference. I will do a proper benchmark once I create the HTTP interface around this. There is a small problem though: the predictions are sometimes random numbers like 2.80352464E14. This model works perfectly with TensorFlow Serving and even the Java TensorFlow library. I suspect there could be some issue in reading the output via TRITONSERVER_InferenceResponseOutput from a memory location that was already deallocated by tritonserver, and hence it is reading garbage. Not sure if that makes sense. Please find the complete test logs here. It would really help if there are any pointers around this. Thanks.
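
To make the suspicion concrete: what I think is needed is to copy the output into a plain Java array while the response is still alive, roughly like this (a sketch; outputBase and elementCount are hypothetical stand-ins for the values obtained from TRITONSERVER_InferenceResponseOutput):

```java
import org.bytedeco.javacpp.FloatPointer;
import org.bytedeco.javacpp.Pointer;

// Sketch: copy the FP32 output into a plain Java array *before* the
// inference response (and the native buffer behind it) is released.
// outputBase and elementCount are hypothetical stand-ins for the values
// obtained via TRITONSERVER_InferenceResponseOutput.
public class OutputCopySketch {
    public static float[] copyOutput(Pointer outputBase, int elementCount) {
        float[] values = new float[elementCount];
        new FloatPointer(outputBase).position(0).capacity(elementCount).get(values);
        return values;
    }
}
```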

dyastremsky commented 1 year ago

Thanks for getting numbers and providing a theory, Tushar. We've filed a ticket to investigate.

If you're able to provide a repro for the random number case, that'd be great. We'll look into this.

doracsillag commented 1 year ago

@Backalla, can you provide the CPU model/specification of the machine from which you got the first output logs? You mentioned 2 CPUs; how many cores/threads in total? Thanks!

Backalla commented 1 year ago

Thanks @dyastremsky. Steps to reproduce the TensorFlow random-output problem are as follows:

  1. Clone the following repository: https://github.com/Backalla/triton-playground.git
  2. cd in to triton-scala directory.
  3. Build the jar using the sbt assembly command.
  4. Run the Debug class by running java -cp triton-scala.jar org.booking.recommended.Debug tf

Find the logs of the above steps here.

Backalla commented 1 year ago

Hello @doracsillag, I meant 2 CPU cores and 2 threads per core. The CPU model is Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz

miguelusque commented 1 year ago

Hi @Backalla ,

Apologies, but I still have some doubts about how you are executing your tests.

The CPU model you mentioned above has 8 cores (16 threads).

So, if I understand correctly, you are running the code on 2 CPUs, which would mean the inference runs on 32 threads (2 CPUs x 16 threads per CPU).

Could you please confirm whether the above is correct? If not, may I ask how you are limiting the number of threads used in your experiments? Thanks!

Regards, Miguel

Backalla commented 1 year ago

Hello @miguelusque, thanks for looking into this. The environment I am running in is a Kubernetes pod with 2 cores assigned to it; the pod cannot use any cores on the Kubernetes node beyond those 2. As I mentioned in the comment above, I tried increasing these resources to 8 cores and the performance improved slightly, but it is still ~100 ms per inference.