triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Python backend on CPU is slower when serving a pytorch model #3386

Closed SaratM34 closed 2 years ago

SaratM34 commented 3 years ago

Description: I have a Python model that uses a pre-trained RoBERTa model for inference. I have added this model to Triton to serve it with the Python backend. We also have the exact same Python code/model being served by a FastAPI application. Both run on hardware with the same specs. When I compared the two in terms of CPU performance, the latency with Triton is much higher. I used the PyTorch profiler to debug what is causing the higher latencies with Triton. The screenshots below show the profiler output.

Triton-CPU

[screenshot: PyTorch profiler output, Triton on CPU]

FastAPI-CPU

[screenshot: PyTorch profiler output, FastAPI on CPU]

Based on the screenshots, native_layer_norm in particular takes significantly longer with Triton than with the same model running in our FastAPI application. native_layer_norm is part of the pre-trained RoBERTa model.

Triton Information: What version of Triton are you using? Version 21.07.

Are you using the Triton container or did you build it yourself? I built the image myself based on r21.07, but I have also tested serving the model with the official Triton containers (r21.07 and r21.08); the issue remains the same.

To Reproduce Steps to reproduce the behavior.

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

Dependencies: torch==1.6.0 transformers==3.5.1

config.pbtxt

name: "sample-model"
backend: "python"
max_batch_size: 8

input [
  {
    name: "INPUT0"
    data_type: TYPE_STRING
    dims: [1]
  }
]

output [
  {
    name: "OUTPUT0"
    data_type: TYPE_STRING
    dims: [1]
  }
]

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "<path to execution env>"}
}

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]

Expected behavior: ideally the performance should be similar when the same model is run with Triton.

CoderHam commented 3 years ago

@SaratM34 was the same version of PyTorch used in both cases? The slowdown appears to be framework-specific and not from inside Triton. cc @Tabrizian

SaratM34 commented 3 years ago

@CoderHam yes, I am using the same version of PyTorch in both cases, which is 1.6.0. I have done a lot of testing and debugging, and it seems to happen only when the model runs with Triton, particularly on CPU. Some PyTorch operations take more time with Triton on CPU.

CoderHam commented 3 years ago

Is there a reason for not using a TorchScript model via the PyTorch backend?

SaratM34 commented 3 years ago

The reason we are using the Python backend is that we have pre-processing and post-processing code that must run for each request, and we also rely on the ability to use custom conda environments.

Tabrizian commented 3 years ago

It is very strange that you're seeing different performance. Do both versions use the same installation method (i.e., conda)? It could be that some underlying library is different, which is causing this performance regression.

SaratM34 commented 3 years ago

The FastAPI app uses Python 3.7, so I created a custom Python 3.7 backend stub to use with Triton. I also created a custom conda environment with the required dependencies and used the same environment with both Triton and the FastAPI app. The difference in latencies is significant only in the CPU case. Below are the results when the same profiling is done with both running on GPU.

Triton-GPU

[screenshot: PyTorch profiler output, Triton on GPU]

FastAPI-GPU

[screenshot: PyTorch profiler output, FastAPI on GPU]

tanmayv25 commented 2 years ago

@SaratM34 Can you share instructions to access the model and the steps to follow so that we can reproduce the issue on our end? Did you observe the issue with any other models?

tanmayv25 commented 2 years ago

Looks similar to issue https://github.com/triton-inference-server/server/issues/2958.

SaratM34 commented 2 years ago

@tanmayv25 thanks for the link to a similar issue. I looked at it; in my case I don't do much numpy indexing. I only index once, i.e., when getting the input for the request: pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()[0].

Regarding the issue, I think it can be observed in Python models that use large RoBERTa models for inference. One more thing I observed when the above model is served alongside other models on CPU is that it affects the performance of those other models too. So I added torch.set_num_threads(1) to the code of the model we are having issues with (see the sketch below). Doing this improved overall performance when serving multiple models, although the model itself is still slow.
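
For illustration, a minimal sketch of where that call sits in a Python backend model.py (trimmed to initialize; everything besides the torch.set_num_threads line is just the usual boilerplate, not the actual model):

import json

import torch


class TritonPythonModel:

    def initialize(self, args):
        # Cap PyTorch intra-op parallelism so this model does not starve
        # the other models sharing the same CPU cores.
        torch.set_num_threads(1)
        # ... rest of the usual setup (parse model_config, load model/tokenizer) ...
        self.model_config = json.loads(args['model_config'])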

tanmayv25 commented 2 years ago

Thanks a lot for the insight! As noted in the linked issue, I am aware of the slowdown when running torch within the Triton Python backend. This may be caused by thread contention within Triton. Triton core doesn't create many threads, and they are likely not very CPU-intensive. I will take a look at the CPU usage to get a better insight.

Can you share your model.py and the Python script used for the perf comparison? This will save me some time. I tried accessing the RoBERTa model from torch.hub but was seeing HTTP 403 errors.

tanmayv25 commented 2 years ago

@SaratM34 OK, so I didn't try FastAPI. I just tried running RoBERTa from Hugging Face as a plain Python script within the same environment as Triton, but couldn't reproduce the perf difference.

transformers : Version 4.12.5
torch : Version 1.8.2 
conda-pack : Version 0.6.0

Observations

Python Script

---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
            model_inference        18.44%      10.525ms        99.95%      57.049ms      57.049ms             1  
               aten::linear         1.33%     757.000us        65.83%      37.572ms     514.685us            73  
               aten::matmul         3.13%       1.785ms        65.80%      37.557ms     391.219us            96  
                   aten::mm        55.38%      31.609ms        55.79%      31.844ms     442.278us            72  
           aten::contiguous         0.44%     251.000us         2.53%       1.442ms      30.042us            48  
           aten::layer_norm         0.26%     148.000us         2.37%       1.355ms      54.200us            25  
                 aten::gelu         2.13%       1.214ms         2.34%       1.336ms     111.333us            12  
                 aten::add_         2.16%       1.233ms         2.16%       1.233ms      16.890us            73  
    aten::native_layer_norm         1.84%       1.048ms         2.11%       1.207ms      48.280us            25  
         aten::_unsafe_view         1.51%     860.000us         1.88%       1.075ms      11.198us            96  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 57.078ms

Triton Model.py

---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
            model_inference        19.07%      11.139ms        99.92%      58.357ms      58.357ms             1  
               aten::linear         1.20%     702.000us        65.97%      38.526ms     527.753us            73  
               aten::matmul         2.61%       1.526ms        65.74%      38.393ms     399.927us            96  
                   aten::mm        56.70%      33.114ms        57.00%      33.291ms     462.375us            72  
           aten::layer_norm         0.25%     147.000us         2.28%       1.331ms      53.240us            25  
           aten::contiguous         0.41%     242.000us         2.20%       1.286ms      26.792us            48  
                 aten::add_         2.20%       1.284ms         2.20%       1.284ms      17.589us            73  
    aten::native_layer_norm         1.77%       1.033ms         2.03%       1.184ms      47.360us            25  
                    aten::t         1.04%     608.000us         1.83%       1.069ms      14.644us            73  
                 aten::view         1.78%       1.040ms         1.78%       1.040ms       3.250us           320  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 58.402ms

Sources

Python Script

from transformers import RobertaTokenizer, RobertaModel
import torch
from torch.profiler import profile, record_function, ProfilerActivity

device = torch.device("cpu")
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

with torch.no_grad():
    inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    import time
    t1 = time.time()
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
        with record_function("model_inference"):
            outputs = model(**inputs)

print('python script spends {}ms'.format((time.time()-t1)*1000))

last_hidden_states = outputs.last_hidden_state
print(last_hidden_states.size())
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

Model.py

from transformers import RobertaTokenizer, RobertaModel
import numpy as np
import json
import torch
import triton_python_backend_utils as pb_utils
from torch.profiler import profile, record_function, ProfilerActivity

class TritonPythonModel:

    def initialize(self, args):
        self.model_config = model_config = json.loads(args['model_config'])

        output0_config = pb_utils.get_output_config_by_name(
            model_config, "OUTPUT0")

        self.output0_dtype = pb_utils.triton_string_to_numpy(
            output0_config['data_type'])
        #roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
        #roberta.eval()  # disable dropout for evaluation
        self.tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
        self.model = RobertaModel.from_pretrained('roberta-base')

    def execute(self, requests):
        """ This function is called on inference request.
        """

        output0_dtype = self.output0_dtype

        device = torch.device("cpu")
        responses = []
        for request in requests:
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()

            with torch.no_grad():
                inputs = self.tokenizer(in_0[0].decode("utf-8"), return_tensors="pt")
                import time
                t1 = time.time()
                with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
                    with record_function("model_inference"):
                        outputs = self.model(**inputs)
                print('python script spends {}ms'.format((time.time()-t1)*1000))

            last_hidden_states = outputs.last_hidden_state.numpy()
            out_tensor_0 = pb_utils.Tensor("OUTPUT0",
                                           last_hidden_states)
            responses.append(
                pb_utils.InferenceResponse([out_tensor_0]))
        print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
        return responses

Model Config.pbtxt

name: "roberta"
backend: "python"

input [
  {
    name: "INPUT0"
    data_type: TYPE_STRING
    dims:  [1]

  }
]

output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [1, 8, 768]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/python-3-8.tar.gz"}
}

Input Data for Perf Analyzer

{
    "data" :
     [
        {
          "INPUT0" : ["Hello, my dog is cute"]
        }
     ]
}

Conclusion

As you can see, the performance is almost identical for these cooked-up scripts. What changes should I make to reproduce the issue? Maybe upgrading the versions resolved it. I can try the script with FastAPI, but it is important that you share the scripts where you saw the perf difference and instructions to reproduce it. It will be difficult to make any progress on this specific issue without them.

SaratM34 commented 2 years ago

@tanmayv25 Unfortunately, I can't share the actual model, but I tried to reproduce the issue using a different model. It is not as slow as ours, but as the request load increases it performs slower and slower. Please find the required files below; you can download the model files from here: https://drive.google.com/drive/folders/1nzC2_GFh27mt8KP4dfGxewFP8BkEQEHH?usp=sharing

I built this model using the notebook below, saved the model state_dict, and used it for inference: https://colab.research.google.com/github/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb

Triton Model.py

import numpy as np
import json
import triton_python_backend_utils as pb_utils

import torch

from transformers import RobertaModel, RobertaTokenizer

class RobertaClass(torch.nn.Module):
    def __init__(self):
        super(RobertaClass, self).__init__()
        self.l1 = RobertaModel.from_pretrained("roberta-base")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.3)
        self.classifier = torch.nn.Linear(768, 5)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

class TritonPythonModel:

    def initialize(self, args):

        # You must parse model_config. JSON string is not parsed here
        self.model_config = model_config = json.loads(args['model_config'])

        # Get OUTPUT0 configuration
        output0_config = pb_utils.get_output_config_by_name(
            model_config, "OUTPUT0")

        # Convert Triton types to numpy types
        self.output0_dtype = pb_utils.triton_string_to_numpy(
            output0_config['data_type'])
        self.model = RobertaClass()
        self.model.load_state_dict(torch.load('/models/roberta_test/1/files/pytorch_roberta_sentiment.bin', map_location=torch.device('cpu')))
        self.model.eval()
        self.tokenizer = RobertaTokenizer.from_pretrained('/models/roberta_test/1/files/', truncation=True, do_lower_case=True)

    def preprocess_data(self, sentence):
        inputs = self.tokenizer.encode_plus(
            sentence,
            None,
            add_special_tokens=True,
            max_length=256,
            pad_to_max_length=True,
            return_token_type_ids=True
        )

        ids = [inputs['input_ids']]
        mask = [inputs['attention_mask']]
        token_type_ids = inputs["token_type_ids"]
        data = {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long)
        }

        return data

    def execute(self, requests):

        output0_dtype = self.output0_dtype
        responses = []
        for request in requests:
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            payload = json.loads(in_0.as_numpy()[0][0].decode("utf-8"))
            sentence = payload["data"]
            data = self.preprocess_data(sentence)
            with torch.no_grad():
                outputs = self.model(data['ids'], data['mask'], data['token_type_ids']).squeeze()
            result = torch.argmax(outputs).item()
            out_tensor_0 = pb_utils.Tensor("OUTPUT0",
                                           np.array(str(result), dtype='object').astype(output0_dtype))

            inference_response = pb_utils.InferenceResponse(
                output_tensors=[out_tensor_0])
            responses.append(inference_response)

        return responses

    def finalize(self):
        print('Cleaning up...')

Payload for Triton:

{
  "inputs": [
    {
      "name": "INPUT0",
      "shape": [ 1, 1 ],
      "datatype": "BYTES",
      "data": [
        ["{\"data\":\"A series of escapades demonstrating the adage that what is good for the goose\"}"]
      ]
    }
  ]
}

python app.py

import torch
from transformers import RobertaModel, RobertaTokenizer

class RobertaClass(torch.nn.Module):
    def __init__(self):
        super(RobertaClass, self).__init__()
        self.l1 = RobertaModel.from_pretrained("roberta-base")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.3)
        self.classifier = torch.nn.Linear(768, 5)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

class SentimentModel:

    def __init__(self):
        self.model = RobertaClass()
        self.model.load_state_dict(torch.load('files/pytorch_roberta_sentiment.bin', map_location=torch.device('cpu')))
        self.model.eval()
        self.tokenizer = RobertaTokenizer.from_pretrained('files', truncation=True, do_lower_case=True)

    def predict(self, request):
        sentence = request["data"]
        data = self.preprocess_data(sentence)
        with torch.no_grad():
            outputs = self.model(data['ids'], data['mask'], data['token_type_ids']).squeeze()
            result = torch.argmax(outputs).item()

        return {"result": str(result)}

    def preprocess_data(self, sentence):
        inputs = self.tokenizer.encode_plus(
            sentence,
            None,
            add_special_tokens=True,
            max_length=256,
            pad_to_max_length=True,
            return_token_type_ids=True
        )

        ids = [inputs['input_ids']]
        mask = [inputs['attention_mask']]
        token_type_ids = inputs["token_type_ids"]
        data = {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long)
        }

        return data

if __name__ == '__main__':
    sm = SentimentModel()
    data = {"data": "A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story ."}
    print(sm.predict(data))

tanmayv25 commented 2 years ago

I tried these models and still couldn't reproduce the issue. Are you using perf_analyzer to measure the latencies in Triton?

Are you using the HTTP or gRPC endpoint to benchmark? If you are not using perf_analyzer and are using the HTTP endpoint, are you using the binary tensor data extension?

If not, some CPU cycles may be spent decoding the tensor data from JSON, impacting the overall performance.

SaratM34 commented 2 years ago

@tanmayv25 I am using the HTTP endpoint for the benchmark and JMeter to measure latencies. I am not using the binary tensor data extension since the Python model expects a string as input.

tanmayv25 commented 2 years ago

@SaratM34 Can you try the binary data extension? It should help free up some CPU cycles. The request should look like:

{
  "inputs": [
    {
      "name": "INPUT0",
      "shape": [ 1, 1 ],
      "datatype": "BYTES",
      "parameters" : {
        "binary_data_size" : <data byte_size>
      }
    }
  ]
}
<first 4 bytes holding the length of the string><string content>

You should also set the Inference-Header-Content-Length header to the size of the JSON section, excluding the raw data bytes.

With perf_analyzer (which uses the binary data extension over HTTP), I am seeing the same performance as with the Python script. I have a 12-core CPU (Intel(R) Core(TM) i7-7800X @ 3.50GHz).
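
For reference, if the client is written in Python, tritonclient handles this serialization for you. A minimal sketch (assuming the tritonclient[http] package is installed, and reusing the roberta_test model and payload from above):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# BYTES tensors are passed as numpy object arrays of encoded strings
payload = '{"data":"A series of escapades demonstrating the adage that what is good for the goose"}'
data = np.array([[payload.encode("utf-8")]], dtype=np.object_)

inp = httpclient.InferInput("INPUT0", [1, 1], "BYTES")
inp.set_data_from_numpy(data, binary_data=True)   # send via the binary data extension
out = httpclient.InferRequestedOutput("OUTPUT0", binary_data=True)

result = client.infer(model_name="roberta_test", inputs=[inp], outputs=[out])
print(result.as_numpy("OUTPUT0"))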

SaratM34 commented 2 years ago

@tanmayv25 I can try the binary data extension. Could you tell me whether the request below is correct if I am using the string "triton test request"? Also, what would the Inference-Header-Content-Length be? I am having difficulty understanding what to include when calculating the size.

{
  "inputs": [
    {
      "name": "INPUT0",
      "shape": [ 1, 1 ],
      "datatype": "BYTES",
      "parameters" : {
        "binary_data_size" : 19
      }
    }
  ]
}
00000000000000000000000000010011triton test request

tanmayv25 commented 2 years ago

It is described here: https://github.com/triton-inference-server/server/issues/2478#issuecomment-772151437.

I am not sure the bytes will look exactly like the above. For string serialization, you can take a hint from here: https://github.com/triton-inference-server/client/blob/87255faf0e9769b55a1282b5ac32820e66ee9326/src/python/library/tritonclient/utils/__init__.py#L187

And how we build the request body can be seen here: https://github.com/triton-inference-server/client/blob/main/src/python/library/tritonclient/http/__init__.py#L81-L128

The Inference-Header-Content-Length header will be set to the json_size from the above function. We have a similar implementation in our C++ client library.
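
To make the layout concrete, here is a rough Python sketch of assembling such a request by hand (untested; the endpoint, model name, and payload are taken from the examples above, and the 4-byte length prefix is little-endian, following the serialize_byte_tensor helper linked above):

import json
import struct

import requests

text = b'{"data":"A series of escapades demonstrating the adage that what is good for the goose"}'
# each BYTES element is serialized as a 4-byte little-endian length followed by the raw bytes
binary_tensor = struct.pack("<I", len(text)) + text

header_json = json.dumps({
    "inputs": [{
        "name": "INPUT0",
        "shape": [1, 1],
        "datatype": "BYTES",
        "parameters": {"binary_data_size": len(binary_tensor)},
    }]
}).encode("utf-8")

resp = requests.post(
    "http://localhost:8000/v2/models/roberta_test/infer",
    data=header_json + binary_tensor,  # JSON header immediately followed by the raw tensor bytes
    headers={
        # size of the JSON portion only, excluding the raw data bytes
        "Inference-Header-Content-Length": str(len(header_json)),
    },
)
print(resp.status_code, resp.content[:200])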

tanmayv25 commented 2 years ago

So, I performed load testing using JMeter. I am still unable to reproduce the issue with JMeter, even without the binary data extension. The latencies collected:

Number of requests ~= 1000
Number of clients = 1
Setup | Mean Latency (ms) | Max Latency (ms) | Min Latency (ms) | p99 | p95 | p90 | p50
-- | -- | -- | -- | -- | -- | -- | --
Triton (jmeter) | 207.1746362 | 298 | 146 | 268.2 | 255 | 241 | 206.7458506
Python script | 203.69 | 273.08 | 152.14 | 258.28 | 243.09 | 234.6 | 203.11

I tried to use JMeter with the binary data extension. I used the BeanShell PreProcessor to run a script like this:

import java.nio.charset.StandardCharsets;

try {
    ByteArrayOutputStream bodyBytes = new ByteArrayOutputStream();
    byte[] jsonBytes = "{\"inputs\":[{\"name\": \"INPUT0\", \"shape\": [ 1, 1 ], \"datatype\": \"BYTES\",\"parameters\":{\"binary_data_size\" : 98}}]}".getBytes(StandardCharsets.UTF_8);
    bodyBytes.write(jsonBytes);
    byte[] stringBytes = "{\"utterances\":\"A series of escapades demonstrating the adage that what is good for the goose\"}".getBytes(StandardCharsets.UTF_8);
    int string_length = stringBytes.length;
    log.info("JSONsize="+jsonBytes.length);
    log.info("Stringsize="+string_length);
    bodyBytes.write(0);
    bodyBytes.write(0);
    bodyBytes.write(0);
    bodyBytes.write(string_length);
    bodyBytes.write(stringBytes);
    vars.put("mybody",bodyBytes.toString(StandardCharsets.UTF_8));
    log.info("request body="+bodyBytes.toString(StandardCharsets.UTF_8));
}
catch (Throwable e) {
    log.error("Errror in Beanshell", e);
    throw e;
}

and the HTTP request body as

${__V(mybody)}

I used the HTTP Header Manager to set Inference-Header-Content-Length to 110. However, the bytes received on the server side seem to be corrupted; I suspect the .toString() call on the bytes. Hopefully you find this useful if you try JMeter with the binary data extension.

That being said, even without the binary data extension, I am unable to reproduce the issue on a 12-core Intel(R) Core(TM) i7-7800X @ 3.50GHz machine.

SaratM34 commented 2 years ago

@tanmayv25 I see that you are using a request concurrency (number of clients) of 1. The difference in latencies is small when concurrency=1, but as you increase the concurrency the difference grows larger and larger. For example, try concurrencies of 1, 5, 10, and 15.

My JMeter results below:

Triton:

[screenshot: JMeter results for Triton]

Python App:

[screenshot: JMeter results for the Python app]

tanmayv25 commented 2 years ago

OK, let me run with a concurrency of 10 and share that with you.

tanmayv25 commented 2 years ago

I ran perf_analyzer to sweep concurrencies 5, 10, 15, 20, and 25. Note the compute infer value in the logs; it represents how much time the Python backend spent executing the model.py execute function. As can be seen, the average compute infer numbers stay the same across concurrencies.

perf_analyzer -m roberta_test --concurrency-range 5:25:5 --input-data roberta_input_data.json -p 100000
 Successfully read data for 1 stream/streams with 1 step/steps.
*** Measurement Settings ***
  Batch size: 1
  Using "time_windows" mode for stabilization
  Measurement window: 100000 msec
  Latency limit: 0 msec
  Concurrency limit: 25 concurrent requests
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 5
  Client: 
    Request count: 480
    Throughput: 4.8 infer/sec
    Avg latency: 1041343 usec (standard deviation 78806 usec)
    p50 latency: 1038750 usec
    p90 latency: 1139221 usec
    p95 latency: 1167004 usec
    p99 latency: 1202143 usec
    Avg HTTP time: 1048256 usec (send/recv 75 usec + response wait 1048181 usec)
  Server: 
    Inference count: 573
    Execution count: 573
    Successful request count: 573
    Avg request latency: 1047931 usec (overhead 4 usec + queue 838434 usec + compute input 12 usec + compute infer 209423 usec + compute output 58 usec)

Request concurrency: 10
  Client: 
    Request count: 484
    Throughput: 4.84 infer/sec
    Avg latency: 2066066 usec (standard deviation 114997 usec)
    p50 latency: 2075424 usec
    p90 latency: 2197307 usec
    p95 latency: 2237271 usec
    p99 latency: 2286741 usec
    Avg HTTP time: 2078540 usec (send/recv 75 usec + response wait 2078465 usec)
  Server: 
    Inference count: 578
    Execution count: 578
    Successful request count: 578
    Avg request latency: 2078213 usec (overhead 3 usec + queue 1870333 usec + compute input 12 usec + compute infer 207807 usec + compute output 58 usec)

Request concurrency: 15
  Client: 
    Request count: 479
    Throughput: 4.79 infer/sec
    Avg latency: 3133161 usec (standard deviation 170690 usec)
    p50 latency: 3153870 usec
    p90 latency: 3328618 usec
    p95 latency: 3388266 usec
    p99 latency: 3456496 usec
    Avg HTTP time: 3126585 usec (send/recv 76 usec + response wait 3126509 usec)
  Server: 
    Inference count: 575
    Execution count: 575
    Successful request count: 575
    Avg request latency: 3126256 usec (overhead 3 usec + queue 2917797 usec + compute input 12 usec + compute infer 208386 usec + compute output 58 usec)

Request concurrency: 20
  Client: 
    Request count: 474
    Throughput: 4.74 infer/sec
    Avg latency: 4214520 usec (standard deviation 0 usec)
    p50 latency: 4237129 usec
    p90 latency: 4402049 usec
    p95 latency: 4438031 usec
    p99 latency: 4519299 usec
    Avg HTTP time: 4190670 usec (send/recv 76 usec + response wait 4190594 usec)
  Server: 
    Inference count: 573
    Execution count: 573
    Successful request count: 573
    Avg request latency: 4190320 usec (overhead 4 usec + queue 3980957 usec + compute input 12 usec + compute infer 209289 usec + compute output 58 usec)

Request concurrency: 25
  Client: 
    Request count: 486
    Throughput: 4.86 infer/sec
    Avg latency: 5140053 usec (standard deviation 200182 usec)
    p50 latency: 5183779 usec
    p90 latency: 5359216 usec
    p95 latency: 5411205 usec
    p99 latency: 5483123 usec
    Avg HTTP time: 5153659 usec (send/recv 74 usec + response wait 5153585 usec)
  Server: 
    Inference count: 582
    Execution count: 582
    Successful request count: 582
    Avg request latency: 5153336 usec (overhead 3 usec + queue 4947087 usec + compute input 12 usec + compute infer 206176 usec + compute output 58 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 5, throughput: 4.8 infer/sec, latency 1041343 usec
Concurrency: 10, throughput: 4.84 infer/sec, latency 2066066 usec
Concurrency: 15, throughput: 4.79 infer/sec, latency 3133161 usec
Concurrency: 20, throughput: 4.74 infer/sec, latency 4214520 usec
Concurrency: 25, throughput: 4.86 infer/sec, latency 5140053 usec

I also ran load testing on Triton with Number of Threads (clients) set to 25.

$jmeter -t /tmp/host/triton_orig.jmx -n -l /tmp/host/loading_25.csv  
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.thoughtworks.xstream.core.util.Fields (file:/usr/share/java/xstream.jar) to field java.util.TreeMap.comparator
WARNING: Please consider reporting this to the maintainers of com.thoughtworks.xstream.core.util.Fields
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Security framework of XStream not explicitly initialized, using predefined black list on your own risk.
Creating summariser <summary>
Created the tree successfully using /tmp/host/triton_orig.jmx
Starting the test @ Wed Nov 24 02:17:34 UTC 2021 (1637720254995)
Waiting for possible shutdown message on port 4445
summary +      1 in     1s =    1.9/s Avg:   243 Min:   243 Max:   243 Err:     0 (0.00%) Active: 13 Started: 13 Finished: 0
summary +    124 in    25s =    5.0/s Avg:  4475 Min:   440 Max:  5346 Err:     0 (0.00%) Active: 25 Started: 25 Finished: 0
summary =    125 in  25.1s =    5.0/s Avg:  4441 Min:   243 Max:  5346 Err:     0 (0.00%)
summary +    144 in    30s =    4.8/s Avg:  5167 Min:  4697 Max:  5448 Err:     0 (0.00%) Active: 25 Started: 25 Finished: 0
summary =    269 in  55.1s =    4.9/s Avg:  4830 Min:   243 Max:  5448 Err:     0 (0.00%)
summary +    143 in    30s =    4.8/s Avg:  5215 Min:  4755 Max:  5472 Err:     0 (0.00%) Active: 25 Started: 25 Finished: 0
summary =    412 in    85s =    4.8/s Avg:  4963 Min:   243 Max:  5472 Err:     0 (0.00%)
summary +    101 in  20.4s =    4.9/s Avg:  5042 Min:  4508 Max:  5495 Err:     0 (0.00%) Active: 0 Started: 25 Finished: 25
summary =    513 in   105s =    4.9/s Avg:  4979 Min:   243 Max:  5495 Err:     0 (0.00%)
Tidying up ...    @ Wed Nov 24 02:19:20 UTC 2021 (1637720360419)
... end of run
Setup | Mean Latency (ms) | Max Latency (ms) | Min Latency (ms) | p99 | p95 | p90 | p50
-- | -- | -- | -- | -- | -- | -- | --
Triton (jmeter, 25 clients) | 5111.973742 | 5495 | 4179 | 5463.2 | 5425.6 | 5395.2 | 5100.813319

As you can see, at a concurrency level of 25, perf_analyzer (using the binary data extension) and JMeter (not using it) give mean end-to-end latencies of 5140 ms and 5111.97 ms, respectively. They are almost the same.

And since the Python backend takes essentially identical time (~208 ms) at every concurrency level, the issue very likely lies outside the scope of the Python backend's model.py. The performance numbers I collected on Triton are actually very similar to your Python app. I have summarized the Triton JMeter results for concurrency 1, 5, and 10 below:

Number of Threads | Mean Latency (ms) | Max Latency (ms) | Min Latency (ms) | p99 | p95 | p90 | p50
-- | -- | -- | -- | -- | -- | -- | --
1 | 207.1746362 | 298 | 146 | 268.2 | 255 | 241 | 206.7458506
5 | 1051.068132 | 1216 | 446 | 1172.92 | 1154.3 | 1132.8 | 1048.764254
10 | 1890.920705 | 2184 | 1710 | 2147.82 | 2076 | 2027.4 | 1886.765934

Are you running JMeter in non-GUI mode? I have read that the JMeter GUI can consume significant memory and CPU resources, which may severely impact Triton. It is also possible that the FastAPI implementation is itself less CPU-intensive and thus not as affected as Triton. I am running the JMeter CLI inside the Triton container itself.

SaratM34 commented 2 years ago

@tanmayv25 Thank you for running the tests and sharing the results. For my testing, I used JMeter in non-GUI mode. Also, the tests run on an instance separate from the one where Triton and the FastAPI app are running, so they shouldn't affect Triton's performance. Let me re-run the tests on my end; I will share the results and the FastAPI script.

tanmayv25 commented 2 years ago

@SaratM34 From what I remember of our latest communication, the performance was slower only when the server had idle models loaded alongside. Can you try the Python backend with Iman's fix #112? This should help improve the performance, and you probably would not need to set the thread count to 1.

SaratM34 commented 2 years ago

@tanmayv25 Thanks for the update. I am currently building tritonserver with --backend=python:main. Once it's built, I will test it and post an update here.

SaratM34 commented 2 years ago

@tanmayv25 In my initial testing the results look good; the performance is greatly improved. Below is a summary from the initial testing.

Before Fix

Inferences/Second vs. Client Average Batch Latency
Concurrency: 5, throughput: 3.05 infer/sec, latency 1655480 usec
Concurrency: 10, throughput: 3.17 infer/sec, latency 3153540 usec
Concurrency: 15, throughput: 3.21 infer/sec, latency 4687196 usec

After Fix

Inferences/Second vs. Client Average Batch Latency
Concurrency: 5, throughput: 17.72 infer/sec, latency 282101 usec
Concurrency: 10, throughput: 17.75 infer/sec, latency 562845 usec
Concurrency: 15, throughput: 17.95 infer/sec, latency 836044 usec

I have some more testing pending. I will post an update here once I am done with the complete testing.

tanmayv25 commented 2 years ago

Thanks for sharing the results. Please close the issue once you are done with your pending tests and are completely satisfied.

zhaohb commented 2 years ago

@SaratM34 @tanmayv25 I rebuilt tritonserver with --backend=python:22.01, then tested the Python backend and unfortunately found no performance improvement. What is the problem?

This is the result of top:

[screenshot: top output]

I think python backend should be ok.

tanmayv25 commented 2 years ago

The reported slowdown manifests when multiple CPU-intensive models are loaded in Triton. The other, idle models interfere with the performance of the target model because the Python backend threads for those idle models were not sleeping properly and were hogging the CPU. @zhaohb You will not see the speedup if you have only one Python model or if your model is not CPU-intensive enough. I am closing the ticket as @SaratM34 confirmed that he now sees higher performance. Please re-open or file a new bug if you have any other concerns.

zhaohb commented 2 years ago

@tanmayv25 ok, thank you very much.