triton-inference-server / client

Triton Python, C++, and Java client libraries, and gRPC-generated client examples for Go, Java, and Scala.

Optimize the request body creation in HTTP Python clients #620

Closed: krishung5 closed this 2 months ago

krishung5 commented 2 months ago

This PR improves the performance of request body creation in the HTTP Python client by replacing struct.pack with list appends and a single b"".join. For more detailed numbers, please refer to the comments in Jira ticket DLIS-6498.
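
In essence, the patch swaps a struct.pack-based concatenation for list appends plus one final b"".join. A minimal sketch of the idea (json_bytes and data are illustrative names, not the actual client variables; the full benchmark is in test.py below):

import struct

json_bytes = b'{"inputs": []}'  # placeholder JSON header
data = b"\x01" * 1024           # placeholder binary tensor data

# Before: struct.pack copies both pieces into a freshly packed buffer
body_old = struct.pack(
    "{}s{}s".format(len(json_bytes), len(data)), json_bytes, data
)

# After: collect the chunks in a list and concatenate once at the end
chunks = [json_bytes, data]
body_new = b"".join(chunks)

assert body_old == body_new  # same bytes, fewer intermediate copies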

The improvement was observed in multiple scenarios:

identity model (one input, one output)

Since the number of inputs can impact performance (each additional input means more work for both struct.pack and the list operations), a model with more inputs was also measured:

python model (three inputs, three outputs)

In addition to the experiments above, simply running the Python code without Triton also shows that the new code achieves better throughput, as shown below:

root@a2826b5-lcedt:/client/test# python3 test.py 
Original code execution time: 6.882894917973317
New code execution time: 3.7262046359828673

An improvement of approximately 45.91% in execution time was observed with three inputs.

test.py

import struct
import timeit

# Test data
# request_json = '{"inputs":[{"name":"INPUT0","shape":[1,1048576],"datatype":"INT32","parameters":{"binary_data_size":4194304}}],"outputs":[{"name":"OUTPUT0","parameters":{"binary_data":true}}]}'
request_json = '''
{
  "inputs": [
    {"name": "INPUT0", "shape": [1, 1048576], "datatype": "INT32", "parameters": {"binary_data_size": 4194304}},
    {"name": "INPUT1", "shape": [1, 1048576], "datatype": "INT32", "parameters": {"binary_data_size": 4194304}},
    {"name": "INPUT2", "shape": [1, 1048576], "datatype": "INT32", "parameters": {"binary_data_size": 4194304}}
  ],
  "outputs": [
    {"name": "OUTPUT0", "parameters": {"binary_data": true}},
    {"name": "OUTPUT1", "parameters": {"binary_data": true}},
    {"name": "OUTPUT2", "parameters": {"binary_data": true}}
  ]
}
'''

json_size = len(request_json)  # length of the JSON header portion (ASCII, so str length == byte length)
num_inputs = 3

def get_binary_data():
    # Simulating larger binary data
    return b"\x01" * (1048576 * 10)  # 10 MB of binary data

# Option 1 - original
def option_1():
    binary_data = None
    for _ in range(num_inputs):  # one chunk of binary data per input tensor
        raw_data = get_binary_data()
        if raw_data is not None:
            if binary_data is not None:
                # bytes += copies the accumulated buffer on every iteration
                binary_data += raw_data
            else:
                binary_data = raw_data
    if binary_data is not None:
        request_body = struct.pack(
            "{}s{}s".format(len(request_json), len(binary_data)),
            request_json.encode(),
            binary_data,
        )
        return request_body, json_size

    # No binary data: the JSON request alone is the whole body
    return request_json.encode(), None

# Option 2 - new patch
def option_2():
    request_body = [request_json.encode()]
    for _ in range(num_inputs):  # one chunk of binary data per input tensor
        raw_data = get_binary_data()
        if raw_data is not None:
            # Defer concatenation: appending keeps a reference, no copying yet
            request_body.append(raw_data)
    if len(request_body) == 1:
        # No binary data was appended: the JSON alone forms the request body
        return request_body[0], None
    else:
        return b"".join(request_body), json_size

# Time option 1
time_option_1 = timeit.timeit("option_1()", globals=globals(), number=100)

# Time option 2
time_option_2 = timeit.timeit("option_2()", globals=globals(), number=100)

# Print the results
print("Original code execution time:", time_option_1)
print("New code execution time:", time_option_2)
rmccorm4 commented 2 months ago

Can you add a summary of the changes/motivation to the description? Was there an observed X% improvement by doing this? Is it only applicable to certain scenarios? Was struct.pack slower than json.dumps? etc.

tanmayv25 commented 2 months ago

I believe the performance numbers are shared as part of DLIS-6498.

krishung5 commented 2 months ago

@rmccorm4 Updated the description with the summary. For more detailed numbers, please see the ticket. Thanks!