We are running tritonclient[http]==2.41.0 with the server running 24.06-py3. When there are O(600) requests reaching the server, we intermittently receive the following error from Triton:
Traceback (most recent call last):
File "src/gevent/greenlet.py", line 908, in gevent._gevent_cgreenlet.Greenlet.run
…
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2384)
2024-08-08T18:43:45Z <Greenlet at 0x154ae0af4680: wrapped_post('v2/models/tglauch_classifier/infer', b'{"inputs":[{"name":"Input-Branch1","shape":[250,, None, None)> failed with SSLEOFError
…
result = async_requests[0].get_result()
…
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2384)
The code we use is:
# (method body; assumes `import time`, `import numpy as np`,
#  and `import tritonclient.http as httpclient` at module level)
async_requests = []
input = [
    httpclient.InferInput(self.model_input_name,
                          input_data.shape,
                          self.model_input_type)]
output = [
    httpclient.InferRequestedOutput(
        self.model_output_name)]
input[0].set_data_from_numpy(
    input_data.astype(np.single),
    binary_data=False)
async_requests.append(
    self.triton_client.async_infer(
        model_name=self.nn_model_name,
        inputs=input,
        outputs=output))
if len(async_requests) == 1:
    # retry get_result() up to 10 times with a growing back-off
    for attempt in range(10):
        try:
            time.sleep(attempt * 10.)
            result = async_requests[0].get_result()
        except Exception as e:
            log_warn("Exception {}".format(e))
        else:
            break
    else:
        log_fatal("can't get results")
    output_data = result.as_numpy(self.model_output_name)
    return output_data
else:
    outputs = []
    for async_request in async_requests:
        result = async_request.get_result()
        outputs.append(result.as_numpy(self.model_output_name))
    return np.array(outputs)
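As far as we can tell, calling get_result() a second time on the same InferAsyncRequest handle after a failure just re-raises the stored exception, so the retry loop above would probably have to re-submit the request instead. A rough sketch of that idea (submit_request is a hypothetical helper wrapping the InferInput / async_infer setup shown above, not something that exists in our code today):

# Sketch only: retry by re-submitting the request rather than re-polling
# the same handle. submit_request() is a hypothetical helper that builds
# the InferInput / InferRequestedOutput and calls async_infer as above.
def infer_with_retry(self, input_data, attempts=10):
    for attempt in range(attempts):
        time.sleep(attempt * 10.)
        handle = self.submit_request(input_data)
        try:
            result = handle.get_result()
        except Exception as e:
            log_warn("attempt {} failed: {}".format(attempt, e))
        else:
            return result.as_numpy(self.model_output_name)
    log_fatal("can't get results")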
The client itself is defined as follows:
self.triton_client = httpclient.InferenceServerClient(
    url=self.triton_server_uri,
    verbose=self.triton_verbose,
    concurrency=self.request_limit,
    ssl=True,
    # We will want to make this connection secure in the long run,
    # but right now the server reports a self-signed cert,
    # so we skip verification
    insecure=True,
    ssl_context_factory=gevent.ssl._create_unverified_context
)
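We do not set connection_timeout or network_timeout here, so the client presumably uses its defaults (60 seconds each, if we read the client API correctly). Making them explicit would look roughly like the sketch below; the 300-second values are placeholders we have not actually tested:

# Sketch: same client, but with explicit (longer) timeouts;
# the 300 s values are placeholders, not tested settings.
self.triton_client = httpclient.InferenceServerClient(
    url=self.triton_server_uri,
    verbose=self.triton_verbose,
    concurrency=self.request_limit,
    connection_timeout=300.0,
    network_timeout=300.0,
    ssl=True,
    insecure=True,
    ssl_context_factory=gevent.ssl._create_unverified_context
)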
We tried changing the batch size and that improves things, but even with a batch size of 1 we get the error.
Looking at the nv_inference_request_duration_us metric we see that it shoots above 60 seconds, the GPU utilization spikes to 100%, and the GPU memory hovers around 43 GB (NVIDIA L40). We have the server configured to allow 4 instances of the model per GPU.
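For reference, that per-GPU instance count corresponds to an instance_group stanza in the model's config.pbtxt roughly like the following (illustrative, not copied verbatim from our config):

# Sketch of an instance_group allowing 4 execution instances per GPU
# (illustrative, not our exact config.pbtxt).
instance_group [
  {
    count: 4
    kind: KIND_GPU
  }
]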
This seems to be either an issue with the k8s autoscaler not spawning replicas at all, or with the await in async_infer not really awaiting.
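(By "awaiting" we mean the pattern below: async_infer() returns an InferAsyncRequest handle right away, and get_result() is supposed to block until the response is in. The snippet just restates our code above for clarity.)

# How we understand the async HTTP client to behave:
# async_infer() returns immediately, get_result() blocks.
handle = self.triton_client.async_infer(
    model_name=self.nn_model_name,
    inputs=input,
    outputs=output)             # returns right away, request runs on a greenlet
result = handle.get_result()    # blocks until the response (or an error) arrives
output_data = result.as_numpy(self.model_output_name)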
Any suggestions on how to improve things?