triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

SSLEOFError when result from async_infer is not available in http client #7515

Open briedel opened 3 months ago

briedel commented 3 months ago

We are running tritonclient[http]==2.41.0 against a server running 24.06-py3. When there are O(600) requests reaching the server, we intermittently receive the following error from Triton:

Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 908, in gevent._gevent_cgreenlet.Greenlet.run
…
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2384)
2024-08-08T18:43:45Z <Greenlet at 0x154ae0af4680: wrapped_post('v2/models/tglauch_classifier/infer', b'{"inputs":[{"name":"Input-Branch1","shape":[250,, None, None)> failed with SSLEOFError

…
    result = async_requests[0].get_result()
…
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2384)

The code we use is:


        async_requests = []

        input = [
                httpclient.InferInput(self.model_input_name,
                                        input_data.shape,
                                        self.model_input_type) ]
        output = [
                httpclient.InferRequestedOutput(
                    self.model_output_name) ]

        input[0].set_data_from_numpy(
            input_data.astype(np.single), 
            binary_data=False)

        async_requests.append(
                self.triton_client.async_infer(
                    model_name=self.nn_model_name,
                    inputs=input,
                    outputs=output))

        if len(async_requests) == 1:
            for attempt in range(10):
                try:
                    time.sleep(attempt*10.)
                    result = async_requests[0].get_result()
                except Exception as e:
                    log_warn("Exception {}".format(e.message()))
                else:
                    break
            else:
                log_fatal("can't get results")     
            output_data = result.as_numpy(self.model_output_name)
            return output_data
        else:
            outputs = []
            for async_request in async_requests:
                result = async_request.get_result()
                outputs.append(result.as_numpy(self.model_output_name))
            return np.array(outputs)

where the client is defined as follows:


            self.triton_client = httpclient.InferenceServerClient(
                url=self.triton_server_uri, 
                verbose=self.triton_verbose,
                concurrency=self.request_limit, 
                ssl=True,
                # We will want to make this connection secure in the 
                # long-run, but right now it will work as it 
                # reports a self-signed cert
                insecure=True,
                ssl_context_factory=gevent.ssl._create_unverified_context
                )
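For reference, the same constructor also accepts explicit client-side timeouts; I have not tested whether raising them changes anything, but here is a sketch (the URL, concurrency, and timeout values are placeholders, and connection_timeout / network_timeout are the documented tritonclient.http parameters):

    import gevent.ssl
    import tritonclient.http as httpclient

    # Sketch only: same client construction as above, but with explicit
    # client-side timeouts spelled out. The 300.0 values are placeholders,
    # not something we have tried.
    triton_client = httpclient.InferenceServerClient(
        url="triton.example.org:443",     # placeholder endpoint
        verbose=False,
        concurrency=8,                    # placeholder request limit
        connection_timeout=300.0,         # seconds to establish a connection
        network_timeout=300.0,            # seconds to wait for a response
        ssl=True,
        insecure=True,                    # still accepting the self-signed cert
        ssl_context_factory=gevent.ssl._create_unverified_context,
    )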

We tried changing the batch size, which improves things, but even with a batch size of 1 we still get the error.

Looking at nv_inference_request_duration_us, we see that it shoots above 60 seconds, GPU utilization spikes to 100%, and GPU memory hovers around 43 GB (NVIDIA L40). We have the server configured to allow four model instances per GPU.
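For context, this is roughly how I am deriving the per-request latency; both metrics are cumulative counters scraped from the metrics endpoint (the host below is a placeholder, and the default metrics port of 8002 is an assumption about our deployment):

    import requests

    METRICS_URL = "http://triton.example.org:8002/metrics"  # placeholder host

    def sum_counter(text, name):
        """Sum a Prometheus counter across all labelled series (rough aggregate)."""
        total = 0.0
        for line in text.splitlines():
            if line.startswith(name):
                total += float(line.rsplit(" ", 1)[-1])
        return total

    body = requests.get(METRICS_URL, timeout=10).text
    duration_us = sum_counter(body, "nv_inference_request_duration_us")
    successes = sum_counter(body, "nv_inference_request_success")

    # Both counters are cumulative since server start, so the ratio is the
    # average request duration, which is where the > 60 second figure comes from.
    if successes:
        print("avg request duration: %.1f s" % (duration_us / successes / 1e6))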

This seems to be either an issue with the k8s autoscaler not spawning replicas at all, or the await in async_infer not really awaiting.
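To spell out what I mean about the awaiting, my mental model of the HTTP client (a sketch, not verified against the client source) is that async_infer only hands the POST to a gevent pool sized by the concurrency argument and returns a handle immediately, so the only blocking point is get_result():

    # Sketch of the call pattern, reusing the client/inputs/outputs built above;
    # there is no asyncio await involved, the concurrency comes from gevent.
    handle = self.triton_client.async_infer(
        model_name=self.nn_model_name,
        inputs=input,
        outputs=output)

    # Returns immediately; the request is in flight on a greenlet.
    result = handle.get_result()   # blocks until the response, or raises the
                                   # SSLEOFError shown above if the TLS stream
                                   # is cut off mid-request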

Any suggestions on how to improve this?