triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

tritonserver HTTPServer does not detect EOF/connection close unless response is streaming #7077

Open pathorn opened 5 months ago

pathorn commented 5 months ago

Description

When a user performs a long-running inference request via HTTPServer, they may lose the connection or intentionally abort it (ctrl-c from curl). Ideally, the HTTP server would detect this and flag the request as cancelled (such as by setting Request::is_cancelled_).

However, at least on Linux, we found that the HTTPServer does not call triton::server::HTTPAPIServer::InferRequestClass::RequestFiniHook until something (such as a streaming response) attempts to write to the connection.

Triton Information

v2.43.00, compiled with ./build.py --enable-all --enable-nvtx --enable-gpu --quiet --no-container-interactive --build-type=Debug. Reproduced on Ubuntu 24.04, on an NVIDIA GeForce RTX 2070 and also on an H100.

Are you using the Triton container or did you build it yourself? Reproduced both on the tensorrtllm_backend container and on our own build.

To Reproduce

Steps to reproduce the behavior.

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

I am using the tensorrtllm_backend/all_models/inflight_batcher_llm example with the TinyLlama/TinyLlama-1.1B-Chat-v0.1 model, using max_batch_size=2, decoupled_mode=true, and accumulate_tokens=true:

$ diff -ur tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/  /data/tinyllama_backend/tensorrt_llm_bls/
diff -ur tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/model.py /data/tinyllama_backend/tensorrt_llm_bls/1/model.py
--- tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/model.py 2024-03-25 15:34:08.381629507 -0700
+++ /data/tinyllama_backend/tensorrt_llm_bls/1/model.py 2024-04-05 16:16:30.963840112 -0700
@@ -310,6 +310,7 @@

                 #Loop over the trtllm responses
                 for trtllm_response in trtllm_responses:
+                    print("Cancelled: " + str(request.is_cancelled()))

                     if trtllm_response.has_error():
                         raise pb_utils.TritonModelException(
diff -ur tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt /data/tinyllama_backend/tensorrt_llm_bls/config.pbtxt
--- tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt   2024-03-25 15:34:08.381629507 -0700
+++ /data/tinyllama_backend/tensorrt_llm_bls/config.pbtxt   2024-03-29 18:39:43.838353004 -0700
@@ -26,10 +26,10 @@

 name: "tensorrt_llm_bls"
 backend: "python"
-max_batch_size: ${triton_max_batch_size}
+max_batch_size: 2

 model_transaction_policy {
-  decoupled: ${decoupled_mode}
+  decoupled: true
 }

 input [
@@ -209,13 +209,13 @@
 parameters: {
   key: "accumulate_tokens"
   value: {
-    string_value: "${accumulate_tokens}"
+    string_value: "true"
   }
 }

 instance_group [
   {
-    count: ${bls_instance_count}
+    count: 2
     kind : KIND_CPU
   }
 ]

To reproduce, send a generate request to the tensorrt_llm_bls model, both with and without streaming, and compare what happens when interrupting curl.

curl -d  '{"text_input": "Hello\n", "max_tokens": 900, "bad_words": "", "stop_words":[],"stream":false,"stop":true}' http://localhost:8000/v2/models/tensorrt_llm_bls/generate; echo
^C

The server will wait until all 900 tokens have been generated, then will print "Cancelled: False" to the log. This is incorrect.

For comparison, generate_stream behaves correctly, which I believe is due to libevent writing to the socket:

curl -d  '{"text_input": "Hello\n", "max_tokens": 900, "bad_words": "", "stop_words":[],"stream":true,"stop":true}' http://localhost:8000/v2/models/tensorrt_llm_bls/generate_stream; echo
^C

When streaming, the server prints Cancelled: True after the ctrl-C:

Cancelled: False
Cancelled: False
Cancelled: True
Cancelled: True

Expected behavior

As soon as the HTTP connection is closed, libevent should be able to inform evhtp about the EOF event, and IsCancelled should return true.
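
To illustrate the mechanism in question, here is a minimal standalone libevent sketch (not Triton code; the port and names are only for illustration). The bufferevent event callback receives BEV_EVENT_EOF when the peer closes the socket, which is consistent with the events=17 (BEV_EVENT_READING | BEV_EVENT_EOF) seen in the backtrace later in this thread. The caveat is that libevent only notices the EOF while reading is enabled on the bufferevent; if reads are not being serviced, the close typically goes unnoticed until a later write fails.

// Standalone libevent sketch (not Triton code): how a peer-initiated close
// is surfaced to the application.
#include <arpa/inet.h>
#include <netinet/in.h>

#include <cstdio>
#include <cstring>

#include <event2/bufferevent.h>
#include <event2/event.h>
#include <event2/listener.h>

static void EventCb(struct bufferevent* bev, short events, void* /*ctx*/) {
  if (events & BEV_EVENT_EOF) {
    // Peer closed the connection (e.g. curl interrupted with ctrl-C).
    // A server could mark this connection's in-flight request as cancelled here.
    std::printf("connection closed by peer\n");
    bufferevent_free(bev);
  }
}

static void AcceptCb(struct evconnlistener* listener, evutil_socket_t fd,
                     struct sockaddr* /*addr*/, int /*socklen*/, void* /*ctx*/) {
  struct event_base* base = evconnlistener_get_base(listener);
  struct bufferevent* bev =
      bufferevent_socket_new(base, fd, BEV_OPT_CLOSE_ON_FREE);
  bufferevent_setcb(bev, nullptr, nullptr, EventCb, nullptr);
  // Without EV_READ enabled here, EventCb never receives BEV_EVENT_EOF and the
  // close typically goes unnoticed until something writes to the socket.
  bufferevent_enable(bev, EV_READ);
}

int main() {
  struct event_base* base = event_base_new();
  struct sockaddr_in sin;
  std::memset(&sin, 0, sizeof(sin));
  sin.sin_family = AF_INET;
  sin.sin_port = htons(9000);  // arbitrary port, for illustration only
  struct evconnlistener* listener = evconnlistener_new_bind(
      base, AcceptCb, nullptr, LEV_OPT_CLOSE_ON_FREE | LEV_OPT_REUSEABLE, -1,
      reinterpret_cast<struct sockaddr*>(&sin), sizeof(sin));
  event_base_dispatch(base);
  evconnlistener_free(listener);
  event_base_free(base);
  return 0;
}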

pathorn commented 5 months ago

In case it helps, we attempted to isolate the time when is_cancelled_ is set to true. When hitting ctrl-C while streaming, evhtp seems to hit the htp__connection_eventcb_ callback with event flags=17 (READ | EOF):

Thread 172 "tritonserver" hit Hardware watchpoint 2: *0x7ffe48013640

Old value = 1886352384
New value = 1886352385
std::__atomic_base<bool>::store (__m=std::memory_order_seq_cst, __i=true, this=0x7ffe48013640) at /usr/include/c++/11/bits/atomic_base.h:465
warning: 465    /usr/include/c++/11/bits/atomic_base.h: No such file or directory
(gdb) bt
#0  std::__atomic_base<bool>::store (__m=std::memory_order_seq_cst, __i=true, this=0x7ffe48013640)
    at /usr/include/c++/11/bits/atomic_base.h:465
#1  std::__atomic_base<bool>::operator= (this=0x7ffe48013640, __i=true) at /usr/include/c++/11/bits/atomic_base.h:356
#2  0x00007ffff5bc0aa5 in std::atomic<bool>::operator= (this=0x7ffe48013640, __i=true) at /usr/include/c++/11/atomic:80
#3  0x00007ffff5bc0f67 in triton::core::InferenceResponseFactory::Cancel (this=0x7ffe480135d0)
    at /tmp/tritonbuild/tritonserver/build/_deps/repo-core-src/src/infer_response.h:67
#4  0x00007ffff5bc12fa in triton::core::InferenceRequest::Cancel (this=0x7ffe48015590)
    at /tmp/tritonbuild/tritonserver/build/_deps/repo-core-src/src/infer_request.h:708
#5  0x00007ffff5dd004d in TRITONSERVER_InferenceRequestCancel (inference_request=0x7ffe48015590)
    at /tmp/tritonbuild/tritonserver/build/_deps/repo-core-src/src/tritonserver.cc:1695
#6  0x00005555558987fe in triton::server::HTTPAPIServer::InferRequestClass::RequestFiniHook (request=0x7ffe48001fc0, arg=0x7ffe48015aa0)
    at /workspace/src/http_server.cc:3676
#7  0x0000555556060626 in htp__hook_request_fini_ (request=0x7ffe48001fc0)
    at /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-src/libevhtp/libevhtp/evhtp.c:672
#8  0x00005555560614c7 in htp__request_free_ (request=0x7ffe48001fc0)
    at /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-src/libevhtp/libevhtp/evhtp.c:1239
#9  0x00005555560682cc in evhtp_connection_free (connection=0x7ffe7c000e20)
    at /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-src/libevhtp/libevhtp/evhtp.c:5151
#10 0x0000555556064179 in htp__connection_eventcb_ (bev=0x7ffe48001790, events=17, arg=0x7ffe7c000e20)
    at /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-src/libevhtp/libevhtp/evhtp.c:2615
#11 0x0000555556077410 in bufferevent_run_deferred_callbacks_locked (cb=0x7ffe48001930, arg=0x7ffe48001790)
    at /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/libevent/src/libevent/bufferevent.c:161
#12 0x0000555556082079 in event_process_active_single_queue (base=0x7ffe48000e60, activeq=0x7ffe480012b0, max_to_process=2147483647, 
    endtime=0x0) at /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/libevent/src/libevent/event.c:1652
#13 0x00005555560825d7 in event_process_active (base=0x7ffe48000e60)
    at /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/libevent/src/libevent/event.c:1738
#14 0x0000555556082da7 in event_base_loop (base=0x7ffe48000e60, flags=0)
    at /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/libevent/src/libevent/event.c:1961
#15 0x000055555606df86 in _evthr_loop (args=0x5555571cba00)
    at /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-src/libevhtp/libevhtp/thread.c:139
#16 0x00007ffff4c9ca94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#17 0x00007ffff4d29c2c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

We set breakpoints in htp__connection_eventcb_ as well as event_persist_closure, and neither of these callbacks triggers unless the socket is being written to (during streaming, or when generate is finalized).
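
For reference, the frontend end of that call chain (frames #5 and #6) boils down to the evhtp request-fini hook handing the core request to TRITONSERVER_InferenceRequestCancel. The sketch below is a hedged reconstruction rather than the actual http_server.cc code: InferRequest and the hook signature are simplified stand-ins, and the TRITONSERVER_Error*-returning convention is assumed from the rest of the C API.

// Hedged reconstruction of frames #5-#6 above, not the actual http_server.cc
// code. 'InferRequest' stands in for
// triton::server::HTTPAPIServer::InferRequestClass, and the hook signature is
// simplified from the real evhtp callback.
#include "tritonserver.h"  // Triton C API header (path may differ per build)

struct InferRequest {
  TRITONSERVER_InferenceRequest* triton_request;  // core request handle
};

// Invoked by libevhtp when the request object is torn down (frame #7).
static int RequestFiniHook(void* /*evhtp_request*/, void* arg) {
  auto* infer_request = static_cast<InferRequest*>(arg);
  // Best-effort cancellation: sets the core request's cancellation flag
  // (frames #0-#4), which a backend can then observe via is_cancelled().
  TRITONSERVER_Error* err =
      TRITONSERVER_InferenceRequestCancel(infer_request->triton_request);
  if (err != nullptr) {
    TRITONSERVER_ErrorDelete(err);  // cancellation is advisory; drop the error
  }
  return 0;
}

The point of this issue is that this hook simply is not reached until libevhtp observes activity on the socket.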

nnshah1 commented 5 months ago

Thanks for the report / request and the detailed analysis. Request cancellation is best effort, and no guarantees are made due to the multiple components (frontend / core / backend) involved. That being said, we can also investigate on our side. From your investigation it wasn't clear to me whether you had found a means of detecting the connection close event before a subsequent write, or whether you were unable to find one. If you have found a way, please do consider submitting a PR.

If the event is only triggered on a write, then this may be an unfortunate asymmetry between streaming and non-streaming, as the backend handles that detail internally and only sends responses to the core / frontend incrementally (streaming) or all at once (non-streaming). It will require some digging to see whether there is a different hook that could be used.

We are in the process of adding a few additional hooks to also enable a cleaner HTTP shutdown sequence (waiting for existing connections to close before shutting down the server). This may be triggered earlier and provide a means to cancel; a rough sketch of the general idea follows below.
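
A generic sketch of that idea, using plain libevent and illustrative names (this is not Triton's implementation): track open connections, and let the same close notification that allows the server to drain connections at shutdown also serve as the point where that connection's in-flight request could be cancelled.

// Generic connection-tracking sketch with illustrative names; not Triton code.
#include <atomic>

#include <event2/event.h>

static std::atomic<int> open_connections{0};
static std::atomic<bool> shutting_down{false};

// Called when a new HTTP connection is accepted.
void OnConnectionOpen() { open_connections.fetch_add(1); }

// Called from the connection's close/EOF notification. This is also a natural
// place to cancel the connection's in-flight request.
void OnConnectionClose(struct event_base* base) {
  if (open_connections.fetch_sub(1) == 1 && shutting_down.load()) {
    event_base_loopexit(base, nullptr);  // last connection gone; stop the loop
  }
}

// Called when shutdown is requested: exit immediately only if idle; otherwise
// wait for OnConnectionClose to observe the last connection closing.
void BeginShutdown(struct event_base* base) {
  shutting_down.store(true);
  if (open_connections.load() == 0) {
    event_base_loopexit(base, nullptr);
  }
}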

@kthui - after you merge your changes for connection close events, we can test whether they catch the scenario above.

kthui commented 5 months ago

Sure, added a ticket for testing the scenario above.

kthui commented 4 months ago

We have done some experiments and find that the asymmetry between streaming and non-streaming depends on whether the model is decoupled or non-decoupled: the cancel event is not triggered for a non-decoupled model even if the streaming API is used.

We will do some more experiments and expand cancellation triggering to non-decoupled models, so that a non-streaming interrupt also triggers cancellation.