nv-morpheus / Morpheus

Morpheus SDK
Apache License 2.0

[BUG]: SID example crashes at Triton inference stage (24.03.01 runtime image) #1639

Closed by pdmack 4 months ago

pdmack commented 4 months ago

Version

24.03.01

Which installation method(s) does this occur on?

Docker, Kubernetes

Describe the bug.

A manual test, a variant of examples/nlp_si_detection/README.md, reveals a core dump in the SID pipeline. Confirmed against both Triton 23.10 and 24.03. In fact, the tritonclient library never forms the request properly.

The validation test, ./scripts/validation/sid/val-sid-all.sh, succeeds. However, that test uses a CSV input file rather than jsonlines.
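
Since the passing validation run reads CSV and the failing run reads jsonlines, one way to check whether the input format matters is to convert the jsonlines file to CSV and rerun the same pipeline. Below is a minimal sketch using only the Python standard library. The function name `jsonlines_to_csv` is hypothetical, the sketch assumes each line is a flat JSON object, and it has not been verified that the Morpheus CSV reader treats the result the same way as the original file.

```python
import csv
import io
import json


def jsonlines_to_csv(jsonl_text: str) -> str:
    """Convert JSON Lines text (one flat JSON object per line) to CSV text.

    Hypothetical helper for this report; nested values are not flattened.
    """
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

    # Collect column names in first-seen order across all records.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record become empty cells
    return out.getvalue()
```

To use it on the file from the reproducer, read /common/data/pcap_dump.jsonlines, pass its text to the function, write the result to a .csv file, and point `from-file --filename` at that file.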

Minimum reproducible example

morpheus --log_level=DEBUG run \
  --num_threads=3 --edge_buffer_size=4 --use_cpp=True \
  --pipeline_batch_size=8196 --model_max_batch_size=32 \
  pipeline-nlp --model_seq_length=256 \
  from-file --filename=/common/data/pcap_dump.jsonlines \
  monitor --description 'FromFile Rate' --smoothing=0.001 \
  deserialize \
  preprocess --vocab_hash_file=data/bert-base-uncased-hash.txt --truncation=True --do_lower_case=True --add_special_tokens=False \
  monitor --description='Preprocessing rate' \
  inf-triton --force_convert_inputs=True --model_name=sid-minibert-onnx --server_url=ai-engine:8000 \
  monitor --description='Inference rate' --smoothing=0.001 --unit inf \
  add-class \
  serialize --exclude '^ts_' \
  to-file --filename=/common/data/output/sid-minibert-onnx-output.jsonlines --overwrite

Relevant log output

 ====Building Segment Complete!====
FromFile Rate[Complete]: 93085 messages [00:00, 125077.06 messaFailed to update context stat: Timer not set correctly. Send time from 1713457558087234741 to 0.ocessing rate: 24588 messages [00:00, 14472.27 messages/s]
E20240418 16:25:58.087323   501 triton_inference.cpp:74] Triton Error while executing 'results->Shape(model_output.name, &output_shape)'. Error: failed to parse the request JSON buffer: The document is empty. at 0
../morpheus/_lib/src/stages/triton_inference.cpp(469)
*** Aborted at 1713457558 (unix time) try "date -d @1713457558" if you are using GNU date ***
W20240418 16:25:58.090701   501 inference_client_stage.cpp:255] Exception while processing message for InferenceClientStage, attempting retry.
Failed to update context stat: Timer not set correctly. Send time from 1713457558091008494 to 0.
E20240418 16:25:58.091076   502 triton_inference.cpp:74] Triton Error while executing 'results->Shape(model_output.name, &output_shape)'. Error: failed to parse the request JSON buffer: The document is empty. at 0
../morpheus/_lib/src/stages/triton_inference.cpp(469)
W20240418 16:25:58.093281   502 inference_client_stage.cpp:255] Exception while processing message for InferenceClientStage, attempting retry.
E20240418 16:25:58.093786   501 triton_inference.cpp:74] Triton Error while executing 'm_client.async_infer( [this, handle](triton::client::InferResult* result) { m_result.reset(result); handle(); }, m_options, m_inputs, m_outputs)'. Error: failed to parse the request JSON buffer: The document is empty. at 0
../morpheus/_lib/src/stages/triton_inference.cpp(113)
E20240418 16:25:58.093787   502 triton_inference.cpp:74] Triton Error while executing 'm_client.async_infer( [this, handle](triton::client::InferResult* result) { m_result.reset(result); handle(); }, m_options, m_inputs, m_outputs)'. Error: failed to parse the request JSON buffer: The document is empty. at 0
../morpheus/_lib/src/stages/triton_inference.cpp(113)
PC: @                0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 485 (TID 0x7fbbd57fe640) from PID 0; stack trace: ***
    @     0x7fbd0f094197 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fbd11c79520 (unknown)
    @     0x7fbd0dc73cc0 Curl_checkheaders
    @     0x7fbd0dc39824 Curl_http_host
    @     0x7fbd0dc3acbb Curl_http
    @     0x7fbd0dc56cdf multi_runsingle
    @     0x7fbd0dc57dc6 curl_multi_perform
    @     0x7fbd0dc28a5c curl_easy_perform
    @     0x7fbcb0a7974c triton::client::InferenceServerHttpClient::Infer()
    @     0x7fbcb09b13c3 morpheus::HttpTritonClient::async_infer()
    @     0x7fbcb09b3a42 (anonymous namespace)::TritonInferOperation::await_suspend()
    @     0x7fbcb09b69fd _ZN8morpheus28TritonInferenceClientSession5inferEPZNS0_5inferEOSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_12TensorObjectESt4lessIS7_ESaISt4pairIKS7_S8_EEEE166_ZN8morpheus28TritonInferenceClientSession5inferEOSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_12TensorObjectESt4lessIS7_ESaISt4pairIKS7_S8_EEE.frame.actor
    @     0x7fbcb0957c02 _ZZN8pybind1112cpp_function10initializeIZN3mrc5pymrc16AsyncioScheduler6resumeENS3_14PyObjectHolderENSt7__n486116coroutine_handleIvEEEUlvE_vJEJEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESN_
    @     0x7fbcb07bc743 pybind11::cpp_function::dispatcher()
    @     0x5576332e85a6 cfunction_call
    @     0x5576332e1a6b _PyObject_MakeTpCall.localalias
    @     0x5576332a1d90 context_run
    @     0x5576332e02a3 cfunction_vectorcall_FASTCALL_KEYWORDS
    @     0x5576332de205 _PyEval_EvalFrameDefault
    @     0x5576332e8a2c _PyFunction_Vectorcall
    @     0x5576332d8c5c _PyEval_EvalFrameDefault
    @     0x5576332e8a2c _PyFunction_Vectorcall
    @     0x5576332d8c5c _PyEval_EvalFrameDefault
    @     0x5576332e8a2c _PyFunction_Vectorcall
    @     0x5576332d8c5c _PyEval_EvalFrameDefault
    @     0x5576332f46d8 method_vectorcall
    @     0x7fbcb07fa286 pybind11::detail::simple_collector<>::call()
    @     0x7fbcb095fe0c mrc::pymrc::AsyncioRunnable<>::run()
    @     0x7fbcb07ab440 mrc::runnable::RunnableWithContext<>::main()
    @     0x7fbcf8ddf13e _ZNSt17_Function_handlerIFvvEZN3mrc8runnable6Runner7enqueueESt10shared_ptrINS2_8IEnginesEEOSt6vectorIS4_INS2_7ContextEESaIS9_EEEUlvE_E9_M_invokeERKSt9_Any_data
    @     0x7fbcf8d028f5 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZNK3mrc6system15ThreadResources11make_threadIN5boost6fibers13packaged_taskIFvvEEEEENS4_6ThreadENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS3_6CpuSetEOT_EUlvE_EEEEE6_M_runEv
    @     0x7fbd0f753e95 execute_native_thread_routine
Segmentation fault (core dumped)

Full env printout

 [Paste the results of print_env.sh here, it will be hidden by default]

Other/Misc.

Reducing the thread count to 1 prevents the crash.
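
The log output above is consistent with this observation: the glog-style lines carry a thread ID after the timestamp, and errors appear under two different IDs. Below is a minimal sketch that groups those lines by thread ID. The regular expression assumes the usual glog prefix layout (level letter, date, time, thread ID, file:line), the function name `group_by_thread` is hypothetical, and the messages in the sample are shortened with `[...]`.

```python
import re
from collections import defaultdict

# Assumed glog prefix layout: Lyyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
GLOG_PREFIX = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def group_by_thread(log_text):
    """Return {thread_id: [(time, level, file, line), ...]} for glog-style lines."""
    by_thread = defaultdict(list)
    for raw in log_text.splitlines():
        m = GLOG_PREFIX.match(raw)
        if m:  # skip stack frames, progress bars, and other non-glog lines
            by_thread[int(m["thread"])].append(
                (m["time"], m["level"], m["file"], int(m["line"]))
            )
    return dict(by_thread)


# Lines taken from the log in this report; messages shortened with [...].
sample = """\
E20240418 16:25:58.087323   501 triton_inference.cpp:74] Triton Error while executing 'results->Shape([...])'. [...]
W20240418 16:25:58.090701   501 inference_client_stage.cpp:255] Exception while processing message for InferenceClientStage, attempting retry.
E20240418 16:25:58.091076   502 triton_inference.cpp:74] Triton Error while executing 'results->Shape([...])'. [...]
W20240418 16:25:58.093281   502 inference_client_stage.cpp:255] Exception while processing message for InferenceClientStage, attempting retry.
E20240418 16:25:58.093786   501 triton_inference.cpp:74] Triton Error while executing 'm_client.async_infer([...])'. [...]
E20240418 16:25:58.093787   502 triton_inference.cpp:74] Triton Error while executing 'm_client.async_infer([...])'. [...]
"""

for thread_id, events in sorted(group_by_thread(sample).items()):
    print(thread_id, events)
```

Grouped this way, threads 501 and 502 both report the `async_infer` error one microsecond apart (16:25:58.093786 and 16:25:58.093787), immediately before the SIGSEGV. That timing is compatible with, though not proof of, a concurrency problem in the inference stage.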