nvidia-holoscan / holoscan-sdk

The AI sensor processing SDK for low latency streaming workflows
Apache License 2.0

Freeze/Crashing with MultithreadScheduler #18

Closed: genevanmeter closed this issue 8 months ago

genevanmeter commented 10 months ago

Environment: Holoscan 0.6, AGX Orin 64 GB (iGPU), JetPack 5.1.1, Docker, Python

We are experiencing a frozen state or crash when switching to the MultiThreadScheduler. In our application this occurs somewhere between 15 minutes and 3 hours in. To reproduce, add the MultiThreadScheduler to a Holoscan or HoloHub example.

One that fails particularly fast is holoscan-sdk/examples/holoviz/python/holoviz_geometry.py. I added:

from holoscan.schedulers import MultiThreadScheduler

scheduler = MultiThreadScheduler(
    app,
    worker_thread_number=4,
    stop_on_deadlock=True,
    stop_on_deadlock_timeout=500,
    name="multithread_scheduler",
)
app.scheduler(scheduler)
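
For reference, this sits in the example's __main__ block between constructing the app and calling app.run(); a rough sketch (the class name below is illustrative, use the Application subclass defined in the stock example):

app = HolovizGeometryApp()   # illustrative name for the example's Application subclass
app.scheduler(scheduler)     # scheduler constructed as in the snippet above
app.run()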

Result: the visualizer stops, but the application keeps running. When attempting to close the application, it tries to stop the scheduler and freezes again.

^C[info] [multi_thread_scheduler.cpp:607] Stopping multithread scheduler
[info] [multi_thread_scheduler.cpp:664] Stopping all async jobs
[info] [multi_thread_scheduler.cpp:147] Dispatcher thread has no more jobs to check
[info] [multi_thread_scheduler.cpp:206] Dispatcher thread has stopped checking jobs
[info] [multi_thread_scheduler.cpp:649] Waiting to join all async threads
[info] [multi_thread_scheduler.cpp:298] Worker Thread [pool name: default_pool, thread uid: 0] exiting.
[info] [multi_thread_scheduler.cpp:348] Event handler thread exiting.
^C^C
[2024-01-09 13:00:48.413][578][holoscan][debug][multi_thread_scheduler.cpp:382] ready_count_: 1, wait_time_count_: 0, wait_event_count_: 0, wait_count_: 2
[2024-01-09 13:00:48.413][578][holoscan][debug][multi_thread_scheduler.cpp:681] timeout: 500, now: 358236160, last_no_stop_ts_:357339680, should_stop: 0
[2024-01-09 13:00:48.413][580][holoscan][debug][entity_executor.cpp:644] [C00027] tick codelet holoviz
[2024-01-09 13:00:48.415][578][holoscan][debug][multi_thread_scheduler.cpp:382] ready_count_: 2, wait_time_count_: 0, wait_event_count_: 0, wait_count_: 1
[2024-01-09 13:00:48.415][578][holoscan][debug][multi_thread_scheduler.cpp:681] timeout: 500, now: 359982272, last_no_stop_ts_:358236160, should_stop: 0
[2024-01-09 13:00:48.415][583][holoscan][debug][entity_executor.cpp:644] [C00021] tick codelet geometry_layer_object_generator
[2024-01-09 13:00:48.416][578][holoscan][debug][multi_thread_scheduler.cpp:382] ready_count_: 1, wait_time_count_: 0, wait_event_count_: 0, wait_count_: 2
[2024-01-09 13:00:48.416][578][holoscan][debug][multi_thread_scheduler.cpp:681] timeout: 500, now: 361540128, last_no_stop_ts_:359982272, should_stop: 0
[2024-01-09 13:00:48.417][578][holoscan][debug][multi_thread_scheduler.cpp:382] ready_count_: 2, wait_time_count_: 0, wait_event_count_: 0, wait_count_: 1
[2024-01-09 13:00:48.417][578][holoscan][debug][multi_thread_scheduler.cpp:681] timeout: 500, now: 362264416, last_no_stop_ts_:361540128, should_stop: 0
[2024-01-09 13:00:48.417][582][holoscan][debug][entity_executor.cpp:644] [C00010] tick codelet replayer
[2024-01-09 13:00:48.418][578][holoscan][debug][multi_thread_scheduler.cpp:382] ready_count_: 1, wait_time_count_: 0, wait_event_count_: 0, wait_count_: 2
[2024-01-09 13:00:48.418][578][holoscan][debug][multi_thread_scheduler.cpp:681] timeout: 500, now: 363636736, last_no_stop_ts_:362264416, should_stop: 0
[2024-01-09 13:00:48.421][578][holoscan][debug][multi_thread_scheduler.cpp:382] ready_count_: 1, wait_time_count_: 0, wait_event_count_: 0, wait_count_: 2
[2024-01-09 13:00:48.421][578][holoscan][debug][multi_thread_scheduler.cpp:681] timeout: 500, now: 366478880, last_no_stop_ts_:363636736, should_stop: 0
[2024-01-09 13:00:48.422][578][holoscan][debug][multi_thread_scheduler.cpp:382] ready_count_: 0, wait_time_count_: 0, wait_event_count_: 0, wait_count_: 3
[2024-01-09 13:00:48.422][578][holoscan][debug][multi_thread_scheduler.cpp:681] timeout: 500, now: 367022752, last_no_stop_ts_:366478880, should_stop: 1
[2024-01-09 13:00:48.422][578][holoscan][debug][multi_thread_scheduler.cpp:708] Onhold trend to stop on deadlock for [0] ms
[2024-01-09 13:00:48.423][578][holoscan][debug][multi_thread_scheduler.cpp:382] ready_count_: 1, wait_time_count_: 0, wait_event_count_: 0, wait_count_: 2
[2024-01-09 13:00:48.423][578][holoscan][debug][multi_thread_scheduler.cpp:681] timeout: 500, now: 368563488, last_no_stop_ts_:366478880, should_stop: 0
[2024-01-09 13:00:48.423][581][holoscan][debug][entity_executor.cpp:644] [C00027] tick codelet holoviz
[2024-01-09 13:00:48.426][578][holoscan][debug][multi_thread_scheduler.cpp:382] ready_count_: 2, wait_time_count_: 0, wait_event_count_: 0, wait_count_: 1
[2024-01-09 13:00:48.426][578][holoscan][debug][multi_thread_scheduler.cpp:681] timeout: 500, now: 371426560, last_no_stop_ts_:368563488, should_stop: 0
[2024-01-09 13:00:48.426][583][holoscan][debug][entity_executor.cpp:644] [C00021] tick codelet geometry_layer_object_generator
[2024-01-09 13:00:48.427][578][holoscan][debug][multi_thread_scheduler.cpp:382] ready_count_: 3, wait_time_count_: 0, wait_event_count_: 0, wait_count_: 0
[2024-01-09 13:00:48.427][578][holoscan][debug][multi_thread_scheduler.cpp:681] timeout: 500, now: 371957024, last_no_stop_ts_:371426560, should_stop: 0
[2024-01-09 13:00:48.427][580][holoscan][debug][entity_executor.cpp:644] [C00010] tick codelet replayer
grlee77 commented 10 months ago

Hi @genevanmeter,

I think we may have already resolved this issue in our v1.0 development branch. I will try running that example with the scheduler modification you suggest and see if I can reproduce the issue locally.

Do you happen to know if this issue is only seen for the Python version of the app? I ask because one fix that was merged for 1.0 was specifically related to a potential deadlock in multi-threaded Python applications due to an acquisition of Python's GIL during tensor object deletion. If you only see the issue for the Python version of the example, that is likely the cause. Unfortunately, I don't have a good workaround for that GIL-related issue for release 0.6.

genevanmeter commented 10 months ago

@grlee77 Thank you. We are eagerly awaiting the fix. Our application only uses Python operators, so we can't compare against C++. Our workaround has been to revert to the greedy scheduler and change/remove operators to get the performance we need.
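
For anyone hitting the same issue, the fallback is simply to attach the default scheduler explicitly (this is also what Holoscan uses when no scheduler is set); a minimal sketch, assuming GreedyScheduler from holoscan.schedulers:

from holoscan.schedulers import GreedyScheduler

# Fallback: single-threaded greedy scheduling (the Holoscan default)
scheduler = GreedyScheduler(app, name="greedy_scheduler")
app.scheduler(scheduler)
app.run()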

grlee77 commented 10 months ago

I confirmed just now that on the release branch for 0.6, holoviz_geometry.py modified to use the multi-thread scheduler deadlocks almost immediately. On the current 1.0 internal dev branch, I left the app running for almost two hours before closing it and there were no issues.

(I did this testing on an x86_64 workstation rather than on AGX hardware, but I don't think that will make a difference in this case.)

genevanmeter commented 10 months ago

Thank you for testing. I would gladly test on our AGX Orin if I could.

grlee77 commented 8 months ago

Just posting here to confirm the fix for this is in release v1.0.3.

The relevant entry from the bug-fix section of the release notes:

4293741     Python application with more than two operators (mixed use of pure Python operator and operator wrapping C++ operator), using MultiThreadScheduler (including distributed app) and sending Python tensor can deadlock at runtime.
tbirdso commented 8 months ago

Thanks for confirming, @grlee77. Closing this issue as resolved in the v1.0.3 release.