Closed by genevanmeter 8 months ago
Hi @genevanmeter,
I think we may have already resolved this issue in our v1.0 development branch. I will try running that example with the scheduler modification you suggest and see if I can reproduce the issue locally.
Do you happen to know if this issue is only seen for the Python version of the app? I ask because one fix that was merged for 1.0 was specifically related to a potential deadlock in multi-threaded Python applications due to an acquisition of Python's GIL during tensor object deletion. If you only see the issue for the Python version of the example, that is likely the cause. Unfortunately, I don't have a good workaround for that GIL-related issue for release 0.6.
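As a rough illustration of how taking the GIL during object deletion can deadlock, here is a minimal pure-Python sketch of a lock-order inversion. It is not Holoscan code: `gil` and `native_mutex` are hypothetical stand-ins (plain `threading.Lock` objects), and the assumption that the other thread holds the GIL while waiting on a native lock is mine, not something confirmed in this thread. Timeouts are used so the sketch reports the stall instead of hanging.

```python
import threading


def simulate_lock_inversion(timeout=0.2):
    """Model two threads that take the same two locks in opposite order."""
    gil = threading.Lock()           # hypothetical stand-in for Python's GIL
    native_mutex = threading.Lock()  # hypothetical stand-in for a lock held in C++ code
    sync = threading.Barrier(2)
    acquired = {}

    def worker(name, held, wanted):
        with held:
            sync.wait()  # both threads now hold their first lock
            got = wanted.acquire(timeout=timeout)
            acquired[name] = got
            sync.wait()  # keep holding until both attempts have finished
            if got:
                wanted.release()

    threads = [
        # Python operator: holds the "GIL", then needs the native lock.
        threading.Thread(target=worker, args=("python_operator", gil, native_mutex)),
        # Tensor deleter: holds the native lock, then needs the "GIL".
        threading.Thread(target=worker, args=("tensor_deleter", native_mutex, gil)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return acquired


if __name__ == "__main__":
    print(simulate_lock_inversion())
```

Both entries in the returned dictionary are `False`: neither thread can get its second lock while the other still holds it. Without the timeouts, the two threads would wait on each other forever.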
@grlee77 Thank you. We are eagerly awaiting the fix. Our application only uses Python operators, so we can't compare against C++. Our workaround has been to revert to the greedy scheduler and change or remove operators to get the performance we need.
I confirmed just now that on the release branch for 0.6, holoviz_geometry.py modified to use the multi-thread scheduler deadlocks almost immediately. On the current 1.0 internal dev branch, I left the app running for almost two hours before closing it and there were no issues.
(I did this testing on an x86_64 workstation rather than on AGX hardware, but I don't think that will make a difference in this case.)
Thank you for testing. I would gladly test on our AGX Orin if I could.
Just posting here to confirm the fix for this is in release v1.0.3.
The relevant entry of the bugs fixed section of the release notes:
4293741 Python application with more than two operators (mixed use of pure Python operator and operator wrapping C++ operator), using MultiThreadScheduler (including distributed app) and sending Python tensor can deadlock at runtime.
Thanks for confirming, @grlee77. Closing this issue as resolved in the v1.0.3 release.
Holoscan 0.6, AGX Orin 64GB (iGPU), JetPack 5.1.1, Docker, Python
Experiencing a frozen state or crash when switching to the MultiThreadScheduler. In our application this occurs somewhere between 15 minutes and 3 hours after startup. To reproduce, add the MultiThreadScheduler to a Holoscan or Holohub example.
One that fails particularly fast is holoscan-sdk/examples/holoviz/python/holoviz_geometry.py. I added:
Result: The visualizer stops but the application is still running. When attempting to close the application, it attempts to stop the scheduler and freezes again.
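For anyone trying to reproduce this, switching an example over to the multi-thread scheduler might look roughly like the sketch below. It assumes the `holoscan.schedulers.MultiThreadScheduler` class and the `Application.scheduler()` method from the Holoscan Python API. The helper name `use_multithread_scheduler` and the thread count are hypothetical, and this is not necessarily the exact change made to holoviz_geometry.py.

```python
def use_multithread_scheduler(app, worker_threads=4):
    """Attach a MultiThreadScheduler to an existing Holoscan application.

    `app` is assumed to be a holoscan.core.Application instance; the default
    thread count is an arbitrary illustrative choice.
    """
    # Imported lazily so the helper can be defined without the SDK installed.
    from holoscan.schedulers import MultiThreadScheduler

    scheduler = MultiThreadScheduler(
        app,
        worker_thread_number=worker_threads,
        stop_on_deadlock=True,
        name="multithread_scheduler",
    )
    # Replace the default greedy scheduler before calling app.run().
    app.scheduler(scheduler)
    return scheduler
```

In an example script, this would be called on the application object just before `app.run()`. Removing the call restores the default greedy scheduler, which is the workaround mentioned in this thread.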