nvidia-holoscan / holohub

Central repository for Holoscan Reference Applications
Apache License 2.0

Getting VK_ERROR_OUT_OF_DEVICE_MEMORY errors when trying to run examples that use Holoviz #439

Closed JTMMoo closed 1 month ago

JTMMoo commented 3 months ago

Hi there,

I am trying to run the example scripts on an AWS EC2 instance. I followed this tutorial [https://github.com/nvidia-holoscan/holohub/tree/main/tutorials/holoscan-playground-on-aws] and chose the NVIDIA GPU-Optimized AMI as suggested. I am also using X11 forwarding through a Windows SSH client.

However, when I run the example scripts, I get the following errors, which suggest that there isn't enough device memory:

From within the Holoscan container:

Input: python3 /opt/nvidia/holoscan/examples/holoviz/python/holoviz_geometry.py

Output:
[info] [fragment.cpp:569] Loading extensions from configs...
[info] [gxf_executor.cpp:249] Creating context
[info] [gxf_executor.cpp:1912] Activating Graph...
[info] [gxf_executor.cpp:1944] Running Graph...
[info] [gxf_executor.cpp:1946] Waiting for completion...
2024-07-23 10:15:53.224 INFO gxf/std/greedy_scheduler.cpp@191: Scheduling 3 entities
[info] [context.cpp:50] _______________
[info] [context.cpp:50] Vulkan Version:
[info] [context.cpp:50] - available: 1.3.204
[info] [context.cpp:50] - requesting: 1.2.0
[info] [context.cpp:50] ______________________
[info] [context.cpp:50] Used Instance Layers :
[info] [context.cpp:50]
[info] [context.cpp:50] Used Instance Extensions :
[info] [context.cpp:50] VK_KHR_surface
[info] [context.cpp:50] VK_KHR_xcb_surface
[info] [context.cpp:50] VK_EXT_debug_utils
[info] [context.cpp:50] VK_KHR_external_memory_capabilities
[info] [context.cpp:50] ____________________
[info] [context.cpp:50] Compatible Devices :
[info] [context.cpp:50] 0: NVIDIA A10G
[info] [context.cpp:50] Physical devices found :
[info] [context.cpp:50] 1
[info] [context.cpp:50] ________________________
[info] [context.cpp:50] Used Device Extensions :
[info] [context.cpp:50] VK_KHR_swapchain
[info] [context.cpp:50] VK_KHR_external_memory
[info] [context.cpp:50] VK_KHR_external_memory_fd
[info] [context.cpp:50] VK_KHR_external_semaphore
[info] [context.cpp:50] VK_KHR_external_semaphore_fd
[info] [context.cpp:50] VK_KHR_push_descriptor
[info] [context.cpp:50] VK_EXT_line_rasterization
[info] [context.cpp:50]
[info] [vulkan_app.cpp:845] Using device 0: NVIDIA A10G (UUID fafb899f6250bdb4d5e5281414a95e48)
[error] [context.cpp:56] /workspace/holoscan-sdk/modules/holoviz/thirdparty/nvpro_core/nvvk/swapchain_vk.cpp(172): Vulkan Error : VK_ERROR_OUT_OF_DEVICE_MEMORY
[error] [context.cpp:56] /workspace/holoscan-sdk/modules/holoviz/thirdparty/nvpro_core/nvvk/swapchain_vk.cpp(172): Vulkan Error : VK_ERROR_OUT_OF_DEVICE_MEMORY
[error] [context.cpp:56] /workspace/holoscan-sdk/modules/holoviz/thirdparty/nvpro_core/nvvk/swapchain_vk.cpp(172): Vulkan Error : VK_ERROR_OUT_OF_DEVICE_MEMORY
[error] [gxf_wrapper.cpp:57] Exception occurred when starting operator: 'holoviz' - Failed to update swap chain.
2024-07-23 10:16:02.610 WARN gxf/std/entity_executor.cpp@495: Failed to start entity [holoviz]
2024-07-23 10:16:02.610 WARN gxf/std/greedy_scheduler.cpp@243: Error while executing entity 21 named 'holoviz': GXF_FAILURE
2024-07-23 10:16:02.610 ERROR gxf/std/entity_executor.cpp@586: Entity [holoviz] must be in Started, Tick Pending, Ticking or Idle stage before stopping. Current state is StartPending
2024-07-23 10:16:02.611 INFO gxf/std/greedy_scheduler.cpp@401: Scheduler finished.
[error] [program.cpp:574] wait failed. Deactivating...
[error] [runtime.cpp:1476] Graph wait failed with error: GXF_FAILURE
[warning] [gxf_executor.cpp:1947] GXF call GxfGraphWait(context) in line 1947 of file /workspace/holoscan-sdk/src/core/executors/gxf/gxf_executor.cpp failed with 'GXF_FAILURE' (1)
[info] [gxf_executor.cpp:1957] Graph execution finished.
[error] [gxf_executor.cpp:1965] Graph execution error: GXF_FAILURE
Traceback (most recent call last):
  File "/opt/nvidia/holoscan/examples/holoviz/python/holoviz_geometry.py", line 326, in <module>
    main(config_count=args.count)
  File "/opt/nvidia/holoscan/examples/holoviz/python/holoviz_geometry.py", line 313, in main
    app.run()
RuntimeError: Failed to update swap chain.
[info] [gxf_executor.cpp:278] Destroying context

Please let me know if there's something I'm missing here...

AndreasHeumann commented 3 months ago

Could you check the NVIDIA driver version? There is a regression with the 545 and 550 drivers; this might be the same issue as https://github.com/nvidia-holoscan/holohub/issues/394.
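The driver check suggested above can be scripted. This is a minimal Python sketch, assuming that the affected drivers are exactly the 545 and 550 branches named in this comment and that comparing the major version number is enough. It reads the version through nvidia-smi's `--query-gpu=driver_version --format=csv,noheader` query; the helper names (`driver_major`, `is_affected`, `query_driver_version`) are hypothetical and not part of Holoscan or holohub.

```python
import subprocess
from typing import Optional

# Driver branches named in this thread (and in holohub issue #394) as showing
# the Vulkan regression. Assumption: matching the major version is enough.
AFFECTED_BRANCHES = {545, 550}


def driver_major(version: str) -> int:
    """Return the major branch of a version string such as '550.54.15'."""
    return int(version.strip().split(".")[0])


def is_affected(version: str) -> bool:
    """True if the driver belongs to one of the branches listed above."""
    return driver_major(version) in AFFECTED_BRANCHES


def query_driver_version() -> Optional[str]:
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    lines = result.stdout.splitlines()
    # nvidia-smi prints one line per GPU; take the first one.
    return lines[0].strip() if lines else None


def main() -> None:
    version = query_driver_version()
    if version is None:
        print("nvidia-smi not found or failed; is the NVIDIA driver loaded?")
    elif is_affected(version):
        print(f"Driver {version} is in an affected branch; try a different branch.")
    else:
        print(f"Driver {version} is not in the 545/550 branches.")


if __name__ == "__main__":
    main()
```

With the 550.54.15 driver reported later in this thread, `is_affected` returns True, so the script would print the "affected branch" message.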

JTMMoo commented 3 months ago

Thanks for the quick response, @AndreasHeumann. Here's what I see when I run nvidia-smi:

[screenshot: nvidia-smi output]

It looks like it's using driver version 550.54.15.

Edit: I've just checked the thread you linked. Could you point me to a guide on how to update the driver?

AndreasHeumann commented 3 months ago

There is an AWS user guide for updating the NVIDIA driver here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html

JTMMoo commented 3 months ago

I updated to the 555 driver, but unfortunately the example script still isn't displaying the video. Any ideas?

[screenshot: log output and the grey Holoviz window]

AndreasHeumann commented 3 months ago

Log file looks good. Do all the examples show this grey screen only?

Could you run nvidia-smi to make sure the driver is loaded correctly?

Also, could you test a simple Vulkan example such as vkcube (install the vulkan-tools package with apt first)? Holoviz uses Vulkan.

JTMMoo commented 3 months ago

Log file looks good. Do all the examples show this grey screen only?

Yes, all the Holoviz examples show this grey screen.

Running nvidia-smi gives:

[screenshot: nvidia-smi output]

As you might have suspected, vkcube isn't working either: the output shows "Selected GPU 0: NVIDIA A10G, type: 2", but the window shows a blank screen.

Running vulkaninfo --summary shows that the GPU is being recognised correctly: [screenshot: vulkaninfo --summary output]

AndreasHeumann commented 3 months ago

I don't know why you are seeing a grey screen for Vulkan apps with the 555 driver. Could you try the 535 driver?

JTMMoo commented 3 months ago

Unfortunately, no luck with the 535 driver. I can't even open a window with vkcube:

[screenshot: vkcube error with the 535 driver]

I get the same error when running the video_replayer.py example:

[screenshot: video_replayer.py error]

Could it be an issue with Xming, the X server on my local Windows PC?
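If the X setup is the suspect, one quick thing to inspect is what `DISPLAY` points to in the shell that launches the example. The following is a hypothetical Python helper, not part of Holoscan; it assumes the usual `[host]:display[.screen]` format and uses OpenSSH's default `X11DisplayOffset` of 10, which is why SSH-forwarded displays usually appear as `localhost:10.0`. It is only a heuristic reading of the variable and does not test whether Vulkan can actually present to that display.

```python
import os
from typing import NamedTuple, Optional


class XDisplay(NamedTuple):
    host: str  # empty string means a display on this machine (Unix socket)
    display: int
    screen: int


def parse_display(value: str) -> XDisplay:
    """Split a DISPLAY value of the form [host]:display[.screen]."""
    host, _, rest = value.rpartition(":")
    number, _, screen = rest.partition(".")
    return XDisplay(host, int(number), int(screen) if screen else 0)


def describe(value: Optional[str]) -> str:
    """Give a rough, heuristic reading of a DISPLAY value."""
    if not value:
        return "DISPLAY is not set; no X server is configured for this shell"
    d = parse_display(value)
    if d.host in ("localhost", "127.0.0.1") and d.display >= 10:
        # OpenSSH's default X11DisplayOffset is 10, so SSH-forwarded
        # displays usually look like localhost:10.0
        return f"{value}: probably an SSH-forwarded display"
    if d.host == "":
        return f"{value}: X server running on this machine"
    return f"{value}: X server on host {d.host}"


if __name__ == "__main__":
    print(describe(os.environ.get("DISPLAY")))
```

Running it both on the EC2 host and inside the Holoscan container would show whether the two see the same display.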

tbirdso commented 1 month ago

Hi @JTMMoo, closing this issue as inactive. From a brief discussion with @AndreasHeumann, this is likely an issue with the X setup and not with the Holoscan SDK. Please reopen the issue if you find more information pointing to the Holoscan SDK as the root cause.