Numerous segfaults/asserts on mesa 23.2 with ANV_VIDEO_DECODE

kkartaltepe commented 1 year ago

Intel hardware has some initial support for vulkan video decoding but this application appears to fail in numerous ways when attempting to run it on linux+mesa. Most of these appear to be bugs in the sample and not

1) Configure the program for ANV's queue families and queue count 2) Attempt to run the sample with a 10s clip of Big Buck Bunny 3) Crash.

Trying to get the sample working it seems the fixes/workarounds required were

1) Only MapMemorywhen using generateColorPatternRgba888 in vk_video_decoder/libs/VkCodecUtils/VulkanVideoUtils.cpp ImageObject::FillImageWithPattern, because vkFillYuv.fillVkImage also attempts map. Mapping twice is invalid by spec and fails on mesa. 1) Fix dependency on undefined behavior in Shell::AcquireBackBuffer where assert(acquireBuf != nullptr); attempts to check if the backbuffer queue was empty. So you get more reasonable crashes. 1) disable non-fifo present modes in vk_video_decoder/libs/VkShell/Shell.cpp Shell::ResizeSwapchain, the code appears to depend on AcquireNextImage blocking so if other modes are present it will overrun the backbuffer queue and hang or crash. (there is a vsync option in the config but this doesnt appear to affect decode presentation). 1) Reduce DPB size checking invk_video_decoder/libs/NvVideoParser/include/VulkanH264Decoder.h dpb_full() since the newly allocated reference frame has state == 0 it is not counted here despite being counted in the dpb's accounting. The next decode step will then assert that the DPB is out of slots and crash when it attempts to allocate another reference frame.

That should cover everything I needed to get the sample decoder running successfully. It should be noted that incorrect handling of ycbcr sampler attachment, synchronization of cmdbufs, image ownership transitions (due to separate video and present queues), and more result in a large amount of validator noise but the frames presented appear more or less correct.

Thanks for writing up the sample as well it was convenient to have something to test the new functionality with.

zlatinski commented 1 year ago

Thank you for your feedback and suggestions for app improvements, Kurt! Would you mind submitting a pull request with all of the changes you'd like to see in the sample?

kkartaltepe commented 1 year ago

Sure im happy to submit fixes for 1,2 which are are correct. But 3,4 seem like they require more understanding of the pipeline that I don't have (why is dbp_full() called when it is, why are states not set at that time, or what is the synchronization supposed to be between present and decode in the non-fifo case). The fixes of disabling non-fifo modes and reducing dpb size to 15 are definitely not correct at least.

zlatinski commented 1 year ago

Hi Kurt, what is the MESA driver that you are using - what version/distro and HW (Intel, AMD)?

About 4.

The next decode step will then assert that the DPB is out of slots and crash when it attempts to allocate another reference frame.

How many DPB slots are allocated for your video content? Also, if the decode queue keeps running and the display does not consume the frames, the decoder will eventually run out of images.

kkartaltepe commented 1 year ago

Hi Kurt, what is the MESA driver that you are using - what version/distro and HW (Intel, AMD)?

ANV_VIDEO_DECODE=1 env var can be used to enable the vulkan video decode queues on intel's anv driver. This is with the as of yet unreleased 23.2, testing was done on master branch from about a week back. The rest of the system was standard arch linux.

How many DPB slots are allocated for your video content?

16, its H264 content.

Also, if the decode queue keeps running and the display does not consume the frames, the decoder will eventually run out of images.

I don't think the synchronization is a fault for this particular issue, since it occurs without presentation running (and presentation is on a 120hz display versus 30hz content). The dpb_full() check is used to reclaim slots during VulkanH264Decoder::dpb_picture_end which is after allocation of slots in VulkanVideoParser::FillDpbH264State, but doesnt appear to take into account the newly allocated slots so it wont free any slots despite the dpb having 16 reference frames allocated. So when VulkanVideoParser::FillDpbH264State gets called and expects to be able to allocate, instead it asserts.

zlatinski commented 1 year ago

Hi Kurt, I have some fixes submitted based on your feedback. I'll have a look at dpb_full() again, but I doubt there is an issue with that. 16 total may not be enough, based on the content - please try 24 - just to see if it fixes your issues. Can you please provide the value of pnvsi->nMinNumDpbSlots from VulkanVideoParser::BeginSequence() after you are running your content?

kkartaltepe commented 1 year ago

This is the reproducing encoding: https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/1080/Big_Buck_Bunny_1080_10s_1MB.mp4

Sorry, but on master it seems the code still crashes:

in vk_video_decoder/libs/VkCodecUtils/pattern.cpp:392, with sampleYPtr pointing to 0x1. This seems to go away if I add check on the return value of MapMemory in vk_video_decoder/libs/VkCodecUtils/pattern.cpp:686
- in Shell::AcquireBackBuffer(), now on Assertion !"Swapchain queue is empty!"'.

With noPresent = true, it seems the code runs to completion without crashing though.

zlatinski commented 1 year ago

in vk_video_decoder/libs/VkCodecUtils/pattern.cpp:392, with sampleYPtr pointing to 0x1. This seems to go away if I add check on the return value of MapMemory in vk_video_decoder/libs/VkCodecUtils/pattern.cpp:686

Sorry, I've missed one YUV conversion mapping location. It is fixed now.

zlatinski commented 1 year ago

in Shell::AcquireBackBuffer(), now on Assertion !"Swapchain queue is empty!"'

I don't have a setup to try the Intel's anv MESA driver. I'll make an enquiry to Intel to see if I can find somebody who can look at these two issue.

airlied commented 1 year ago

@zlatinski I think the image count calcs are wrong.

caps.maxImageCount can legally be 0, so you shouldn't limit things to max if it's 0.

airlied commented 1 year ago

I also think the mailbox present might be wrong, but I'm not sure how best to fix it, I'm a bit rusty here.

It looks like you acquire all the images here in mailbox mode, which is fine, since only one image will be waiting for the display, but the code has no throttling. I'd suggest just using FIFO for vsync mode for now

zlatinski commented 1 year ago

maxImageCount

OK, thanks, Dave. Fixed that.

zlatinski commented 1 year ago

I also think the mailbox present might be wrong, but I'm not sure how best to fix it, I'm a bit rusty here.

It looks like you acquire all the images here in mailbox mode, which is fine, since only one image will be waiting for the display, but the code has no throttling. I'd suggest just using FIFO for vsync mode for now

The code actually waits on the display images to be flipped in Shell::AcquireBackBuffer() after obtaining the next image.

    // wait until acquire and render semaphores are waited/unsignaled
    AssertSuccess(m_ctx.devCtx->WaitForFences(*m_ctx.devCtx, 1, &acquireBuf->m_fence, true, UINT64_MAX));

airlied commented 1 year ago

you are relying on acquireNextImage blocking though once you've allocated 3 images but not presented any of them

I'm not sure what the exact semantics of MAILBOX are here, but we aren't blocking on acquireNextImage at all, so you acquires a bunch of images before ever getting a chance to present them, so run out and hit the assert

zlatinski commented 1 year ago

but we aren't blocking on acquireNextImage at all, so you acquires a bunch of images before ever getting a chance to present them, so run out and hit the assert

Oho, Now I see where the issue is. Thanks, @airlied! I can add code to monitor the queue level on the display swap side and wait on the way in when it becomes close to empty. The alternative would be to increase the number of swapchain images, but I don't think we want to do that for resource usage and presentation latency reasons.

charlie-ht commented 1 year ago

I tried this with my CoffeeLake-H GT2 [UHD Graphics 630] but I am getting various errors, the first one in noPresent mode is a crash in anv on the VkVideoDecoder::CopyOptimalToLinearImage path (getting the validation layers clean first might better explain this one, but it seems an issue in the queue mgmt of the sample app),

│      126  static void                                                                                                                                                                     │
│      127  anv_blorp_batch_init(struct anv_cmd_buffer *cmd_buffer,                                                                                                                         │
│      128                       struct blorp_batch *batch, enum blorp_batch_flags flags)                                                                                                   │
│      129  {                                                                                                                                                                               │
│      130     if (!(cmd_buffer->queue_family->queueFlags & VK_QUEUE_GRAPHICS_BIT)) {                                                                                                       │
│  >   131        assert(cmd_buffer->queue_family->queueFlags & VK_QUEUE_COMPUTE_BIT);                                                                                                      │
│      132        flags |= BLORP_BATCH_USE_COMPUTE;                                                                                                                                         │
│      133     }                                                                                                                                                                            │
│      134                                                                                                                                                                                  │
│      135     blorp_batch_init(&cmd_buffer->device->blorp, batch, cmd_buffer, flags);                                                                                                      │
│      136  }

The copy command was recorded inside a pool created with this queue family,

        queueProperties[1]:
        -------------------
                minImageTransferGranularity = (1,1,1)
                queueCount                  = 1
                queueFlags                  = QUEUE_VIDEO_DECODE_BIT_KHR
                timestampValidBits          = 36
                present support             = true
                VkQueueFamilyGlobalPriorityPropertiesKHR:
                -----------------------------------------
                        priorityCount  = 2
                        priorities: count = 2
                                QUEUE_GLOBAL_PRIORITY_LOW_KHR
                                QUEUE_GLOBAL_PRIORITY_MEDIUM_KHR

                VkQueueFamilyQueryResultStatusProperties2KHR:
                ---------------------------------------------
                        queryResultStatusSupport = true

                VkVideoQueueFamilyProperties2KHR:
                ---------------------------------
                        videoCodecOperations: count = 2
                                VIDEO_CODEC_OPERATION_DECODE_H264_BIT_EXT
                                VIDEO_CODEC_OPERATION_DECODE_H265_BIT_EXT

Note it doesn't advertise transfer support (I thought this might have been an implicit requirement of the decode queue, but was missing it in the spec). Anyway, the anv driver is expecting compute capability on a queue that is not graphics. Should the sample app check for queue requirements better here, and maybe use a separate queue for copies on this hardware?

BattleAxeVR commented 1 year ago

+1 on fixing all the validation errors. There's no way a v1.0 shipped version of anything should have any vulkan warnings, let alone errors. I've fixed a few myself (missing pnext struct pointer in one case), but I actually have to disable the VK validation layer when I enable Vulkan Video decoding in my game engine to make sense of anything.

I still have a way more fundamental problem, namely that I'm using HLSL (via DXC -> SPV compilation) for all my ray tracing shaders and I still can't get the immutable 420 / ycbcr decoding sampler to actually decode it to RGB inline. All I see is red = one channel. Instead I have to use a single GLSL fullscreen shader to decode it to an extra copy surface then use its texture from my RTX (HLSL-based) shaders. That extra copy is irritating and unfortunate, although OTOH computing the mips are actually useful (I'm using vulkan video to decode 120 FPS HDR torch flame videos from Embergen, for my RPG and using those to light up the scene with full GI / multiple bounces etc). I'd love to know if anyone else out there has managed to compile an HLSL shader that can decode 420 video properly from Vulkan Video sources. Every one I've come across, including this sample, is GLSL -> glsllangvalidator.exe ----> SPV pipeline.

charlie-ht commented 1 year ago

+1 on fixing all the validation errors.

Since I'm stuck on something else related to the samples app, I made a start on this here: https://github.com/nvpro-samples/vk_video_samples/pull/48

WIP.

lolzballs commented 1 year ago

I can also reproduce Assertion !"Swapchain queue is empty!"' on both radv and amdvlk. Seems like the same MAILBOX present mode issues that is seen on Intel.

zlatinski commented 6 months ago

Fixes for the display swapchain MAILBOX issue from someone who has the HW setup are welcome.

zlatinski commented 6 months ago

Can you please try the display again at 1d9eb74e3d4f03ed56d0d9a0d750d1036d8c63fc?

nvpro-samples / vk_video_samples

Numerous segfaults/asserts on mesa 23.2 with ANV_VIDEO_DECODE #41