hrydgard / ppsspp

A PSP emulator for Android, Windows, Mac and Linux, written in C++. Want to contribute? Join us on Discord at https://discord.gg/5NJB6dD or just send pull requests / issues. For discussion use the forums at forums.ppsspp.org.
https://www.ppsspp.org

Runs out of binding table space on Intel+Linux #13010

Closed: gfxstrand closed this issue 4 years ago

gfxstrand commented 4 years ago

What happens?

After about an hour of gameplay (playing Crisis Core: Final Fantasy VII, not that it matters), I hit an assert inside the driver. The assert looks like you've run out of binding tables. They're a finite resource but you've got 1 GiB of them so it's pretty hard to run out. Likely, it's a command buffer recycling bug where you're leaking command buffers.

What should happen?

I should be able to play for hours.

What hardware, operating system, and PPSSPP version? On desktop, GPU matters for graphical issues.

OS: Linux (Fedora 32)
Hardware: Intel HD Graphics 5500

unknownbrackets commented 4 years ago

Very interesting. I wonder if it's related to #10850. Another thing the validation layer must be missing?

For better or worse, all our command buffers are VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT. We then just rely on the automatic reset from vkBeginCommandBuffer, though we also use VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT.

VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT allows any command buffer allocated from a pool to be individually reset to the initial state; either by calling vkResetCommandBuffer, or via the implicit reset when calling vkBeginCommandBuffer. If this flag is not set on a pool, then vkResetCommandBuffer must not be called for any command buffer allocated from that pool.

We don't actually ever call vkResetCommandBuffer() or vkResetCommandPool(), though.

On startup, we simply allocate two command buffers for each in-flight frame (three by default) and reuse them constantly. So we should only ever have at most six, and they should be constantly reset by vkBeginCommandBuffer() every single frame (two at a time, each frame).

Edit: Sometimes we delete those initial six and create a new six, primarily on resize events from the platform glue. We pretty much reset everything in those cases (and that's where the leak in #10850 seems to be, possibly in PPSSPP or in the driver).
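For illustration, here is a minimal sketch of that setup and per-frame reuse pattern. The function and variable names are illustrative, not PPSSPP's actual code; a single pool is used for simplicity, and device/queue setup and error handling are omitted.

```cpp
#include <vulkan/vulkan.h>

// Sketch only: mirrors the pattern described above, not the real VulkanContext code.
void SetupAndReuseCommandBuffers(VkDevice device, uint32_t graphicsQueueFamily) {
    // Pool allows individual reset, so vkBeginCommandBuffer can implicitly reset.
    VkCommandPoolCreateInfo poolInfo{ VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO };
    poolInfo.flags = VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT;
    poolInfo.queueFamilyIndex = graphicsQueueFamily;
    VkCommandPool cmdPool;
    vkCreateCommandPool(device, &poolInfo, nullptr, &cmdPool);

    // Allocated once at startup: 2 buffers (init + main) per in-flight frame, 3 frames.
    VkCommandBuffer cmdBufs[3][2];
    VkCommandBufferAllocateInfo allocInfo{ VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO };
    allocInfo.commandPool = cmdPool;
    allocInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
    allocInfo.commandBufferCount = 2;
    for (int i = 0; i < 3; i++)
        vkAllocateCommandBuffers(device, &allocInfo, cmdBufs[i]);

    // Every frame: vkBeginCommandBuffer implicitly resets the buffer back to the
    // initial state; vkResetCommandBuffer / vkResetCommandPool are never called.
    int frameIndex = 0;  // cycles 0..2 in the real frame loop
    VkCommandBufferBeginInfo beginInfo{ VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
    beginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
    vkBeginCommandBuffer(cmdBufs[frameIndex][0], &beginInfo);  // "init" commands
    vkBeginCommandBuffer(cmdBufs[frameIndex][1], &beginInfo);  // "main" commands
}
```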

Is there another way we could be running out of binding tables?

-[Unknown]

unknownbrackets commented 4 years ago

Maybe we should at least call vkResetCommandBuffer() when we get to Submit() and there are no init commands...

It shouldn't cause OOM, so this issue is really separate, but I did notice that if there's a texture upload one frame, and then minutes go by without a new texture upload (or several other things), we'd presumably keep that init buffer around as invalid. As soon as another texture upload happens, we'd begin (reset) it, though.

At most, this could cause 3 command buffers at once (in the invalid state) to hang around referencing old data. It'd never grow past that, and it'd potentially shrink.
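Roughly, the idea would look something like this (a sketch only; FrameData, hasInitCommands, and initCmd are hypothetical names, not the actual structures):

```cpp
// Hypothetical per-frame bookkeeping, for illustration only.
struct FrameData {
    VkCommandBuffer initCmd;
    VkCommandBuffer mainCmd;
    bool hasInitCommands;
};

void Submit(FrameData &frame) {
    if (!frame.hasInitCommands) {
        // Legal because the pool was created with
        // VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT. Resetting releases
        // whatever resources the stale recording still references instead of
        // leaving the buffer in the invalid state until the next texture upload.
        vkResetCommandBuffer(frame.initCmd, 0);
    }
    // ... submit frame.mainCmd (and frame.initCmd if it has commands) as usual ...
}
```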

-[Unknown]

gfxstrand commented 4 years ago

If you really are using only 3 command buffers and not leaking, then this may be a driver bug. That said, we've not seen anything like this reported in Mesa.

> For better or worse, all our command buffers are VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT.

That's totally fine. I wouldn't worry about that at all.

> We then just rely on the automatic reset from vkBeginCommandBuffer, though we also use VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT.

That should work fine. I did a reasonably thorough audit of our command buffer reset code and I really don't see a way that this could be happening unless something somewhere is genuinely leaking whole command buffers or pools.

unknownbrackets commented 4 years ago

Hm, I mean, we only allocate those once on init: https://github.com/hrydgard/ppsspp/search?q=vkAllocateCommandBuffers&unscoped_q=vkAllocateCommandBuffers

And that happens when this is called: https://github.com/hrydgard/ppsspp/search?q=T3DCreateVulkanContext&unscoped_q=T3DCreateVulkanContext

And that comes from here: https://github.com/hrydgard/ppsspp/blob/141ad162a244dae525a2a580fcbcc2fc6451ace8/SDL/SDLMain.cpp#L617

Which is just in main() and only happens once. So we really shouldn't have more than those initial ones. I was mistaken in thinking that we'd recreate on resize - at least on SDL, we don't do that.

I'd be more concerned about descriptors, texture memory allocations, shader modules, etc., but it's hard to see a path where we'd leak command buffers or pools specifically.

-[Unknown]

gfxstrand commented 4 years ago

Crazy... Well, I'll have to look into this more on the driver side. I can probably set a breakpoint for when allocation hits a certain threshold easily enough and see if I can figure out why it's leaking.

gfxstrand commented 4 years ago

I've filed a Mesa bug as well:

https://gitlab.freedesktop.org/mesa/mesa/-/issues/3100

Best to have it tracked in both places until we figure out where the actual bug is.

hrydgard commented 4 years ago

I really believe our code is correct here, for the reasons Unknown has detailed.

gfxstrand commented 4 years ago

I'm fairly convinced our code is correct too. Sadly, one of us is wrong. :-(

hrydgard commented 4 years ago

If it's in any way relevant: we have an "init" and a "main" cmdbuffer in each frame. The init cmdbuffer is recorded from the emulator thread, while the main one gets recorded by the "graphics" thread; the graphics thread then submits them together.

To turn off having a separate thread, change the line

if (vulkan_->GetPhysicalDeviceProperties().properties.vendorID == VULKAN_VENDOR_AMD) {

to

if (true)

in ext/native/thin3d/VulkanRenderManager.cpp.

To turn off submitting them together, set the hidden ini setting GfxDebugSplitSubmit to true in memstick/PSP/SYSTEM/ppsspp.ini.

I don't think we do anything wrong with the threading, but it might be interesting to toggle these just to see if anything changes.
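For clarity, "submitting them together" means both command buffers go into a single vkQueueSubmit. A minimal sketch (illustrative names and signature, not the actual VulkanRenderManager code):

```cpp
// Sketch: one vkQueueSubmit carrying both command buffers for the frame.
void SubmitFrame(VkQueue graphicsQueue, VkCommandBuffer initCmd,
                 VkCommandBuffer mainCmd, VkFence frameFence) {
    // initCmd was recorded on the emulator thread, mainCmd on the graphics thread.
    VkCommandBuffer cmdBufs[] = { initCmd, mainCmd };
    VkSubmitInfo submit{ VK_STRUCTURE_TYPE_SUBMIT_INFO };
    submit.commandBufferCount = 2;
    submit.pCommandBuffers = cmdBufs;
    vkQueueSubmit(graphicsQueue, 1, &submit, frameFence);
    // With GfxDebugSplitSubmit = true in ppsspp.ini, the two buffers would
    // instead be handed to the queue in separate submits.
}
```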

gfxstrand commented 4 years ago

I found the Mesa bug. It turns out it's a fairly recent regression, which is why it hasn't been reported yet. How we make it through a full Vulkan CTS run with it, though, is beyond me. In any case, I'm closing this one.

hrydgard commented 4 years ago

Good to hear that you root-caused it! Would be interesting to know what change in the driver caused the problem.

hrydgard commented 4 years ago

Think I found it :)

https://gitlab.freedesktop.org/jekstrand/mesa/-/commit/dcdf5989530f0c5bcc8e7d7390d36939daaf0a6c

gfxstrand commented 4 years ago

Yup, that's the one!