libcg / grvk

Vulkan-based Mantle API implementation
https://en.wikipedia.org/wiki/Mantle_(API)
zlib License

Descriptor sets re-implementation #21

Closed Cherser-s closed 3 years ago

Cherser-s commented 3 years ago

I've re-implemented descriptor sets logic to properly support nested descriptor sets and dynamic memory binding.

To properly support nested descriptor sets I've added a big global descriptor set with 10k slots for each resource type, and made GrDescriptorSet a buffer of indices or device pointers to other GrDescriptorSets. This makes it easy to bind virtual descriptor sets at specified offsets without calling unnecessary vkUpdateDescriptorSets on every GrDescriptorSet update.
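For reference, a minimal sketch (not this PR's actual code) of what such a large "global" set layout can look like with descriptor indexing (VK_EXT_descriptor_indexing / Vulkan 1.2). The 10k count comes from the description above; the binding numbers, descriptor types, and helper name are illustrative assumptions:

#include <vulkan/vulkan.h>

// Sketch: one big "bindless" set layout with 10k update-after-bind, partially-bound
// slots per resource type; GrDescriptorSet would then store plain indices into these arrays.
#define GLOBAL_SLOT_COUNT 10000

VkDescriptorSetLayout createGlobalLayout(VkDevice device)
{
    VkDescriptorBindingFlags bindingFlags[2] = {
        VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT | VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT,
        VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT | VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT,
    };
    VkDescriptorSetLayoutBinding bindings[2] = {
        { .binding = 0, .descriptorType = VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE,
          .descriptorCount = GLOBAL_SLOT_COUNT, .stageFlags = VK_SHADER_STAGE_ALL },
        { .binding = 1, .descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER,
          .descriptorCount = GLOBAL_SLOT_COUNT, .stageFlags = VK_SHADER_STAGE_ALL },
    };
    VkDescriptorSetLayoutBindingFlagsCreateInfo flagsInfo = {
        .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO,
        .bindingCount = 2,
        .pBindingFlags = bindingFlags,
    };
    VkDescriptorSetLayoutCreateInfo layoutInfo = {
        .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
        .pNext = &flagsInfo,
        .flags = VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT,
        .bindingCount = 2,
        .pBindings = bindings,
    };
    VkDescriptorSetLayout layout = VK_NULL_HANDLE;
    vkCreateDescriptorSetLayout(device, &layoutInfo, NULL, &layout);
    return layout;
}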

I've also moved the amdilc shader translation into the grCreate*Pipeline functions, so that resource indices (or resource loads) can be properly assigned to the specified descriptor slots.

libcg commented 3 years ago

Nested descriptor sets and dynamic memory views are already implemented... I'm not sure how this would help.

Cherser-s commented 3 years ago

The problem with the current approach is that it doesn't seem possible to update real descriptors that are already bound to a VkCommandBuffer by updating a GrDescriptorSet. So if a GrDescriptorSet gets updated after the command buffer has been recorded, the VkDescriptorSets inside will still be the same as before (or during) GR_COMMAND_BUFFER recording. This can be solved, but it requires holding references to all the relevant VkDescriptorSets in GrDescriptorSet, which in turn can cause significant overhead on update, since many VkDescriptorSets may be created per single GrDescriptorSet.

My approach allows updating bound descriptor sets by updating the GrDescriptorSet, without creating a ton of VkDescriptorSets per bound pipeline in each command buffer and updating them all afterwards on demand.

libcg commented 3 years ago

Ok, that makes sense. I will take a look at it after things settle down. Thanks.

Cherser-s commented 3 years ago

Oh, and I forgot that the previous approach would also require storing and updating nested descriptor sets, which complicates the logic too much and adds overhead to GrDescriptorSet updates.

That said, I haven't implemented atomic descriptor "suballocation" from the global VkDescriptorSet yet. Also, each "virtual descriptor set" (which is a buffer) uses a single memory allocation per buffer, which can cause problems if the client doesn't treat GR_DESCRIPTOR_SETs as a huge descriptor pool.

Cherser-s commented 3 years ago

Hmmm, interesting: Star Swarm doesn't seem to set all descriptor resource indices in GrDescriptorSetMapping correctly, since the ILC compiler can't find the right bindings for them.

UPD: sorry, that was because I forgot to check slotType. I fixed the shaders as well: there is an inconsistency between the texelTypeId of UAVs and of basic resources, which can cause errors if handled incorrectly, so I've added sampledTypeId to avoid this. Rendering is still weird though, and it crashes the GPU for me after several seconds. At least I fixed the validation layer errors, except for the invalid buffer usage flags :(.

Also, why does Star Swarm seem to attach memory with an undefined format?

libcg commented 3 years ago

Storage buffers don't need a format, nor a VkBufferView.

Cherser-s commented 3 years ago

Yeah, storage buffers don't, but VkBufferView seems to need one, since the validation layer shows a ton of errors. I create a VkBufferView for every bound memory descriptor, since I can't know whether the bound buffer will be used only as a storage buffer or also as a samplerBuffer.
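For context, a VkBufferView always carries a concrete texel format, and the underlying buffer must have been created with a texel-buffer usage bit, which is where the "invalid buffer usage flags" errors come from. A minimal sketch, with the format and helper name chosen purely for illustration:

// Sketch: creating a texel buffer view; the format must not be VK_FORMAT_UNDEFINED,
// and the buffer needs VK_BUFFER_USAGE_UNIFORM_TEXEL_BUFFER_BIT or
// VK_BUFFER_USAGE_STORAGE_TEXEL_BUFFER_BIT set at creation time.
VkBufferView createTexelView(VkDevice device, VkBuffer buffer)
{
    VkBufferViewCreateInfo viewInfo = {
        .sType = VK_STRUCTURE_TYPE_BUFFER_VIEW_CREATE_INFO,
        .buffer = buffer,
        .format = VK_FORMAT_R32G32B32A32_SFLOAT, // illustrative, not what grvk picks
        .offset = 0,
        .range = VK_WHOLE_SIZE,
    };
    VkBufferView view = VK_NULL_HANDLE;
    vkCreateBufferView(device, &viewInfo, NULL, &view);
    return view;
}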

libcg commented 3 years ago

I create VkBufferViews as late as possible when they are actually needed.

Cherser-s commented 3 years ago

I've added some hacks to avoid creating VkBufferViews in case of invalid usage flags or an undefined format. That helped get rid of the validation errors, though I think I screwed up storage resource fetching somewhere in the shaders, since some ships render incorrectly. I wonder how I can test structured SRV and UAV resources if CodeXL produces shaders that can't be translated correctly due to the resource index offsets that AMD's compiler adds.

I also made the host mapping of GrDescriptorSet's VkDeviceMemory a one-time call, which increased my fps from 2 to around 4 or 5.
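A minimal sketch of that one-time (persistent) mapping, assuming the descriptor buffer lives in host-visible memory; the helper name is illustrative:

// Sketch: map the descriptor buffer's memory once and keep writing through the same
// pointer, instead of re-mapping on every GrDescriptorSet update.
void* mapDescriptorMemoryOnce(VkDevice device, VkDeviceMemory memory)
{
    void* hostPtr = NULL;
    vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &hostPtr);
    // No vkUnmapMemory() until the memory is freed; all descriptor index writes
    // go through hostPtr from now on.
    return hostPtr;
}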

Cherser-s commented 3 years ago

Oh, I realized that I forgot to fill in slots in the global descriptor table, which probably caused descriptor aliasing after some time had passed, which obviously caused crashes :rofl:. Now it works fine at least. I should also remove the unnecessary storage image binding so that the validation layer doesn't show errors.

Can you please test the code as well?

libcg commented 3 years ago

Seems to work on my side without validation layers; I'm seeing ~6 fps on the Extreme preset, up from 4.

Cherser-s commented 3 years ago

Wow, that's still extremely low though, even though I only update descriptor sets in grEndDescriptorSetUpdate and map GrDescriptorSet's memory to host only once. Gotta run some profiles. It's unfortunate that bufferDeviceAddressCaptureReplay isn't implemented in RADV, so I can't view captures in RenderDoc.

UPD: I ran a little profiling; it's probably not a CPU-related problem. Descriptor set updates still take a LOT, but the fps is mostly low because of the GPU: something isn't right with the shaders at all, as the GPU runs at full capacity during the benchmark.

Cherser-s commented 3 years ago

Also, take note that descriptor sets can't be allocated correctly on multiple threads yet, because the allocator code isn't thread-safe and needs atomics.
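A minimal sketch (an assumption about a possible fix, not code from this PR) of how the slot allocation could be made thread-safe with C11 atomics:

#include <stdatomic.h>
#include <stdint.h>

// Sketch: lock-free bump allocator for global descriptor slot indices. fetch_add is
// atomic, so concurrent threads get disjoint index ranges. Freeing isn't handled here;
// a real allocator would also need a free list.
static atomic_uint_fast32_t g_nextSlot;

uint32_t allocGlobalSlots(uint32_t count)
{
    return (uint32_t)atomic_fetch_add(&g_nextSlot, count);
}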

Cherser-s commented 3 years ago

I've also found that Star Swarm uses the memory heap performance ratings to select heaps for memory allocation (instead of flags like usual). Replacing the ratings with this code:

.gpuReadPerfRating = (flags & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT) != 0 ? 1.0f : 0.3f,
.gpuWritePerfRating = (flags & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT) != 0 ?
        (1.0f - ((flags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) != 0 ? 0.2f : 0.0f)) : 0.3f, // TODO
.cpuReadPerfRating = ((flags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) != 0 ? 0.65f : 0.0f)
        + ((flags & VK_MEMORY_PROPERTY_HOST_CACHED_BIT) != 0 ? 0.35f : 0.0f),
.cpuWritePerfRating = ((flags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) != 0 ? 0.65f : 0.0f)
        + ((flags & VK_MEMORY_PROPERTY_HOST_COHERENT_BIT) != 0 ? 0.35f : 0.0f)
        - ((flags & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT) != 0 ? 0.15f : 0.0f),

gave me around 2-3 extra fps.

The benchmark still allocates a ton of buffers though, but they only require host-visible memory, so it doesn't matter for them. Other resources now get allocated from the device-local type.

libcg commented 3 years ago

Thanks for the heads up, I pushed a partial fix on master.

Cherser-s commented 3 years ago

I wonder why my descriptor set implementation drops fps by 2.5-3x. It probably doesn't come from push descriptor sets, as using basic VkDescriptorSets for binding dynamic resources doesn't change the fps. Unfortunately I can't even run profiles, as RGP doesn't support my GPU series.

libcg commented 3 years ago

I have plans to use dynamic storage buffers; it should boost perf quite a bit because it would eliminate a ton of allocations and descriptor writes for consecutive draw calls.
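For reference, a sketch of the dynamic-offset mechanism this refers to, under the assumption that a single VK_DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC descriptor is written once and each draw only supplies a new byte offset; the helper name and offset computation are illustrative, not the planned implementation:

// Sketch: rebind the same dynamic storage buffer descriptor with a per-draw offset,
// so consecutive draw calls need no new allocations or descriptor writes.
void bindDrawChunk(VkCommandBuffer cmdBuffer, VkPipelineLayout pipelineLayout,
                   VkDescriptorSet dynamicSet, uint32_t drawIndex, uint32_t alignedChunkSize)
{
    uint32_t dynamicOffset = drawIndex * alignedChunkSize; // hypothetical per-draw offset
    vkCmdBindDescriptorSets(cmdBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineLayout,
                            0, 1, &dynamicSet, 1, &dynamicOffset);
}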

Cherser-s commented 3 years ago

I think there is no problem with allocations (yet; that can easily be solved with an allocator). In my case the problem is rather that the shaders load resources slowly, since there is an index calculation for every vertex or fragment and a subsequent resource load with that index. I don't know for sure what is really causing the problem though, but I doubt that OpConvertUToPtr is that slow.

Take note that a dynamic storage buffer implementation should still support nested descriptor set updates after the command buffer has been recorded.

Cherser-s commented 3 years ago

Damn, those buffer loads for nested descriptor set emulation are extremely slow: they make draw calls 1.5-3x slower, despite reducing CPU overhead significantly (70-80 fps on my branch, compared to the previous 30-50 fps, on low settings with a Radeon RX 580).

Cherser-s commented 3 years ago

Ok, it seems I found out why my implementation is so horribly slow (at least on my machine). It's because the driver doesn't produce code similar to native descriptor set handling.

Natively, the ISA code for descriptor sets looks roughly like this:

  s_load_dwordx2  s[0:1], s[2:3], 0x08                      // 00000000: C0400308
  s_waitcnt     lgkmcnt(0)                                  // 00000004: BF8C007F
  s_load_dwordx4  s[4:7], s[0:1], 0x00                      // 00000008: C0820100

Here, the descriptor set pointer is loaded into SGPRs, and the buffer descriptor is then loaded from there. With my implementation, this code gets generated instead:

  buffer_load_dwordx2  v[1:2], v[1:2], s[8:11], 0 addr64    // 00000020: E0348000 80020101
  s_waitcnt     vmcnt(0)                                    // 00000034: BF8C0F70
  v_readfirstlane_b32  s0, v1                               // 00000038: 7E000501
  s_mulk_i32    s0, 0x0030                                  // 0000003C: B8000030
  s_mov_b32     s2, s6                                      // 00000040: BE820306
  s_mov_b32     s3, s1                                      // 00000044: BE830301
  s_nop         0x0000                                      // 00000048: BF800000
  s_load_dwordx4  s[0:3], s[2:3], s0                        // 0000004C: C0800200

This sequence is probably slower than the former due to the buffer_load_dwordx2 instead of s_load_dwordx2, as well as the increased instruction count.

So it's pretty much impossible to implement such a thing efficiently here, unless an extension for nested descriptor support gets created.

It's still possible to update descriptors after binding though, but it requires updating all allocated descriptor sets on each submit (or tracking whether a descriptor set gets updated recursively). I've implemented mutable, host-only descriptor sets as GrDescriptorSet to save on copying descriptors (in b056863f9c8ea81b5896b379f78b063fb3258cf6).
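For context, a sketch of a host-only descriptor pool as exposed by VK_VALVE_mutable_descriptor_type, the extension behind "mutable host only descriptor sets"; the counts and helper name are illustrative assumptions, not the commit's exact code:

// Sketch: host-only pool of mutable descriptors. Descriptors allocated from it live in
// host memory and can be updated cheaply without GPU-visible descriptor writes.
// Set layouts used with it would additionally chain a VkMutableDescriptorTypeCreateInfoVALVE
// listing the descriptor types each mutable binding may hold.
VkDescriptorPool createHostOnlyPool(VkDevice device)
{
    VkDescriptorPoolSize poolSize = {
        .type = VK_DESCRIPTOR_TYPE_MUTABLE_VALVE,
        .descriptorCount = 1024, // illustrative
    };
    VkDescriptorPoolCreateInfo poolInfo = {
        .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO,
        .flags = VK_DESCRIPTOR_POOL_CREATE_HOST_ONLY_BIT_VALVE,
        .maxSets = 64, // illustrative
        .poolSizeCount = 1,
        .pPoolSizes = &poolSize,
    };
    VkDescriptorPool pool = VK_NULL_HANDLE;
    vkCreateDescriptorPool(device, &poolInfo, NULL, &pool);
    return pool;
}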