jgbit / vuda

VUDA is a header-only library based on Vulkan that provides a CUDA Runtime API interface for writing GPU-accelerated applications.
MIT License

memoryType Cached but not Coherent not supported? #14

Closed: astrotuna201 closed this issue 4 years ago

astrotuna201 commented 4 years ago

On MoltenVK under macOS (Vulkan Instance Version: 1.2.131), Vuda's types.h currently defines

eCachedProperties = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT,

The samples run fine on a Linux machine with an NVIDIA Quadro RTX 6000, which provides a suitable memory type:

VkPhysicalDeviceMemoryProperties:
=================================
memoryHeaps: count = 2
    memoryHeaps[0]:
        size   = 25769803776 (0x600000000) (24.00 GiB)
        budget = 25395724288
        usage  = 0
        flags:
            MEMORY_HEAP_DEVICE_LOCAL_BIT
    memoryHeaps[1]:
        size   = 151053456384 (0x232b7cd400) (140.68 GiB)
        budget = 151053456384
        usage  = 0
        flags:
            None
memoryTypes: count = 11

[...]

    memoryTypes[10]:
        heapIndex     = 1
        propertyFlags = 0x000e:
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_COHERENT_BIT
            MEMORY_PROPERTY_HOST_CACHED_BIT
        usable for:
            IMAGE_TILING_OPTIMAL: None
            IMAGE_TILING_LINEAR: None

However, on the two devices I can test the sample code with, no such memory type is reported. These are the memoryTypes for the AMD Radeon R9 M370X (the AMD Vega II looks similar):

VkPhysicalDeviceMemoryProperties:
=================================
memoryHeaps: count = 2
    memoryHeaps[0]:
        size   = 2147483648 (0x80000000) (2.00 GiB)
        budget = 2147483648
        usage  = 0
        flags:
            MEMORY_HEAP_DEVICE_LOCAL_BIT
    memoryHeaps[1]:
        size   = 17179869184 (0x400000000) (16.00 GiB)
        budget = 47652864
        usage  = 5152768
        flags:
            None
memoryTypes: count = 3
    memoryTypes[0]:
        heapIndex     = 0
        propertyFlags = 0x0001:
            MEMORY_PROPERTY_DEVICE_LOCAL_BIT
        usable for:
            IMAGE_TILING_OPTIMAL: color images, FORMAT_D16_UNORM, FORMAT_D32_SFLOAT, FORMAT_S8_UINT, FORMAT_D24_UNORM_S8_UINT, FORMAT_D32_SFLOAT_S8_UINT
            IMAGE_TILING_LINEAR: None
    memoryTypes[1]:
        heapIndex     = 1
        propertyFlags = 0x0006:
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_COHERENT_BIT
        usable for:
            IMAGE_TILING_OPTIMAL: None
            IMAGE_TILING_LINEAR: None
    memoryTypes[2]:
        heapIndex     = 0
        propertyFlags = 0x000b:
            MEMORY_PROPERTY_DEVICE_LOCAL_BIT
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_CACHED_BIT
        usable for:
            IMAGE_TILING_OPTIMAL: color images
            IMAGE_TILING_LINEAR: None

i.e., I can have CACHED, or COHERENT but not both. However, this is required by Vuda in inc/state/memoryallocator.h:

                // memory properties -> memory types
                m_memoryAllocatorTypes{
                {memoryPropertiesFlags::eDeviceProperties, findMemoryType_Device(physDevice, m_device)},
                {memoryPropertiesFlags::eHostProperties, findMemoryType_Host(physDevice, m_device)},
                {memoryPropertiesFlags::eCachedProperties, findMemoryType_Cached(physDevice, m_device)}

                },

                //
                // memory properties -> unique index
                m_memoryIndexToPureIndex{
                {memoryPropertiesFlags::eDeviceProperties, 0},
                {memoryPropertiesFlags::eHostProperties, 1},
                {memoryPropertiesFlags::eCachedProperties, 2}

by referencing eCachedProperties. The result is a thrown std::runtime_error("vuda: failed to find suitable memory type !"), because no memory type fulfils the requirement.
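
The mismatch can be checked by hand from the propertyFlags values in the dumps above. The following is a minimal sketch, not vuda's actual lookup: findMemoryTypeIndex and the constant names are hypothetical, and the bit values are the ones the Vulkan headers assign to VkMemoryPropertyFlagBits (they match the hex values printed by vulkaninfo).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Bit values of VkMemoryPropertyFlagBits, copied here so the sketch needs no Vulkan headers.
constexpr std::uint32_t kDeviceLocal  = 0x1;
constexpr std::uint32_t kHostVisible  = 0x2;
constexpr std::uint32_t kHostCoherent = 0x4;
constexpr std::uint32_t kHostCached   = 0x8;

// Hypothetical helper (not vuda's code): index of the first memory type whose
// propertyFlags contain every requested bit, or std::nullopt if none does.
std::optional<std::size_t> findMemoryTypeIndex(
    const std::vector<std::uint32_t>& typeFlags, std::uint32_t required)
{
    assert(required != 0 && "at least one property bit must be requested");
    for (std::size_t i = 0; i < typeFlags.size(); ++i)
        if ((typeFlags[i] & required) == required)
            return i;
    return std::nullopt;
}

// propertyFlags reported above for the AMD Radeon R9 M370X under MoltenVK.
const std::vector<std::uint32_t> kMoltenVkAmdTypes = {0x0001, 0x0006, 0x000b};
```

With these three types, a request for kHostVisible | kHostCoherent | kHostCached finds nothing, while kHostVisible | kHostCached matches memoryTypes[2] and kHostVisible | kHostCoherent matches memoryTypes[1].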

The bandwidth sample code compiles if I re-define

// eCachedProperties = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT,
eCachedProperties = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT,

but then the output is

2020-03-22 17:24:55.420954+0100 vuda_julia[16767:1233491] Metal API Validation Enabled
0x1002bedc0 - validation layer enabled
2020-03-22 17:24:55.460025+0100 vuda_julia[16767:1233491] flock failed to lock maps file: errno = 35

Device: AMD Radeon R9 M370X
Transfer size (MB): 64

TIMING USING HOST:
Pageable transfers
  Host to Device bandwidth (GB/s): 0.585354
  Device to Host bandwidth (GB/s): 0.623467
*** Pageable transfers failed ***

Pinned transfers
  Host to Device bandwidth (GB/s): 6.04681
  Device to Host bandwidth (GB/s): 20.8089
*** Pinned transfers failed ***

TIMING USING EVENTS:
Pageable transfers
  Host to Device bandwidth  (GB/s): 2.70789
  Device to Host bandwidth  (GB/s): 5.27275
*** Pageable transfers failed ***

Pinned transfers
  Host to Device bandwidth  (GB/s): 19.8159
  Device to Host bandwidth  (GB/s): 18.5284
*** Pinned transfers failed ***

Program ended with exit code: 0

Is this an incomplete or wrong implementation in MoltenVK/macOS, or a bug in Vuda?

jgbit commented 4 years ago

Hi there, I cannot provide a full answer right now, as I am currently in the process of acquiring a macOS setup for tracing these cross-platform/architecture discrepancies down. However, as you already noticed, different cards come with different memory types. The Vulkan specification states that there must be at least one memory type with both VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT and VK_MEMORY_PROPERTY_HOST_COHERENT_BIT set, and at least one memory type with VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT set.

So, in principle, vuda cannot assume a memory type with the VK_MEMORY_PROPERTY_HOST_CACHED_BIT. That is why vuda always tries to find a fallback candidate with HOST_VISIBLE+HOST_COHERENT. However, in the current implementation this fails, because vudaFindMemoryType() throws a runtime error instead of returning -1. The fetching function findMemoryType_Cached() therefore never gets to try the fallback candidate. I will make sure that vuda allows this. This was an unfortunate bug introduced with the embedded kernel sample update.
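
The fallback described here can be sketched as a lookup that reports "not found" instead of throwing, so the caller can move on to the next candidate. This is only an illustration under stated assumptions: chooseCachedMemoryType and MemoryTypeChoice are hypothetical names, not vuda's API, and the preference order (cached and coherent first, then the HOST_VISIBLE+HOST_COHERENT type the specification guarantees) is taken from the explanation above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Bit values of VkMemoryPropertyFlagBits, copied here so the sketch needs no Vulkan headers.
constexpr std::uint32_t kHostVisible  = 0x2;
constexpr std::uint32_t kHostCoherent = 0x4;
constexpr std::uint32_t kHostCached   = 0x8;

// Hypothetical result type: which memory type was picked and what it offers.
struct MemoryTypeChoice {
    std::size_t index;     // position in the memoryTypes array
    std::uint32_t flags;   // propertyFlags of the chosen type
    // Without HOST_COHERENT the host must flush/invalidate mapped ranges itself.
    bool needsManualSync() const { return (flags & kHostCoherent) == 0; }
};

// Hypothetical helper: try each candidate flag set in order of preference and
// return the first memory type that satisfies one; never throws on a miss.
std::optional<MemoryTypeChoice> chooseCachedMemoryType(
    const std::vector<std::uint32_t>& typeFlags)
{
    assert(typeFlags.size() <= 32 && "Vulkan reports at most VK_MAX_MEMORY_TYPES (32) types");
    const std::uint32_t candidates[] = {
        kHostVisible | kHostCoherent | kHostCached,  // preferred
        kHostVisible | kHostCoherent,                // fallback guaranteed by the spec
    };
    for (std::uint32_t required : candidates)
        for (std::size_t i = 0; i < typeFlags.size(); ++i)
            if ((typeFlags[i] & required) == required)
                return MemoryTypeChoice{i, typeFlags[i]};
    return std::nullopt;
}
```

For the MoltenVK type list {0x0001, 0x0006, 0x000b} this picks memoryTypes[1], the coherent but uncached type. For a list that contains a 0x000e type, it picks that one instead.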

However, defining eCachedProperties = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT should allow the samples to execute (albeit possibly incorrectly). The thing is, if the memory type does not have the coherent flag, host memory management commands need to be in place to flush and invalidate memory ranges, so that host writes become visible to the device and device writes become visible to the host, respectively. This might be why you see the transfers failing when redefining eCachedProperties and eCachedInternalProperties.
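
For non-coherent memory, Vulkan provides vkFlushMappedMemoryRanges (after host writes) and vkInvalidateMappedMemoryRanges (before host reads of device writes). The ranges passed to both must respect VkPhysicalDeviceLimits::nonCoherentAtomSize: the offset must be a multiple of it, and the size must be a multiple of it unless the range ends at the end of the allocation. The sketch below only shows that alignment arithmetic. alignToAtom and AlignedRange are hypothetical names, and this is not necessarily how vuda handles it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical holder for the offset/size pair that would go into a VkMappedMemoryRange.
struct AlignedRange {
    std::uint64_t offset;
    std::uint64_t size;
};

// Expand [offset, offset + size) so that it starts on an atom boundary and
// either ends on one or ends exactly at the end of the allocation.
AlignedRange alignToAtom(std::uint64_t offset, std::uint64_t size,
                         std::uint64_t atomSize, std::uint64_t allocationSize)
{
    assert(atomSize > 0);
    assert(offset + size <= allocationSize);

    // Round the start down to the previous atom boundary.
    const std::uint64_t begin = (offset / atomSize) * atomSize;

    // Round the end up to the next atom boundary.
    std::uint64_t end = ((offset + size + atomSize - 1) / atomSize) * atomSize;

    // A range that reaches the end of the allocation is allowed to be unaligned there.
    if (end > allocationSize)
        end = allocationSize;

    return AlignedRange{begin, end - begin};
}
```

As an example with an assumed atom size of 64 bytes, a 50-byte write at offset 100 inside a 1024-byte allocation is widened to offset 64 and size 128 before it is flushed or invalidated.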

The memory allocation with the VK_MEMORY_PROPERTY_HOST_CACHED_BIT is used for pageable transfers from device to host in order to reach speeds comparable with CUDA (on NVIDIA architecture at least). You can test this on your Linux system, with and without. Also, the weird thing is that I seem to remember the RX Vega's 16 GiB heap 1 having the HOST_VISIBLE, HOST_COHERENT, and HOST_CACHED flags at some point. It might be different on macOS, or it might have changed.

Appreciate the feedback.

jgbit commented 4 years ago

Should be fixed now. Let me know.

astrotuna201 commented 4 years ago

Sorry, this is still not working. The available memory types are still:

memoryHeaps: count = 2
    memoryHeaps[0]:
        size   = 34342961152 (0x7ff000000) (31.98 GiB)
        budget = 34342961152 (0x7ff000000) (31.98 GiB)
        usage  = 0 (0x00000000) (0.00 B)
        flags: count = 1
            MEMORY_HEAP_DEVICE_LOCAL_BIT
    memoryHeaps[1]:
        size   = 206158430208 (0x3000000000) (192.00 GiB)
        budget = 132674641920 (0x1ee4066000) (123.56 GiB)
        usage  = 25579520 (0x01865000) (24.39 MiB)
        flags: count = 0
            None
memoryTypes: count = 3
    memoryTypes[0]:
        heapIndex     = 0
        propertyFlags = 0x0001: count = 1
            MEMORY_PROPERTY_DEVICE_LOCAL_BIT
        usable for:
            IMAGE_TILING_OPTIMAL: color images, FORMAT_D16_UNORM, FORMAT_D32_SFLOAT, FORMAT_S8_UINT, FORMAT_D24_UNORM_S8_UINT, FORMAT_D32_SFLOAT_S8_UINT
            IMAGE_TILING_LINEAR: None
    memoryTypes[1]:
        heapIndex     = 1
        propertyFlags = 0x0006: count = 2
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_COHERENT_BIT
        usable for:
            IMAGE_TILING_OPTIMAL: None
            IMAGE_TILING_LINEAR: None
    memoryTypes[2]:
        heapIndex     = 0
        propertyFlags = 0x000b: count = 3
            MEMORY_PROPERTY_DEVICE_LOCAL_BIT
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_CACHED_BIT
        usable for:
            IMAGE_TILING_OPTIMAL: color images
            IMAGE_TILING_LINEAR: None

hence the only HOST_CACHED memory type available is not HOST_COHERENT

so allocations of eCachedProperties or eCachedInternalProperties still fail at logical device.inl:233

host_cached_node_internal* dstptr = m_cachedBuffers.get_buffer(size, m_allocator);

in cachedbuffer.hpp:33

m_ptrMemBlock = allocator.allocate(vk::MemoryPropertyFlags(memoryPropertiesFlags::eCachedInternalProperties), size);

Perhaps the comment linked below indicates how to fix this? It requires memory flush/invalidate calls.

https://github.com/KhronosGroup/Vulkan-ValidationLayers/issues/693#issuecomment-463961366

jgbit commented 4 years ago

Hi again and thank you for pursuing the error.

I was finally able to get a real repro case running. The correct invalidates and flushes should be in place now for the HOST_CACHED (non-coherent) memory type, and the samples should compile.

Please let me know. Cheers.

astrotuna201 commented 4 years ago

Thanks, that seems to fix compilation! In the bandwidthtest.cpp file I had to adjust the memory allocation like the one used in events_and_bandwith.cpp to pass the memory copy test, though. Many thanks!