godotengine / godot

Godot Engine – Multi-platform 2D and 3D game engine
https://godotengine.org
MIT License

`RenderingDevice.buffer_get_data` has very high latency #93687

Open · Carbonyte opened this issue 4 months ago

Carbonyte commented 4 months ago

Tested versions

System information

Godot v4.2.1.stable.mono - Windows 10.0.19045 - Vulkan (Forward+) - dedicated Radeon RX 580 Series (Advanced Micro Devices, Inc.; 31.0.21912.14) - AMD Ryzen 5 3600 6-Core Processor (12 Threads)

Issue description

`RenderingDevice.buffer_get_data` has very high overhead on small buffers, consistently taking >4ms to retrieve a 4-byte buffer on my device. For comparison, `RenderingDevice.texture_get_data` usually takes <10us.

Steps to reproduce

extends Node3D

var rd: RenderingDevice

func ssboTest():
    const bufferSize := 4

    var shaderSrc := RDShaderSource.new()
    shaderSrc.source_compute = """
    #version 450

    layout(set = 0, binding = 0, std430) restrict buffer Inp {
        float test[];
    };

    layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
    void main() {
        test[0] = 42;
    }
    """

    var shaderBytecode := rd.shader_compile_spirv_from_source(shaderSrc)
    var shader := rd.shader_create_from_spirv(shaderBytecode)
    var pipeline := rd.compute_pipeline_create(shader)

    var buffer := rd.storage_buffer_create(bufferSize)
    var uniform := RDUniform.new()
    uniform.uniform_type = RenderingDevice.UNIFORM_TYPE_STORAGE_BUFFER
    uniform.binding = 0
    uniform.add_id(buffer)
    var uniformSet := rd.uniform_set_create([uniform], shader, 0)

    var computeList := rd.compute_list_begin()
    rd.compute_list_bind_compute_pipeline(computeList, pipeline)
    rd.compute_list_bind_uniform_set(computeList, uniformSet, 0)
    rd.compute_list_dispatch(computeList, 1, 1, 1)
    rd.compute_list_end()

    rd.submit()
    rd.sync()

    var start := Time.get_ticks_usec()
    var bytes := rd.buffer_get_data(buffer)
    var elapsed := Time.get_ticks_usec() - start
    print("SSBO: Took %s us to read %s bytes, %s MB/s" % [elapsed, bytes.size(), 
        float(bytes.size()) / elapsed * 1000 * 1000 / 1024 / 1024 ])

    rd.free_rid(shader)
    rd.free_rid(buffer)

func textureTest():
    var shaderSrc := RDShaderSource.new()
    shaderSrc.source_compute = """
    #version 450

    layout(set = 0, binding = 0, r32f) uniform restrict writeonly image1D test;

    layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
    void main() {
        imageStore(test, 0, vec4(42));
    }
    """

    var shaderBytecode := rd.shader_compile_spirv_from_source(shaderSrc)
    var shader := rd.shader_create_from_spirv(shaderBytecode)
    var pipeline := rd.compute_pipeline_create(shader)

    var tf := RDTextureFormat.new()
    tf.format = RenderingDevice.DATA_FORMAT_R32_SFLOAT
    tf.texture_type = RenderingDevice.TEXTURE_TYPE_1D
    tf.width = 1
    tf.height = 1
    tf.depth = 1
    tf.array_layers = 1
    tf.mipmaps = 1
    tf.usage_bits = (
        RenderingDevice.TEXTURE_USAGE_CAN_UPDATE_BIT |
        RenderingDevice.TEXTURE_USAGE_STORAGE_BIT |
        RenderingDevice.TEXTURE_USAGE_CPU_READ_BIT |
        RenderingDevice.TEXTURE_USAGE_CAN_COPY_FROM_BIT
    )
    var texture := rd.texture_create(tf, RDTextureView.new(), [])
    var uniform := RDUniform.new()
    uniform.uniform_type = RenderingDevice.UNIFORM_TYPE_IMAGE
    uniform.binding = 0
    uniform.add_id(texture)
    var uniformSet := rd.uniform_set_create([uniform], shader, 0)

    var computeList := rd.compute_list_begin()
    rd.compute_list_bind_compute_pipeline(computeList, pipeline)
    rd.compute_list_bind_uniform_set(computeList, uniformSet, 0)
    rd.compute_list_dispatch(computeList, 1, 1, 1)
    rd.compute_list_end()

    rd.submit()
    rd.sync()

    var start := Time.get_ticks_usec()
    var bytes := rd.texture_get_data(texture, 0)
    var elapsed := Time.get_ticks_usec() - start
    print("Texture: Took %s us to read %s bytes, %s MB/s" % [elapsed, bytes.size(), 
        float(bytes.size()) / elapsed * 1000 * 1000 / 1024 / 1024 ])

    rd.free_rid(shader)
    rd.free_rid(texture)

func _ready():
    rd = RenderingServer.create_local_rendering_device()

    ssboTest()
    textureTest()

Minimal reproduction project (MRP)

repro.zip

clayjohn commented 4 months ago

I think the difference is that you are keeping the texture in CPU-accessible memory, so the data doesn't need to be copied from the GPU.

Out of curiosity, can you try creating the texture without the TEXTURE_USAGE_CPU_READ_BIT flag?
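Something like this, I mean (an untested sketch, just the usage_bits change to the format in textureTest()):

# Same RDTextureFormat as in textureTest(), minus TEXTURE_USAGE_CPU_READ_BIT,
# so the image lives in GPU-local memory and texture_get_data() has to copy
# it back to the CPU.
tf.usage_bits = (
    RenderingDevice.TEXTURE_USAGE_CAN_UPDATE_BIT |
    RenderingDevice.TEXTURE_USAGE_STORAGE_BIT |
    RenderingDevice.TEXTURE_USAGE_CAN_COPY_FROM_BIT
)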

Edit: I ended up testing myself and can confirm your results. On the first run I got about 3000us for ssboTest() and 8us for textureTest().

Then I removed TEXTURE_USAGE_CPU_READ_BIT and it became 3000us and 400us respectively.

Then I swapped the order of ssboTest() and textureTest() (I called textureTest() first) and got 3000us for textureTest() and 2us for ssboTest().

And as a bonus: if I leave TEXTURE_USAGE_CPU_READ_BIT enabled and keep the calls swapped, I got 8us for textureTest() and 3000us for ssboTest().

So this is pretty clearly just highlighting the real cost of stalling the GPU to read back data. Whichever test does the readback first incurs that cost; the second one doesn't pay it because the GPU has already been stalled. If you keep your texture in CPU-accessible memory, you never have to pay that cost, which clearly pays off.
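One quick way to see the stall in isolation (a rough sketch reusing rd and buffer from ssboTest(); I haven't timed this exact snippet): read the same buffer twice in a row. The first call should absorb the stall and the second should return almost immediately.

# Rough sketch: assumes `rd` and `buffer` from ssboTest(), right after sync().
# The first readback pays for stalling the GPU; a second readback of the same
# buffer should be much cheaper because the GPU is already idle.
var t0 := Time.get_ticks_usec()
rd.buffer_get_data(buffer)
print("first read:  %d us" % (Time.get_ticks_usec() - t0))

var t1 := Time.get_ticks_usec()
rd.buffer_get_data(buffer)
print("second read: %d us" % (Time.get_ticks_usec() - t1))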

Carbonyte commented 4 months ago

Doing your tests on my device gives similar results. The problem, then, is that there is no equivalent of TEXTURE_USAGE_CPU_READ_BIT for SSBOs exposed through the API. Comparing the source code for storage_buffer_create and texture_create: setting TEXTURE_USAGE_CPU_READ_BIT in turn sets VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT on the driver side when the texture is allocated. driver->buffer_create also has options that set this flag, but they aren't exposed by storage_buffer_create.
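As a stopgap from script (just a sketch of the idea, not claiming it's the intended API), small results can be written to a storage image created with TEXTURE_USAGE_CPU_READ_BIT and read back with texture_get_data(), which is essentially what textureTest() above already does:

# Stopgap sketch: keep the result in a CPU-readable storage image instead of
# an SSBO, since storage_buffer_create() exposes no host-visible option.
var tf := RDTextureFormat.new()
tf.format = RenderingDevice.DATA_FORMAT_R32_SFLOAT
tf.texture_type = RenderingDevice.TEXTURE_TYPE_1D
tf.usage_bits = (
    RenderingDevice.TEXTURE_USAGE_STORAGE_BIT |
    RenderingDevice.TEXTURE_USAGE_CPU_READ_BIT |
    RenderingDevice.TEXTURE_USAGE_CAN_COPY_FROM_BIT
)
var result_tex := rd.texture_create(tf, RDTextureView.new(), [])
# ...bind as UNIFORM_TYPE_IMAGE in the compute uniform set, dispatch,
# submit(), sync(), then:
var result_bytes := rd.texture_get_data(result_tex, 0)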

As an aside, there was one case where your tests produced a different result on my machine. Running textureTest() before ssboTest() with TEXTURE_USAGE_CPU_READ_BIT enabled, I got 5us for textureTest() and 145us for ssboTest(). That is, SSBO access sped up significantly, even though the GPU didn't appear to stall. EDIT: Removing the texture access entirely (that is, just creating the texture without reading from it) gives the same result. I noticed some code in the driver for pooling small allocations, so maybe the SSBO is actually reusing part of the texture's allocation.