pygfx / wgpu-py

WebGPU for Python
https://wgpu-py.readthedocs.io
BSD 2-Clause "Simplified" License
462 stars 36 forks source link

Probably naga bug in multiDrawIndirectCount and multiDrawIndirectIndexedCount #592

Open fyellin opened 1 month ago

fyellin commented 1 month ago

AK: problems fixed, now to get the fixes to our users:


After implementing multi_draw_indirect and multi_draw_indexed_indirect, I decided to try the ..._count version of these functions. These functions do not exist on the Mac, but I could get use my test_wgpu_vertex_instance.py file to test my implementation on the GitHub linux machines.

The implementation is straightforward. In extra.py

def multi_draw_indirect_count(
     render_pass_encoder, buffer,*,offset=0,count_buffer, count_buffer_offset=0, max_count
):
    render_pass_encoder._multi_draw_indirect_count(
        buffer, offset, count_buffer, count_buffer_offset, max_count
    )
def multi_draw_indexed_indirect_count(
     render_pass_encoder, buffer,*,offset=0,count_buffer, count_buffer_offset=0, max_count
):
    render_pass_encoder._multi_draw_indexed_indirect_count(
        buffer, offset, count_buffer, count_buffer_offset, max_count
    )

and in api.py:

def _multi_draw_indirect_count(
        self, buffer, offset, count_buffer, count_buffer_offset, max_count
):
        libf.wgpuRenderPassEncoderMultiDrawIndirectCount(
            self._internal, buffer._internal, int(offset), count_buffer._internal, int(count_buffer_offset), int(max_count),
        )

and identically for multi_draw_indexed_indirect_count adding _indexed in the function name and function call.

This definition of these functions are identical to the the non _count versions of the function, except for the determination of the number of draws to perform. That number is the minimum of max_count and the u32 value at byte offset count_buffer_offset in count_buffer.

The format of the buffer is supposed to be the same as the non_count versions of these functions. In particular for the non-indexed version, they are supposed to be packed groups of 4 u32s, starting at byte offset offset. Each group of 4 is [vertex_count, instance_count, first_vertex, first_index]. So my vertex/instance test should generate the set

{ itertools.product(range(first_vertex, first_vertex + vertex_count), range(first_instance, first_instance + instance_count))}

for each of these groups of four. This works correctly on all functions so far.

I've discovered two anomalies running tests on github.

First, the contents of count_buffer is completely ignored. Instead, the count value is fetched from count_buffer_offset in buffer. I double checked that there wasn't a bug in my code and that I wasn't passing buffer as an argument where count_buffer was expected.

I rewrote my tests to put the count into an unused space in buffer. The _indexed_ version worked, but the non-_indexed_ version continued to fail. It seems to be completely mis-parsing the arguments in buffer. It is taking each group of five numbers (not four) [a, b, c, d, e] and then acting as if these were [1, b, d, e] as described above. Then after visiting the (vertex, instance) pairs as described above, it throws in some extra (vertex, instance) pairs that I can't make any sense of. [I did this by creating the buffer 1, 2, 3, .... 100 and then trying to make sense of the results for various offsets.]

I was hoping I could dig through the naga source code and find where the actual count is determined and where the parsing of the values in the buffer occurred. I completely failed. I was hoping to either submit a bug report saying "here's the mistake". I don't know Rust well enough to generate a Rust test case.

I will submit as a PR my current code as a work-in-progress.

Korijn commented 1 month ago

If you can trigger the bug with a unit test in python, that would already be very useful in enabling others to investigate.

fyellin commented 1 month ago

Thanks.

And now I think I'm going mad. After working on this for several hours, the code is suddenly working and my tests are passing. And I haven't (intentionally) changed anything! I'm going to close this bug for now and see if I can work out what has changed.

I was clearly tickling something to cause the bizarre behavior. I just have no idea what. Maybe I'll re-open this tomorrow.

almarklein commented 1 month ago

You hunted the bug, now the bug will hunt you 😉

Vipitis commented 1 month ago

I am not sure if related, but I encountered another fatal error where the rust stack trace points towards wgpu_render_bundle_draw_indexed_indirect when running this shader with wgpu-shadertoy.

Don't have the time to minimize and backtrack right now since a deadline is coming up on Monday - but might be some overlap? Will have time next week to do some investigation myself and open an issue on all the panic/fatal errors

E: it's unrelated I rootcaused this to be something else

fyellin commented 1 month ago

The bug is still there. I had accidentally neutralized a test. I'll update my code to show that that first bug is real! Putting the count into the data buffer makes things work.

Where is the right place to report this bug?

Vipitis commented 1 month ago

if the issue is within naga or the rust side of wgpu, it will be https://github.com/gfx-rs/wgpu, otherwide wgpu-native if you can recreate and maybe fix it there.

fyellin commented 1 month ago

@Vipitis Not sure if this helps, but I get get crashes in the Rust code when I forget to set the pipeline in a pass. In particular, forgetting to set the pipeline in a RenderEncoder (which don't inherit the pipeline from the surrounding code) can be an issue. Not sure if that's what's causing your problems.

Vipitis commented 1 month ago

@fyellin honestly not sure. Open an issue in wgpu with a minimal reproducer, and if they don't tag it in the next few days they might redirect you elsewhere. I will have time on Monday to investigate which commit changed the behavior for the issue I encounter.

I posted a variety of translation issues with naga and some of them took half a year before getting addressed. Sadly I am not comfortable myself with rust to try and fix them. 😕

fyellin commented 1 month ago

Just out of curiosity. When we test this on a Github Linux device, do we have a machine with an actually GPU in it, or are we testing it against an emulator? I could easily believe this was an error in the emulation code.

Looking for every occurrence of draw_indexed_indirect_count everywhere in the gfx-rx/ workspace, the only thing I find is functions calling other functions and passing along the arguments they were received. I haven't yet found where the code actually does something GPU specific

Vipitis commented 1 month ago

@fyellin currently all CI runners for this org are CPU only. So they use lava pipe to run graphics.

I think using real GPUs and other backend has been proposed, but not done yet: #459

If you believe there is an issue that only show sup locally on a real GPU but passes on CI that's definitely a possibility. We were having windows specific problems with surfaces in the recent wgpu-native update, and CI was all green.

fyellin commented 1 month ago

@Vipitis: Just a hypothesis. I admit my search wasn't completely thorough, but I couldn't find any mistakes in the gfx-rs/ code. I have a Mac, and metal doesn't support those two instructions (yet). Any chance that someone with a real Linux box with a real GPU could try grabbing this code running the test tests/test_wgpu_vertex_instance.py, and letting me know what happens? At least we'd know if the bug was gfx-rs or the emulator.

fyellin commented 1 month ago

@Vipitis @almarklein @Korijn

I've created a small test program that illustrates the bugs without any need to pull in my branch. It can be run on any machine that has a recent wgpu. If one of you has a Linux machine with a hardware GPU, would you mind running it and let me know what the output is? This would be definitive as to whether the problem is in gfx-rs or lavapipe.

The code is just a stripped down version of test_wgpu_vertex_instance, and I call libf.wgpuXXX directly. MultiDrawIndirect and MultiDrawIndirectCount should give the same results when the count of the former is the same as the computed count of the latter. This runs both and prints out both results.

And crap. I forgot github won't let me upload a file with extension .py. (Why?) I've renamed it to runner.txt, but it's really runner.py

runner.txt

fyellin commented 1 month ago

Opened https://github.com/gfx-rs/wgpu-native/issues/429

I should probably close this.

almarklein commented 1 month ago

I ran the runner.py on a Linux machine. It has integrated graphics, so it does run on hardware (no emulation), but not a dedicated GPU. I'm not sure if this counts as a "real GPU".

Intel(R) UHD Graphics 730 (ADL-S GT1) (IntegratedGPU) via Vulkan
Running script: "/home/almar/Downloads/runner.py"
-------------------------
data=[0, 0, 1, 2, 3, 4, 5, 6, 7, 8], offset=8, count_data=[1], count_buffer_offset=0, max_count=1
DrawIndirect:      [[3, 4], [3, 5]]
DrawIndirectCount: []
-------------------------
data=[1, 0, 1, 2, 3, 4, 5, 6, 7, 8], offset=8, count_data=[0], count_buffer_offset=0, max_count=1
DrawIndirect:      [[3, 4], [3, 5]]
DrawIndirectCount: [[4, 5], [4, 6]]
-------------------------
data=[2, 0, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6], offset=8, count_data=[0], count_buffer_offset=0, max_count=2
DrawIndirect:      [[3, 4], [3, 5], [4, 5], [4, 6], [4, 7], [5, 5], [5, 6], [5, 7]]
DrawIndirectCount: [[3, 4], [3, 5], [3, 6], [3, 7], [4, 2], [4, 3]]
-------------------------
data=[2, 0, 1, 2, 3, 4, 0, 2, 3, 4, 5, 0, 3, 4, 5, 6, 0], offset=8, count_data=[0], count_buffer_offset=0, max_count=2
DrawIndirect:      [[3, 4], [3, 5]]
DrawIndirectCount: [[4, 0], [4, 1], [5, 0], [5, 1], [5, 2]]

I also ran it in software mode, by tweaking one line in your script:

        # adapter = wgpu.gpu.request_adapter(power_preference="high-performance")
        adapter = wgpu.gpu.enumerate_adapters()[1]
        print(adapter.summary)

This produces a result that is wrong in a different way, I think.

llvmpipe (LLVM 15.0.7, 256 bits) (CPU) via Vulkan
-------------------------
data=[0, 0, 1, 2, 3, 4, 5, 6, 7, 8], offset=8, count_data=[1], count_buffer_offset=0, max_count=1
DrawIndirect:      [[3, 4], [3, 5]]
DrawIndirectCount: []
-------------------------
data=[1, 0, 1, 2, 3, 4, 5, 6, 7, 8], offset=8, count_data=[0], count_buffer_offset=0, max_count=1
DrawIndirect:      [[3, 4], [3, 5]]
DrawIndirectCount: [[3, 4], [3, 5]]
-------------------------
data=[2, 0, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6], offset=8, count_data=[0], count_buffer_offset=0, max_count=2
DrawIndirect:      [[3, 4], [3, 5], [4, 5], [4, 6], [4, 7], [5, 5], [5, 6], [5, 7]]
DrawIndirectCount: [[3, 4], [3, 5], [5, 3], [5, 4], [5, 5], [5, 6], [6, 3], [6, 4], [6, 5], [6, 6], [7, 3], [7, 4], [7, 5], [7, 6]]
-------------------------
data=[2, 0, 1, 2, 3, 4, 0, 2, 3, 4, 5, 0, 3, 4, 5, 6, 0], offset=8, count_data=[0], count_buffer_offset=0, max_count=2
DrawIndirect:      [[3, 4], [3, 5]]
DrawIndirectCount: [[3, 4], [3, 5], [4, 5], [4, 6], [4, 7], [5, 5], [5, 6], [5, 7]]
almarklein commented 1 month ago

Let me know if you have other code to run.

Vipitis commented 1 month ago

At that point you hit the Graphics APIs (Vulkan, metal, dx12) and have to read up on their specs.

fyellin commented 1 month ago

I deleted my comment because I realized I was in Windows Land.

I think I've given up on finding the bug. I think I have ample evidence that it is occurring, though.

fyellin commented 1 month ago

I am curious if there are unit tests for the Rust code (or for the C interface). Maybe I can see something I'm missing.

fyellin commented 1 month ago

Bug has been found. Unfortunately, it's in two different workspaces.

Count being pulled from the wrong buffer: wgpu-core/src/command/render.rs.
In two different places the code has:

     let count_buffer = buffers
            .get_owned(buffer_id)
            .map_err(|_| RenderPassErrorInner::InvalidBuffer(count_buffer_id))
            .map_pass_err(scope)?;

The second line needs to have buffer_id replaced with count_buffer_id.

The second issue is more straightforward.

In wgpu-native/src/lib.rs, the implementation of multiDrawIndirectCount was calling multiDrawIndexedIndirectCount. It was staring me in the face and I didn't see it.

Apparently no one is using the multi-draw-...count commands

fyellin commented 1 month ago

I've submitted a PR to wgpu-native. https://github.com/gfx-rs/wgpu-native/pull/432

The change to wgpu had already been fixed and was submitted last month https://github.com/gfx-rs/wgpu/pull/6194 Can we make sure that this change makes it into our release?

almarklein commented 1 month ago

So it looks like all bugs are fixed and now its a matter of getting upstream releases. I added a small tasklist to the top post.

I think https://github.com/gfx-rs/wgpu-native/issues/429 can be closed, right?