Open fyellin opened 1 month ago
If you can trigger the bug with a unit test in python, that would already be very useful in enabling others to investigate.
Thanks.
And now I think I'm going mad. After working on this for several hours, the code is suddenly working and my tests are passing. And I haven't (intentionally) changed anything! I'm going to close this bug for now and see if I can work out what has changed.
I was clearly tickling something to cause the bizarre behavior. I just have no idea what. Maybe I'll re-open this tomorrow.
You hunted the bug, now the bug will hunt you 😉
I am not sure if related, but I encountered another fatal error where the rust stack trace points towards wgpu_render_bundle_draw_indexed_indirect
when running this shader with wgpu-shadertoy
.
Don't have the time to minimize and backtrack right now since a deadline is coming up on Monday - but might be some overlap? Will have time next week to do some investigation myself and open an issue on all the panic/fatal errors
E: it's unrelated I rootcaused this to be something else
The bug is still there. I had accidentally neutralized a test. I'll update my code to show that that first bug is real! Putting the count into the data buffer makes things work.
Where is the right place to report this bug?
if the issue is within naga or the rust side of wgpu, it will be https://github.com/gfx-rs/wgpu, otherwide wgpu-native if you can recreate and maybe fix it there.
@Vipitis Not sure if this helps, but I get get crashes in the Rust code when I forget to set the pipeline in a pass. In particular, forgetting to set the pipeline in a RenderEncoder (which don't inherit the pipeline from the surrounding code) can be an issue. Not sure if that's what's causing your problems.
@fyellin honestly not sure. Open an issue in wgpu with a minimal reproducer, and if they don't tag it in the next few days they might redirect you elsewhere. I will have time on Monday to investigate which commit changed the behavior for the issue I encounter.
I posted a variety of translation issues with naga and some of them took half a year before getting addressed. Sadly I am not comfortable myself with rust to try and fix them. 😕
Just out of curiosity. When we test this on a Github Linux device, do we have a machine with an actually GPU in it, or are we testing it against an emulator? I could easily believe this was an error in the emulation code.
Looking for every occurrence of draw_indexed_indirect_count everywhere in the gfx-rx/ workspace, the only thing I find is functions calling other functions and passing along the arguments they were received. I haven't yet found where the code actually does something GPU specific
@fyellin currently all CI runners for this org are CPU only. So they use lava pipe to run graphics.
I think using real GPUs and other backend has been proposed, but not done yet: #459
If you believe there is an issue that only show sup locally on a real GPU but passes on CI that's definitely a possibility. We were having windows specific problems with surfaces in the recent wgpu-native update, and CI was all green.
@Vipitis: Just a hypothesis. I admit my search wasn't completely thorough, but I couldn't find any mistakes in the gfx-rs/ code. I have a Mac, and metal doesn't support those two instructions (yet). Any chance that someone with a real Linux box with a real GPU could try grabbing this code running the test tests/test_wgpu_vertex_instance.py, and letting me know what happens? At least we'd know if the bug was gfx-rs or the emulator.
@Vipitis @almarklein @Korijn
I've created a small test program that illustrates the bugs without any need to pull in my branch. It can be run on any machine that has a recent wgpu. If one of you has a Linux machine with a hardware GPU, would you mind running it and let me know what the output is? This would be definitive as to whether the problem is in gfx-rs or lavapipe.
The code is just a stripped down version of test_wgpu_vertex_instance, and I call libf.wgpuXXX directly. MultiDrawIndirect and MultiDrawIndirectCount should give the same results when the count of the former is the same as the computed count of the latter. This runs both and prints out both results.
And crap. I forgot github won't let me upload a file with extension .py. (Why?) I've renamed it to runner.txt, but it's really runner.py
Opened https://github.com/gfx-rs/wgpu-native/issues/429
I should probably close this.
I ran the runner.py
on a Linux machine. It has integrated graphics, so it does run on hardware (no emulation), but not a dedicated GPU. I'm not sure if this counts as a "real GPU".
Intel(R) UHD Graphics 730 (ADL-S GT1) (IntegratedGPU) via Vulkan
Running script: "/home/almar/Downloads/runner.py"
-------------------------
data=[0, 0, 1, 2, 3, 4, 5, 6, 7, 8], offset=8, count_data=[1], count_buffer_offset=0, max_count=1
DrawIndirect: [[3, 4], [3, 5]]
DrawIndirectCount: []
-------------------------
data=[1, 0, 1, 2, 3, 4, 5, 6, 7, 8], offset=8, count_data=[0], count_buffer_offset=0, max_count=1
DrawIndirect: [[3, 4], [3, 5]]
DrawIndirectCount: [[4, 5], [4, 6]]
-------------------------
data=[2, 0, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6], offset=8, count_data=[0], count_buffer_offset=0, max_count=2
DrawIndirect: [[3, 4], [3, 5], [4, 5], [4, 6], [4, 7], [5, 5], [5, 6], [5, 7]]
DrawIndirectCount: [[3, 4], [3, 5], [3, 6], [3, 7], [4, 2], [4, 3]]
-------------------------
data=[2, 0, 1, 2, 3, 4, 0, 2, 3, 4, 5, 0, 3, 4, 5, 6, 0], offset=8, count_data=[0], count_buffer_offset=0, max_count=2
DrawIndirect: [[3, 4], [3, 5]]
DrawIndirectCount: [[4, 0], [4, 1], [5, 0], [5, 1], [5, 2]]
I also ran it in software mode, by tweaking one line in your script:
# adapter = wgpu.gpu.request_adapter(power_preference="high-performance")
adapter = wgpu.gpu.enumerate_adapters()[1]
print(adapter.summary)
This produces a result that is wrong in a different way, I think.
llvmpipe (LLVM 15.0.7, 256 bits) (CPU) via Vulkan
-------------------------
data=[0, 0, 1, 2, 3, 4, 5, 6, 7, 8], offset=8, count_data=[1], count_buffer_offset=0, max_count=1
DrawIndirect: [[3, 4], [3, 5]]
DrawIndirectCount: []
-------------------------
data=[1, 0, 1, 2, 3, 4, 5, 6, 7, 8], offset=8, count_data=[0], count_buffer_offset=0, max_count=1
DrawIndirect: [[3, 4], [3, 5]]
DrawIndirectCount: [[3, 4], [3, 5]]
-------------------------
data=[2, 0, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6], offset=8, count_data=[0], count_buffer_offset=0, max_count=2
DrawIndirect: [[3, 4], [3, 5], [4, 5], [4, 6], [4, 7], [5, 5], [5, 6], [5, 7]]
DrawIndirectCount: [[3, 4], [3, 5], [5, 3], [5, 4], [5, 5], [5, 6], [6, 3], [6, 4], [6, 5], [6, 6], [7, 3], [7, 4], [7, 5], [7, 6]]
-------------------------
data=[2, 0, 1, 2, 3, 4, 0, 2, 3, 4, 5, 0, 3, 4, 5, 6, 0], offset=8, count_data=[0], count_buffer_offset=0, max_count=2
DrawIndirect: [[3, 4], [3, 5]]
DrawIndirectCount: [[3, 4], [3, 5], [4, 5], [4, 6], [4, 7], [5, 5], [5, 6], [5, 7]]
Let me know if you have other code to run.
At that point you hit the Graphics APIs (Vulkan, metal, dx12) and have to read up on their specs.
I deleted my comment because I realized I was in Windows Land.
I think I've given up on finding the bug. I think I have ample evidence that it is occurring, though.
I am curious if there are unit tests for the Rust code (or for the C interface). Maybe I can see something I'm missing.
Bug has been found. Unfortunately, it's in two different workspaces.
Count being pulled from the wrong buffer:
wgpu-core/src/command/render.rs.
In two different places the code has:
let count_buffer = buffers
.get_owned(buffer_id)
.map_err(|_| RenderPassErrorInner::InvalidBuffer(count_buffer_id))
.map_pass_err(scope)?;
The second line needs to have buffer_id
replaced with count_buffer_id
.
The second issue is more straightforward.
In wgpu-native/src/lib.rs
, the implementation of multiDrawIndirectCount
was calling multiDrawIndexedIndirectCount
. It was staring me in the face and I didn't see it.
Apparently no one is using the multi-draw-...count commands
I've submitted a PR to wgpu-native. https://github.com/gfx-rs/wgpu-native/pull/432
The change to wgpu had already been fixed and was submitted last month https://github.com/gfx-rs/wgpu/pull/6194 Can we make sure that this change makes it into our release?
So it looks like all bugs are fixed and now its a matter of getting upstream releases. I added a small tasklist to the top post.
I think https://github.com/gfx-rs/wgpu-native/issues/429 can be closed, right?
AK: problems fixed, now to get the fixes to our users:
After implementing
multi_draw_indirect
andmulti_draw_indexed_indirect
, I decided to try the..._count
version of these functions. These functions do not exist on the Mac, but I could get use mytest_wgpu_vertex_instance.py
file to test my implementation on the GitHub linux machines.The implementation is straightforward. In
extra.py
and in
api.py
:and identically for
multi_draw_indexed_indirect_count
adding_indexed
in the function name and function call.This definition of these functions are identical to the the non
_count
versions of the function, except for the determination of the number ofdraw
s to perform. That number is the minimum ofmax_count
and the u32 value at byte offsetcount_buffer_offset
incount_buffer
.The format of the
buffer
is supposed to be the same as the non_count
versions of these functions. In particular for the non-indexed version, they are supposed to be packed groups of 4u32
s, starting at byte offsetoffset
. Each group of 4 is[vertex_count, instance_count, first_vertex, first_index]
. So my vertex/instance test should generate the setfor each of these groups of four. This works correctly on all functions so far.
I've discovered two anomalies running tests on github.
First, the contents of
count_buffer
is completely ignored. Instead, thecount
value is fetched fromcount_buffer_offset
inbuffer
. I double checked that there wasn't a bug in my code and that I wasn't passingbuffer
as an argument wherecount_buffer
was expected.I rewrote my tests to put the count into an unused space in
buffer
. The_indexed_
version worked, but the non-_indexed_
version continued to fail. It seems to be completely mis-parsing the arguments inbuffer
. It is taking each group of five numbers (not four)[a, b, c, d, e]
and then acting as if these were[1, b, d, e]
as described above. Then after visiting the (vertex, instance) pairs as described above, it throws in some extra (vertex, instance) pairs that I can't make any sense of. [I did this by creating the buffer 1, 2, 3, .... 100 and then trying to make sense of the results for various offsets.]I was hoping I could dig through the naga source code and find where the actual count is determined and where the parsing of the values in the
buffer
occurred. I completely failed. I was hoping to either submit a bug report saying "here's the mistake". I don't know Rust well enough to generate a Rust test case.I will submit as a PR my current code as a work-in-progress.