Closed zakorgy closed 6 years ago
Maybe we don't finish the draw before we read in the release build?
If by "read" you mean fetching in the shader, then no, there shouldn't be any explicit synchronization - Metal should be taking care of it.
I have checked the 'All Resources' tab and looks like both TextureView belongs to a 2048x2048 R8 Texture
Check what slices/mip levels it views and under what format.
So are you basically saying that the texture view doesn't get the proper texels in Release? That sounds pretty bad.
We found that the origin of the issue is the clear (first) step. For some reason in release build we don't clear with the correct values, in debug we have (1.0, 1.0, 1.0, 0.0) which is the expected in release we have some weird numbers as the fragment output. Here are some images from Xcode which show the different fragment bytes.
Debug:
Release:
The fragment values comes from the ps_clear0_float call from the clear.metal shader. We think that the value
variable receives a wrong value in release. Also we have checked manually, what happens if we return with the expected (1.0, 1.0, 1.0, 0.0) value in this call, and the test passed with release build.
We suspect something goes wrong in the clear_attachment call in gfx-backend-metal, but not found any clue yet.
This is where the buffer data gets in: https://github.com/gfx-rs/gfx/blob/122da131665d8a757bdda7323442b9beb9ac2088/src/backend/metal/src/command.rs#L2326
An interesting fact here is that this code is unsafe for the following reason: we borrow the clear color, we take a pointer to the clear floats, but that pointer is only used after the borrow ends. So technically the pointer is invalidated, and we are bad, and we should feel bad.
Could you trace this down in rust code (by adding println!
at least) to see why/how/where that value changed? There is a chance we'll fix it by making this code safer.
Our side should be fixed by https://github.com/gfx-rs/gfx/pull/2402 (hopefully!)
Unfortunately the issue still exists with that patch. :(
Hmm the borrow should exist until this issue https://github.com/gfx-rs/gfx/blob/master/src/backend/metal/src/command.rs#L2393 , right?
Maybe this should look like the vertices
Vec.
Please do try to debug this with println!
s. Hopefully we'll see how it's broken.
If I put println!
s in the code, it actually works. But only because the first number of the value
becomes a 1.0, and the other three will be just memory garbage. And for this shader we only need the first value.
If I put a print here, the value is com = BindBufferData { stage: Fragment, index: 0, words: [1065353216, 1065353216, 1065353216, 0] }
, which is the correct value. But the value will be wrong when used.
If I put a print here and the previous exists as well, the values are
com = BindBufferData { stage: Fragment, index: 0, words: [1065353216, 1065353216, 1065353216, 0] }
com_clear = BindBufferData { stage: Fragment, index: 0, words: [1065353216, 1065353216, 1065353216, 0] }
And the test will pass.
And if i use the second print only, the value is com_clear = BindBufferData { stage: Fragment, index: 0, words: [1065354250, 0, 3906442912, 32766] }
, and the test will pass, because the first number is correct.
What happens if you debug this in XCode (or with rust-lldb directly) on release? Would you be able to set a data breakpoint on those values?
oh snap, I know what's going on: let raw_value = com::ClearColorRaw::from(value)
is on stack!
Ok, this is the fix for sure - https://github.com/gfx-rs/gfx/pull/2403
Yep, this one works. \o/
gfx-rs/gfx#2403 fixes this and a lot of failing wpt tests on our side, I will update #173 according to that.
We captured one of the failing wpt css test(/css/css-backgrounds/border-bottom-left-radius-010.xht) with servo and loaded it with wrench. Here is what we found:
Loading the capture with debug build draws the correct result:
Loading the capture with release build draws the same wrong result we have with servo(release build).
The shader which causes the problem is the
brush_solid
with thealpha_pass
feature. The problematic part is in thedo_clip
call in this line: https://github.com/servo/webrender/blob/4fef6ca5ecce2c90703eace0212caff79374f129/webrender/res/prim_shared.glsl#L256 where we sample thetCache
, which is a 2048x2048 R8 texture. In debug we read the correct value1.0
, in release we read0.0
. I checked the other failing wpt tests and found that thebrush_solid
shader withalpha_pass
is responsible for most of our failing wpt tests.What I have found with xcode: (The
tCache
has the following lifecycle in the renderpass):We clear a rectangular area before the draw. The TextureView id is higlighted.
Then we draw a rectangle on that area. The TextureView id is higlighted and same as before:
We use
tCache
and read it in thebrush_solid
shader indo_clip
. But we have a different TextureView id here:I have checked the 'All Resources' tab and looks like both TextureView belongs to a 2048x2048 R8 Texture:
Maybe we don't finish the draw before we read in the release build? Any thoughts on this @kvark?