szeged / webrender

A GPU-based renderer for the web
https://doc.servo.org/webrender/
Mozilla Public License 2.0
45 stars 7 forks source link

Difference between debug and release build on metal #216

Closed zakorgy closed 6 years ago

zakorgy commented 6 years ago

We captured one of the failing wpt css test(/css/css-backgrounds/border-bottom-left-radius-010.xht) with servo and loaded it with wrench. Here is what we found:

screen shot 2018-09-13 at 13 47 47

  1. We use tCacheand read it in the brush_solid shader in do_clip. But we have a different TextureView id here: screen shot 2018-09-13 at 13 49 03

  2. I have checked the 'All Resources' tab and looks like both TextureView belongs to a 2048x2048 R8 Texture: screen shot 2018-09-13 at 13 49 58

Maybe we don't finish the draw before we read in the release build? Any thoughts on this @kvark?

kvark commented 6 years ago

Maybe we don't finish the draw before we read in the release build?

If by "read" you mean fetching in the shader, then no, there shouldn't be any explicit synchronization - Metal should be taking care of it.

I have checked the 'All Resources' tab and looks like both TextureView belongs to a 2048x2048 R8 Texture

Check what slices/mip levels it views and under what format.

So are you basically saying that the texture view doesn't get the proper texels in Release? That sounds pretty bad.

zakorgy commented 6 years ago

We found that the origin of the issue is the clear (first) step. For some reason in release build we don't clear with the correct values, in debug we have (1.0, 1.0, 1.0, 0.0) which is the expected in release we have some weird numbers as the fragment output. Here are some images from Xcode which show the different fragment bytes.

Debug: screen shot 2018-09-14 at 10 52 45

Release: screen shot 2018-09-14 at 11 09 52

The fragment values comes from the ps_clear0_float call from the clear.metal shader. We think that the value variable receives a wrong value in release. Also we have checked manually, what happens if we return with the expected (1.0, 1.0, 1.0, 0.0) value in this call, and the test passed with release build. We suspect something goes wrong in the clear_attachment call in gfx-backend-metal, but not found any clue yet.

kvark commented 6 years ago

This is where the buffer data gets in: https://github.com/gfx-rs/gfx/blob/122da131665d8a757bdda7323442b9beb9ac2088/src/backend/metal/src/command.rs#L2326 An interesting fact here is that this code is unsafe for the following reason: we borrow the clear color, we take a pointer to the clear floats, but that pointer is only used after the borrow ends. So technically the pointer is invalidated, and we are bad, and we should feel bad. Could you trace this down in rust code (by adding println! at least) to see why/how/where that value changed? There is a chance we'll fix it by making this code safer.

kvark commented 6 years ago

Our side should be fixed by https://github.com/gfx-rs/gfx/pull/2402 (hopefully!)

zakorgy commented 6 years ago

Unfortunately the issue still exists with that patch. :(

zakorgy commented 6 years ago

Hmm the borrow should exist until this issue https://github.com/gfx-rs/gfx/blob/master/src/backend/metal/src/command.rs#L2393 , right? Maybe this should look like the vertices Vec.

kvark commented 6 years ago

Please do try to debug this with println!s. Hopefully we'll see how it's broken.

zakorgy commented 6 years ago

If I put println!s in the code, it actually works. But only because the first number of the value becomes a 1.0, and the other three will be just memory garbage. And for this shader we only need the first value.

zakorgy commented 6 years ago

If I put a print here, the value is com = BindBufferData { stage: Fragment, index: 0, words: [1065353216, 1065353216, 1065353216, 0] }, which is the correct value. But the value will be wrong when used.

If I put a print here and the previous exists as well, the values are

com = BindBufferData { stage: Fragment, index: 0, words: [1065353216, 1065353216, 1065353216, 0] }
com_clear = BindBufferData { stage: Fragment, index: 0, words: [1065353216, 1065353216, 1065353216, 0] }

And the test will pass.

And if i use the second print only, the value is com_clear = BindBufferData { stage: Fragment, index: 0, words: [1065354250, 0, 3906442912, 32766] }, and the test will pass, because the first number is correct.

kvark commented 6 years ago

What happens if you debug this in XCode (or with rust-lldb directly) on release? Would you be able to set a data breakpoint on those values?

kvark commented 6 years ago

oh snap, I know what's going on: let raw_value = com::ClearColorRaw::from(value) is on stack!

kvark commented 6 years ago

Ok, this is the fix for sure - https://github.com/gfx-rs/gfx/pull/2403

dati91 commented 6 years ago

Yep, this one works. \o/

zakorgy commented 6 years ago

gfx-rs/gfx#2403 fixes this and a lot of failing wpt tests on our side, I will update #173 according to that.