gpuweb / gpuweb

Where the GPU for the Web work happens!
http://webgpu.io
Other
4.68k stars 304 forks source link

Make sure the device is valid when creating a texture view #4678

Closed teoxoy closed 1 month ago

teoxoy commented 1 month ago

I think this validation was missing and makes the creation APIs consistent. I think the underlying APIs also require this but I haven't seen it spelled out.

github-actions[bot] commented 1 month ago

Previews, as seen when this build job started (81ba52fc619e335b95fa4687aec1cb886c6d7b31): WebGPU webgpu.idl | Explainer | Correspondence Reference WGSL grammar.js | wgsl.lalr.txt

teoxoy commented 1 month ago

Probably a leftover from when we thought we'd write things such that the device becoming invalid would make all the child objects invalid

On a related note, is it intended that the encoder finalization methods (.finish()/.end()) only check for the encoder itself to be valid? I was trying to find some info on what the underlying APIs do when the device was lost and we'd still call these encoder methods (is that ok to do?). Or, regardless if the APIs we target allow this, should our policy be that if the device becomes invalid/lost all methods on its descendants do nothing if they are called?

Kangz commented 1 month ago

My assumption would be that these would also make the objects invalid. You can't do anything with these objects anyway, but it's clearer if they are made invalid i nthe spec.

Or, regardless if the APIs we target allow this, should our policy be that if the device becomes invalid/lost all methods on its descendants do nothing if they are called?

That should already the case, except for a few things that are necessary to make lost devices look similar to non-lost devices (for application robustness). Like unmapping buffers or a few other things.

kainino0x commented 4 weeks ago

On a related note, is it intended that the encoder finalization methods (.finish()/.end()) only check for the encoder itself to be valid?

This is a very good question, I'm not sure. It feels a little redundant - we could just as well put a check for device loss in every encoder command but what would be the point? But as Corentin says it's also a bit clearer in the spec if we check consistently.

In the end the onus needs to be on the implementation to use the backend API correctly, handling when things blow up due to device loss - the spec doesn't have to say that stuff. Just making most stuff unobservable after device loss should be enough to let implementations do what they need to do.

All together, this has me wondering again if we should just remove most of the device-timeline checks for device loss. Instead we could probably describe the behavior in just one place, for example either:

+@toji for thoughts too

toji commented 3 weeks ago

we could just as well put a check for device loss in every encoder command...

we could have an algorithm to "issue steps to the queue timeline" which checks for device loss before issuing them.

If the concern is preventing queue timeline steps on lost devices, then doesn't that imply that we don't need to do anything explicit in the encoder? It doesn't issue any queue steps, just enqueues them for later execution at submit time. As such, a check in submit (and other queue methods) should be sufficient for all things encoder. Right?

Beyond that, it feels like you could cover a lot of implementation details with a non-normative note saying that since lost devices do not execute any work on the GPU implementations are free to optimize away encoder steps which cannot ultimately result in queue work.

That doesn't cover resource creation, but those all have device loss checks at the appropriate points already.

kainino0x commented 3 weeks ago

Good point, if just for queue timeline steps, we wouldn't have to do anything encoder commands. But actually one case of "device loss" is that the GPU process crashes, and then we can't even do anything on (the real world equivalent of) the device timeline. Right now an implementation has to do things not explained by the spec in order to pretend to run some of the trivial device-timeline steps (e.g. reject mapAsync).

It's probably overly onerous to try to actually have the spec spell out all those things, but we end up leaning heavily on non-obvious guarantees about what steps are unobservable and therefore never need to run. Explicit checks for device loss put a clear cutoff point.

I just realized encoder commands already all validate the encoder state, so we could just as well check for device loss in there.