Closed teoxoy closed 1 month ago
Previews, as seen when this build job started (81ba52fc619e335b95fa4687aec1cb886c6d7b31):
WebGPU webgpu.idl
| Explainer | Correspondence Reference
WGSL grammar.js
| wgsl.lalr.txt
Probably a leftover from when we thought we'd write things such that the device becoming invalid would make all the child objects invalid
On a related note, is it intended that the encoder finalization methods (.finish()
/.end()
) only check for the encoder itself to be valid? I was trying to find some info on what the underlying APIs do when the device was lost and we'd still call these encoder methods (is that ok to do?). Or, regardless if the APIs we target allow this, should our policy be that if the device becomes invalid/lost all methods on its descendants do nothing if they are called?
My assumption would be that these would also make the objects invalid. You can't do anything with these objects anyway, but it's clearer if they are made invalid i nthe spec.
Or, regardless if the APIs we target allow this, should our policy be that if the device becomes invalid/lost all methods on its descendants do nothing if they are called?
That should already the case, except for a few things that are necessary to make lost devices look similar to non-lost devices (for application robustness). Like unmapping buffers or a few other things.
On a related note, is it intended that the encoder finalization methods (
.finish()
/.end()
) only check for the encoder itself to be valid?
This is a very good question, I'm not sure. It feels a little redundant - we could just as well put a check for device loss in every encoder command but what would be the point? But as Corentin says it's also a bit clearer in the spec if we check consistently.
In the end the onus needs to be on the implementation to use the backend API correctly, handling when things blow up due to device loss - the spec doesn't have to say that stuff. Just making most stuff unobservable after device loss should be enough to let implementations do what they need to do.
All together, this has me wondering again if we should just remove most of the device-timeline checks for device loss. Instead we could probably describe the behavior in just one place, for example either:
+@toji for thoughts too
we could just as well put a check for device loss in every encoder command...
we could have an algorithm to "issue steps to the queue timeline" which checks for device loss before issuing them.
If the concern is preventing queue timeline steps on lost devices, then doesn't that imply that we don't need to do anything explicit in the encoder? It doesn't issue any queue steps, just enqueues them for later execution at submit time. As such, a check in submit (and other queue methods) should be sufficient for all things encoder. Right?
Beyond that, it feels like you could cover a lot of implementation details with a non-normative note saying that since lost devices do not execute any work on the GPU implementations are free to optimize away encoder steps which cannot ultimately result in queue work.
That doesn't cover resource creation, but those all have device loss checks at the appropriate points already.
Good point, if just for queue timeline steps, we wouldn't have to do anything encoder commands. But actually one case of "device loss" is that the GPU process crashes, and then we can't even do anything on (the real world equivalent of) the device timeline. Right now an implementation has to do things not explained by the spec in order to pretend to run some of the trivial device-timeline steps (e.g. reject mapAsync).
It's probably overly onerous to try to actually have the spec spell out all those things, but we end up leaning heavily on non-obvious guarantees about what steps are unobservable and therefore never need to run. Explicit checks for device loss put a clear cutoff point.
I just realized encoder commands already all validate the encoder state, so we could just as well check for device loss in there.
I think this validation was missing and makes the creation APIs consistent. I think the underlying APIs also require this but I haven't seen it spelled out.