I believe the intent of that is to serve as an internal spec convenience.
I'm not really convinced the use case of "we wish to wrap and supplement device events" is something designed to be supported in this regard, and using the notion of primary input devices to do so feels brittle. The proposal here solves the problem for these devices specifically, not in general.
I think wrapping until you know not to seems like an okay call to make.
I also think this can be solved in the Hands API via profiles: it does seem to make sense to expose "primary input capable hand" vs otherwise as a difference in the profiles string.
Unfortunately the current default hands profile is "generic-hands-select", which seems to imply a primary input action, not sure if we should change the default or do something else.
Thanks for the comment. With "wrapping" I don't mean "pretending this is a WebXR event" – I just mean: applications need to detect "hand selection" and that needs to work independent of whether the XRInputSource hand has a select event or not.
So to summarize:
If I was to add this to the spec, would this be a valid wording:
"Input sources should be treated as auxiliary until the first primary action has happened, then they should be treated as primary."
"Input sources should be treated as auxiliary until the first primary action has happened, then they should be treated as primary."
No, I don't think that's accurate. That is an engineering decision based on a specific use case and does not belong in the standard.
applications need to detect "hand selection" and that needs to work independent of whether the XRInputSource hand has a select event or not.
I guess part of my position is that platforms like Vision should expose a select event if that is part of the OS behavior around hands. It's not conformant of them to not have any primary input sources whatsoever: devices with input sources are required to have at least one primary one.
There's little point attempting to address nonconformance with more spec work.
There's a valid angle for devices that have a primary input but also support hands (I do not believe that is the case here). In general this API is designed under the principle of matching device norms so if a device doesn't typically consider hand input a selection then apps shouldn't either, and apps wishing to do so can expect some manual tracking. That's a discussion that can happen when there is actually a device with these characteristics.
That is an engineering decision based on a specific use case
I disagree – the spec notes what auxiliary and primary input sources are but does not note how to distinguish between them. That makes it ambiguous and impossible to detect what is what.
It's not conformant of them to not have any primary input sources whatsoever
I agree and believe this is a bug in VisionOS; however, their choice may be to expose a transient pointer (with eye tracking) later (which would be the primary input source) and people still want to use their hands to select stuff.
In that case there could even be multiple input sources active at the same time – the transient one and the hand – and there would still need to be a mechanism to detect which of these is a "primary" source and which not.
I disagree – the spec notes what auxiliary and primary input sources are but does not note how to distinguish between them
The spec is allowed to have internal affordances to make spec writing easier. A term being defined has zero implication on whether it ought to be exposed. Were "it's defined in the spec" a reason in and of itself to expose things in the API then a bunch of the internal privacy-relevant concepts could be exposed too.
The discussion here is "should the fact that a hand input can trigger selections be exposed by the API". If tomorrow we remove or redefine the term from the spec, which we are allowed to do, that wouldn't and shouldn't change the nature of this discussion, which is about functionality, not a specific spec term.
however, their choice may be to expose a transient pointer (with eye tracking) later (which would be the primary input source) and people still want to use their hands to select stuff
I addressed that in an edit to my comment above: in that case the WebXR API defaults to matching device behavior, and expects apps to do the same. There's a valid argument to be made about making it easier for apps to diverge, but I don't think it can be made until there is an actual device with this behavior, and it is against the spirit of this standard so still something that's not a slam dunk.
Unfortunately the current default hands profile is "generic-hands-select", which seems to imply a primary input action, not sure if we should change the default or do something else.
In visionOS WebXR the profiles array for the hand is ["generic-hand"], because it does not fire a select event.
@AdaRoseCannon should we update the spec to include that and allow it as an option?
That might be sensible. It's odd because generic-hand is already included in the WebXR input profiles repo.
@AdaRoseCannon thanks for clarifying! The spec notes that
The device MUST support at least one primary input source.
but it seems that hands are the only input source on visionOS WebXR, and it's not a primary input source. Am I missing something?
I actually think that line should probably be changed. Not all devices have input sources in the first place, and that's otherwise spec conformant.
I think it should instead be "for devices with input sources, at least one of them SHOULD be a primary input source"
I don't think we need to change the primary input source requirement, simply because it should be valid to have the primary input source be transient. (This is the case for handheld AR devices, IIRC). It's somewhat unique for a device like the Vision Pro to expose persistent auxiliary inputs and a transient primary input, but I don't think that's problematic from a spec perspective. It may break assumptions that some apps have made.
I remember discussing the reasons why the hands weren't considered the source of the select events with Ada in the past and being satisfied with the reasoning, I just don't recall it at the moment.
Looking at our code, we emit "oculus-hand", "generic-hand" and "generic-hand-select". Does VSP just emit "generic-hand"? Is Quest browser still allowed to emit "generic-hand"?
@cabanier continuing that discussion on the PR
@cabanier Yes, I can confirm that AVP only returns "generic-hand".
@toji To the best of my understanding, the AVP currently does not have "persistent auxiliary inputs and a transient primary input"; there is no primary input as far as I'm aware. The assumption it breaks is that there isn't any primary input source (a MUST as per the spec, at least right now).
@Manishearth's new PR allows for both profiles to be exposed. This matches both implementations, so I'm good with that change. This will allow you to disambiguate between VSP and other browsers.
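For illustration, a sketch (not normative; the exact profile strings depend on the browser, per the discussion above) of how an app could use the profiles array to guess whether a hand input source is expected to fire select events:

```js
// Sketch: infer select capability from the profile strings discussed in
// this thread. Assumption: "generic-hand-select" implies a primary select
// action, while plain "generic-hand" does not.
function handLooksSelectCapable(inputSource) {
  if (!inputSource.hand) return false;
  return inputSource.profiles.includes('generic-hand-select');
}
```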
The assumption it breaks is that there isn't any primary input source (a MUST as per the spec, at least right now).
That conflicts with my understanding of the input model from prior conversations with @AdaRoseCannon. That said, I haven't used the AVP yet and it may have been that our discussion centered around future plans that have not yet been implemented. Perhaps Ada can help clarify?
In the initial release of visionOS there was no primary input source, visionOS 1.1 beta (now available) has transient-pointer inputs which are primary input sources.
In the initial release of visionOS there was no primary input source, visionOS 1.1 beta (now available) has transient-pointer inputs which are primary input sources.
Interesting! We have some devices here that we'll update to visionOS 1.1 beta.
Do you have any sample sites that work well with transient-pointer? We have it as an experimental feature and if it works well, we will enable it by default so our behavior will match.
A THREE.js demo which works well is: https://threejs.org/examples/?q=drag#webxr_xr_dragging but don't enable hand-tracking since THREE.js demos typically only look at the first two inputs and ignore events from other inputs.
Brandon's dinosaur demo also works well, although similar caveat.
I just tried it and created a recording: https://github.com/immersive-web/webxr/assets/1513308/e1247e4b-1985-4a0e-a562-51d6aeb65f06
I will see if it matches Vision Pro.
THREE.js demos typically only look at the first two inputs and ignore events from other inputs.
Are you planning on exposing more than 2 input sources?
I've been thinking about doing the same since we can now track hands and controllers at the same time. I assumed this would need a new feature, or a new secondaryInputSources attribute.
This is getting a little off topic for the thread, but would you want to expose hands and controllers as separate inputs? A single XRInputSource can have both a hand and a gamepad.
(EDIT: I guess the input profiles start to get messy if you combine them, but it still wouldn't be out-of-spec)
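For illustration, a minimal sketch of an input source carrying both data streams (this only shows that the two attributes are not mutually exclusive per the spec, not how any current browser behaves):

```js
// Sketch: an XRInputSource may expose both a hand and a gamepad.
function describeInputSource(inputSource) {
  const parts = [];
  if (inputSource.gamepad) {
    parts.push(`gamepad with ${inputSource.gamepad.buttons.length} buttons`);
  }
  if (inputSource.hand) {
    parts.push(`hand with ${inputSource.hand.size} joints`);
  }
  console.log(`${inputSource.handedness} input source:`, parts.join(' + ') || 'pose only');
}
```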
I believe so, because if you expose hands and a transient input source, it would be weird if the ray space of the hand suddenly jumped and became a transient input source.
I just tried it and created a recording
Looks correct to me.
Are you planning on exposing more than 2 input sources? I've been thinking about doing the same since we can now track hands and controllers at the same time. I assumed this would need a new feature, or a new secondaryInputSources attribute.
In visionOS 1.1, if you enable hand-tracking then the transient-inputs appear after the hand-inputs, i.e. as elements 2 and 3 in the inputSources array.
I assumed this would need a new feature, or a new secondaryInputSources attribute.
We have events for new inputs being added which can be used to detect the new inputs. I personally don't believe we need another way to inform developers to expect more than two inputs.
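A minimal sketch of that event-driven pattern (assuming a `session` variable for the active XRSession; not tied to any particular engine):

```js
// Sketch: react to input sources being added/removed instead of assuming
// a fixed array of exactly two entries.
session.addEventListener('inputsourceschange', (event) => {
  for (const source of event.added) {
    console.log('added:', source.handedness, source.targetRayMode, source.profiles);
  }
  for (const source of event.removed) {
    console.log('removed:', source.handedness, source.targetRayMode);
  }
});
```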
I assumed this would need a new feature, or a new secondaryInputSources attribute.
We have events for new inputs being added which can be used to detect the new inputs. I personally don't believe we need another way to inform developers to expect more than two inputs.
I was mostly concerned about broken experiences. I assume you didn't find issues in your testing?
@AdaRoseCannon Are there any experiences that work correctly with hands and transient input? @toji Should we move this to a different issue?
I worry that adding inputsources is confusing for authors and might break certain experiences.
Since every site needs to be updated anyway, maybe we can introduce a new attribute (secondaryInputSources?) that contains all the input sources that don't generate input events.
/agenda should we move secondary input sources to their own attribute?
I think there are a few cases where it won't be clear which thing is "secondary" and it highly depends on the application.
Example: if Quest had a mode where both hands and controllers are tracked at the same time, there could be up to 6 active input sources.
I think instead of a way to see which input sources may be designated "primary" or "secondary" by the OS, it may be better to have a way to identify which input events are caused by the same physical action (e.g. "physical left hand has caused this transient pointer and that hand selection") so that application developers can decide if they want to e.g. only allow one event from the same physical source.
I don't think it's enough to disambiguate the events. For instance, if a headset could track controllers and hands at the same time, what is the primary input?
If the user is holding the controllers, the controllers are primary and hands are second. However, if they put the controllers down, hands become the primary and controllers are now second.
WebXR allows you to inspect the gamepad or look at finger distance so we need to find a way to let authors know what the input state is. Just surfacing everything will be confusing.
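For reference, "looking at finger distance" usually means a pinch check against the hand joints, roughly like this sketch (the 2 cm threshold is an arbitrary assumption):

```js
// Sketch: detect a pinch by measuring the distance between the thumb tip
// and index finger tip joints of a hand input source.
function isPinching(frame, inputSource, referenceSpace) {
  const hand = inputSource.hand;
  if (!hand) return false;
  const thumb = frame.getJointPose(hand.get('thumb-tip'), referenceSpace);
  const index = frame.getJointPose(hand.get('index-finger-tip'), referenceSpace);
  if (!thumb || !index) return false;
  const dx = thumb.transform.position.x - index.transform.position.x;
  const dy = thumb.transform.position.y - index.transform.position.y;
  const dz = thumb.transform.position.z - index.transform.position.z;
  return Math.hypot(dx, dy, dz) < 0.02; // ~2 cm, arbitrary threshold
}
```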
Since every site needs to be updated anyway,
Hold on, does it? I don't think we're requiring any major changes here.
Since every site needs to be updated anyway,
Hold on, does it? I don't think we're requiring any major changes here.
AFAIK no site today supports more than 2 input sources, so they need to be updated to get support for hands and transient-input.
The primary issue here is that libraries haven't been following the design patterns of the API.
The API fires select events on the session, and if you're listening for that and firing off interactions based on it then you'll be fine for the types of interactions the Vision Pro is proposing (because it's fundamentally the same as how mobile AR input works today). But if you've abstracted the input handling to surface select events from the input sources themselves AND trained your users through example code and library shape to generally only bother to track two inputs at a time (as Three has done) then you're going to have a bad time.
It's truly unfortunate that we've ended up in a situation where most content has optimized itself for a very specific set of inputs when the API itself is ostensibly agnostic to them (and I'm as guilty as anyone when it comes to the non-sample content I've built) but I don't think that we should be OK with making breaking changes to the API because of the choices of the libraries built atop it.
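For context, the session-level pattern described above looks roughly like this (a sketch; `session`, `referenceSpace` and `handleSelection` are assumed to exist in the app):

```js
// Sketch: listening on the session is agnostic to whether the select came
// from a controller, a hand, or a transient pointer.
session.addEventListener('select', (event) => {
  const pose = event.frame.getPose(event.inputSource.targetRaySpace, referenceSpace);
  if (pose) {
    handleSelection(pose.transform); // assumed app-specific helper
  }
});
```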
I agree: if content isn't following the spec's model of things (as it can choose to do), I don't think adding more things will make it change its mind on that. Content had the option to treat these things in a more agnostic way; it still does.
The primary issue here is that libraries haven't been following the design patterns of the API.
Indeed, libraries such as aframe have been generating their own events based on either the gamepad or finger distance. Nobody looks at more than 2 inputs, so every experience that requests hands will be broken on the new Vision Pro release. The hands spec mentions that it can be used for gesture recognition, so we can't really fault developers for using it as a design pattern.
(By "broken" I mean that Vision Pro's intent to use gaze/transient-input as the input is not honored)
It's truly unfortunate that we've ended up in a situation where most content has optimized itself for a very specific set of inputs when the API itself is ostensibly agnostic to them (and I'm as guilty as anyone when it comes to the non-sample content I've built) but I don't think that we should be OK with making breaking changes to the API because of the choices of the libraries built atop it.
My point is that things are already broken. If an experience requests hands and does its own event generation, it will be broken once Vision Pro ships the next version of its OS. I'm seeing that all the developers on Discord are updating their experiences to do their own event generation, and that new logic will break in the near future because input is supposed to come from gaze.
My proposal to move secondary inputs to their own attribute will fix this and reduce confusion about what the primary input is. (See my hands and controllers example above.) The only drawback is that existing experiences that request hands will only have transient-input.
The primary issue here is that libraries haven't been following the design patterns of the API.
As both library implementor and library user, I can only partially agree. Yes, three.js handles it very minimalistically, as they often do, and that has already caused a number of problems (that are often promptly resolved when they actually happen). Needle Engine for example handles any number of inputs, so I believe the next AVP OS update will "just work" for the most part.
However, there are cases I don't think the spec and API explain or handle.
For example, the spec does not state that there is always an exact mapping of "one physical thing must only have one primary input source"; there could be more than one select event caused by the same physical action ("bending my finger") as per the spec, even if no device (that I'm aware of) does this today. I'm not sure if this is intended, and I'm not sure how anyone could build something entirely "future-proof" given this ambiguity.
I understand that cases like this are seen as "out of scope" for the spec, since they can be implemented on top of what the API returns. Yet, library users expect those cases to be handled or at least want to understand how to handle them. I don't think that counts as "not following the design patterns".
I understand that cases like this are seen as "out of scope" for the spec, since they can be implemented on top of what the API returns. Yet, library users expect those cases to be handled or at least want to understand how to handle them. I don't think that counts as "not following the design patterns".
I agree. Putting every tracked item in inputSources and leaving it up to authors is not a good indication of how to handle multiple tracked items. (Basing it on the name of the input profile feels like a hack.)
Even the name "inputSources" is confusing, since on Vision Pro hands are NOT considered input; gaze is. Likewise on Quest, if you hold controllers, your hands are NOT input, and if you put the controllers down, the controllers should stop being input.
Maybe instead of secondaryInputSources, we should call it trackedSources.
As an experiment, I added support for detached controllers to our WebXR implementation so you will now always get hands and controllers at the same time in the inputSources array.
I can report that every WebXR experience that I tested and that uses controllers was broken in some way with that change. Some only worked if I put the controllers down, others rendered the wrong controllers on top of each other, and a couple completely stopped rendering because of JavaScript errors.
This is a clear indication that we can't just add entries to inputSources.
I guess I'm confused by how that's supposed to improve the situation vs. where we're at now. If we continue to use the input system as originally designed many apps will need to update their input handling patterns to account for new devices. If we introduce a new secondary input array... many apps will still need to update their input handling patterns to account for new devices?
I guess I'm confused by how that's supposed to improve the situation vs. where we're at now. If we continue to use the input system as originally designed many apps will need to update their input handling patterns to account for new devices. If we introduce a new secondary input array... many apps will still need to update their input handling patterns to account for new devices?
I'm saying: if we continue to use the input system as originally designed, many apps will break.
I want to add support for concurrent hands and controllers but I can't make a change that breaks every site.
I understand that position, but I'm trying to consider both the Quest's and the Vision Pro's use cases.
Quest, by virtue of being a first mover in the space and the most popular device to date, has a lot of existing content built specifically to target it using abstractions that only really panned out for systems with Quest-like inputs. It's understandable that you're reluctant to break those apps. And I'm not suggesting that we do break them! (I still feel like we can and should expose hand poses and gamepad inputs on the same XRInputSource, but that's a slightly different topic).
An input system like Vision Pro's, however, will already be broken in those apps from day 1, so it's not a choice between breaking apps or not. They're just broken. So unless pages want to ignore Vision Pro users (or have been effectively abandoned by their creators, which we know is common) they'll have to update one way or the other. If updates are going to be mandatory to work on a given piece of hardware then I'd rather not invent new API surface to support it if what we already have serves the purpose.
Now, put bluntly I think that this is something Apple brought on themselves. I'm not a big fan of the limitations imposed by their input system, even if I understand the logic behind it. And I do think that if compatibility with existing apps is a high priority for Safari then there's probably reasonable paths that can be taken to introduce a not-particularly-magical-but-at-least-functional mode where hands emulate single button controllers. But those types of decisions aren't the sort of thing that this group is in the business of imposing on implementations.
An input system like Vision Pro's, however, will already be broken in those apps from day 1, so it's not a choice between breaking apps or not. They're just broken.
I don't believe that is the case. Only sites that request hand tracking will be broken since they won't look at more than 2 input sources. My proposal will fix these sites because now hands will not be in the inputSources array anymore. Those sites should now work with transient-input, although they would no longer display hands.
So unless pages want to ignore Vision Pro users (or have been effectively abandoned by their creators, which we know is common) they'll have to update one way or the other. If updates are going to be mandatory to work on a given piece of hardware then I'd rather not invent new API surface to support it if what we already have serves the purpose.
How do you propose that I surface concurrent hands and controllers? How can I indicate to the author whether hands or the controllers are the primary input?
Now, put bluntly I think that this is something Apple brought on themselves. I'm not a big fan of the limitations imposed by their input system, even if I understand the logic behind it. And I do think that if compatibility with existing apps is a high priority for Safari then there's probably reasonable paths that can be taken to introduce a not-particularly-magical-but-at-least-functional mode where hands emulate single button controllers. But those types of decisions aren't the sort of thing that this group is in the business of imposing on implementations.
Correct. Quest surfaces hands as single button controllers if hand tracking is not requested and this seems to work on a majority of sites.
How do you propose that I surface concurrent hands and controllers
They should be the same input source with a hand and a gamepad attribute, yes? The spec was designed with this use case in mind.
How do you propose that I surface concurrent hands and controllers
They should be the same input source with a hand and a gamepad attribute, yes? The spec was designed with this use case in mind.
No, they are different input sources. 2 hands and 2 controllers.
Oh, I understand now. Not just the case of a hand grasping a controller.
Re: surfacing both hands and controllers at the same time.
I was listening to a podcast today that described the new Meta feature Rik has been referring to and they brought up a hypothetical use case of someone strapping controllers to their feet in order to have both hands and tracked feet in something like VRChat. I also extrapolate that out to accessories or attachments where the controller is somewhere other than in the user's hand.
While that would be technically tricky to pull off and I certainly expect it to not be used that way in the majority of cases, it did solidify in my mind the type of scenario where you really do want to treat the hands and the controllers as completely separate input sources, and not just two different data streams on a single source.
Given that, I still do have questions about how an input is determined to be "primary" or "secondary" by the system. I guess that if controllers are present they would generally be considered to be the primary input, though that assumption breaks in the (fairly unorthodox) leg tracking scenario mentioned above.
I also wonder if it's enough from a compatibility standpoint to simply update the spec and state that any primary inputs should appear before any secondary inputs in the input array? (This is distinct from any potential identifiers that might be added to the input object itself). That could get messy with devices like the Vision Pro, though, in which the primary transient input firing a select event might trigger a flurry of devices being removed, added, removed, and added once again.
Re: surfacing both hands and controllers at the same time.
I was listening to a podcast today that described the new Meta feature Rik has been referring to and they brought up a hypothetical use case of someone strapping controllers to their feet in order to have both hands and tracked feet in something like VRChat. I also extrapolate that out to accessories or attachments where the controller is somewhere other than in the user's hand.
One other useful feature is that you can still draw the controllers when you're using hands. Otherwise, if you want to pick the controllers up again, you have to remember where you put them, or lift the headset up to see them.
Given that, I still do have questions about how an input is determined to be "primary" or "secondary" by the system. I guess that if controllers are present they would generally be considered to be the primary input, though that assumption breaks in the (fairly unorthodox) leg tracking scenario mentioned above.
The system knows if you're touching the controllers with your hands. If you're not touching them, hands become the primary input.
I also wonder if it's enough from a compatibility standpoint to simply update the spec and state that any primary inputs should appear before any secondary inputs in the input array? (This is distinct from any potential identifiers that might be added to the input object itself). That could get messy with devices like the Vision Pro, though, in which the primary transient input firing a select event might trigger a flurry of devices being removed, added, removed, and added once again.
Adding and removing controllers is not compatible with current WebXR implementations. We had to remove simultaneous hands and controllers last week because so many websites broke :-\
Options are:
I would even consider that controllers must be static by default (ie there's always only one or two controllers) since so much content is relying on that behavior.
Thanks for the clarifications!
Adding and removing controllers is not compatible with current WebXR implementations.
Could you expand on that a bit? The spec certainly allows for it, and I believe that at the very least the Blink code in Chromium will handle the events properly. Are you referring to how the input devices are detected on the backend, or do you mean that libraries are ignoring the input change events?
Adding and removing controllers is not compatible with current WebXR implementations.
Could you expand on that a bit? The spec certainly allows for it, and I believe that at the very least the Blink code in Chromium will handle the events properly. Are you referring to how the input devices are detected on the backend, or do you mean that libraries are ignoring the input change events?
Sorry, I meant to say "WebXR experiences". AFAIK browsers are doing the right thing. A good number of experiences expect 2 controllers that are always connected.
Developers use engines' recommended patterns for using the WebXR APIs with regard to input sources. Some engines provide a good async abstraction over input sources, some don't.
In PlayCanvas I've designed an async approach: while you can statically access a list of current input sources, developers are encouraged to use an event-based approach and react to input sources being added/removed. Based on input source capabilities (target ray mode, handedness, hand, etc.), developers then either add models, do raycasts, etc.
That way PlayCanvas apps actually work pretty well with multimodal input, various cases of controllers being switched to hands in real time, and any other async add/remove scenarios.
Having hands and controllers at the same time would work also.
There are some potential issues I can see:
If Meta sees that a lot of experiences would be broken that way, it could be an additional session feature, so that developers can opt in to it. Of course, on by default would be better, but as mentioned above it has consequences.
Also, it would be very useful to know if a hand-type input source is holding a controller and if a controller is being held, and to have information about the relation between input sources, e.g. inputSource.related: XRInputSource | null
There is definitely value in providing all information about input sources and hands at the same time. Other input trackers would be awesome too! This opens possibilities for experiments and more creative uses of controllers.
@toji, @mrdoob, @AdaRoseCannon, Brandel and I had a meeting this week to do a deep dive into this problem space.
I volunteered to update the spec with the trackedSources attribute. It will be discussed at the face-to-face in Bellevue at the end of March.
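A purely hypothetical sketch of how the proposed (not yet specified) trackedSources split might look from an app's point of view:

```js
// Hypothetical: trackedSources does not exist in any spec or browser yet.
// The idea: inputSources would contain only sources that generate primary
// input events, while tracked-but-silent items (detached controllers, or
// hands on devices where hands don't select) would live in trackedSources.
for (const source of session.inputSources) {
  handlePrimaryInput(source); // assumed app helper
}
for (const tracked of session.trackedSources ?? []) {
  renderTrackedDevice(tracked); // assumed app helper
}
```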
The spec just states the definitions of auxiliary and primary input sources, but it does not provide a mechanism for applications to query whether an XRInputSource supports a primary action.
Is there such a mechanism, and if not, what is the recommended approach for applications to distinguish between auxiliary and primary input sources?
Use case description:
- On devices where hands emit select events (e.g. Quest), hands are a "primary input source" there.
- On devices where hands do not emit select events (e.g. visionOS), hands are an "auxiliary input source" there.

Potential workaround: wrap hand input and emit custom selection events; once an input source emits its first selectstart or squeezestart event, mark it as primary and stop emitting wrapper events. While this would kind of work, it still has a risk of sending duplicate events the first time.
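A minimal sketch of that workaround (isPinching, emitWrapperSelect and referenceSpace are assumed app-level helpers/state, not spec mechanisms):

```js
// Sketch: treat hand input sources as auxiliary and synthesize selection
// events until the platform proves a source is primary by firing a real
// selectstart/squeezestart event.
const knownPrimary = new WeakSet();
session.addEventListener('selectstart', (e) => knownPrimary.add(e.inputSource));
session.addEventListener('squeezestart', (e) => knownPrimary.add(e.inputSource));

function onXRFrame(time, frame) {
  for (const source of session.inputSources) {
    if (source.hand && !knownPrimary.has(source)) {
      // Still auxiliary as far as we know: emit our own wrapper event,
      // e.g. based on a pinch gesture.
      if (isPinching(frame, source, referenceSpace)) {
        emitWrapperSelect(source);
      }
    }
  }
  session.requestAnimationFrame(onXRFrame);
}
session.requestAnimationFrame(onXRFrame);
```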