Moving the interpretation of common higher-level semantic events down into the user agent is clearly necessary to support the creation of XR apps that will work across the many existing platforms and, going forward, on new ones.
One thing that's not clear to me is how we pick the small set of initial semantic events and how we extend that set in the future. How do we draw the line between the initial set of events and the events that also feel somewhat inevitable, like secondary selection?
This is looking really good -- I feel it captures well most of the concepts from our earlier iterations and meetings.
I have just a couple small suggestions:
In the case where a user begins a gesture (i.e. presses a button down) and the content loses VR focus or positional tracking before the button is released, the user likely does not want the gesture to be acted upon. Rather than emitting an onselectend event, the browser could emit an onselectcancel event. Alternately, we could still fire onselectend but include a "reason" attribute.
"out", "back", "undo", "exit", "home", "cancel" seem to be the second-most common interactions after "select" and could be handled by this event. Unlike select, this one would not have a raycast but would have onbackstart, onbackend, and onback events. Having start/end for this would enable long presses to go "home" while short pressed to go "back" for example.
It looks good to me. @toji, how do you see the proposed XRInputSource API evolving to accommodate input data beyond buttons and axes, like, for instance, a fully tracked hand or body suit?
If the controller does not have any spare physical buttons to represent the "onback" event, it could recognize this in other ways, enabling consistency across all WebVR experiences. Examples:
@TrevorFSmith: This is probably the question that's caused me more stress than anything else lately. The most straightforward answer I've come up with is that if we're concerned about the divide we can jump directly to the getAction approach I outlined in the "Future directions" section and simply start with only one action. In order to feel confident about that approach, though, I think we'd want to be confident that it was going to be the solution long term, and I'm simply not yet. That could change with some API iteration.
I brought this up with @bfgeek prior to posting this issue, and his suggestion was just to do the simple, straightforward thing for the moment and not overcomplicate things. "Primary action" is a fundamental enough concept that even if it ends up as the lone input holdover alongside a more robust future system, it won't look too weird.
@dmarcos: My gut tells me that body suits are not applicable here. For one, there's no hardware in wide use that does it yet, and thus the interfaces to it are undefined. You could speculate that it looks something like a Kinect skeleton, but that may not be the end product for a variety of reasons. Two, even if you have full body tracking, it's unlikely that developers want to treat a full body as equatable to existing controllers, which are almost always tracking hands in one form or another. I can envision a system where XRSkeleton and XRInputSource live alongside each other, feeding off of the same underlying data: one for tracking the user's full pose and one that abstracts their hands into a more traditional pointing device for compatibility.
As for hand tracking, which I view as a more specialized case: we did talk about that in the context of the previous explainer, and the hope was that we could attach a nullable hand skeleton directly to the XRInputSource when applicable, which could feasibly represent either Leap Motion-style visual tracking or Oculus Touch-style pose estimation. Doesn't seem appropriate for a v1 feature, but I do like having an idea of where that data could be attached.
@kearwood: +1 to the idea of communicating that the input was canceled somehow, though I'm curious if it applies in any situation other than focus loss? Can't think of any at the moment. Also, I'm hesitant to create a bunch of separate events. I kind of like your suggestion of adding a reason, though I'd maybe take it a step further and have a more general selectchange event that includes the current state. (That's edging closer to the "Future directions" suggestion; not sure how far down that rabbit hole I want to fall.)
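Purely as a sketch of that direction (neither the event name nor its fields are settled, and abortCurrentGesture is a hypothetical app handler), it might be consumed like:

xrSession.addEventListener('selectchange', (ev) => {
  // Hypothetical fields: ev.state could be "start", "end", or "cancel",
  // and ev.reason could explain a cancellation (focus loss, tracking loss, etc.).
  if (ev.state === 'cancel') {
    abortCurrentGesture(ev.reason);
  }
});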
As for things like back/home/exit/etc., I'm more conflicted. They absolutely make sense for browsing mode, but in that case the UA should handle it entirely. In WebXR content it seems like a lot of them are going to be context-sensitive, and we don't have a reasonable way of knowing the context at this time. Certainly we can't be guaranteed that there will be enough discrete buttons to cover all of the desired functions, and if we enforce certain gestures we end up limiting the app's input possibilities. For example: I'd hate to rule out games that simulate swordplay because all of that rapid slashing was interpreted as a back gesture.
So I think that until we start dipping our toes into declarative land we should ensure that the developer is fully in control of how they interpret inputs, and that we simply surface them in as consistent a manner as possible. That said, making it easier to recognize certain common gestures (touchpad swipes, double clicks, etc.) seems like an unquestionably good idea. I'm not sure the browser can provide anything there that a well-built library can't, though. (This is, of course, assuming we make more of the input state available to the page.)
This looks really good.
For controllers you get a matrix, but if the controller only has rotation capabilities, should the browser return a matrix with only a rotation component? Or perhaps it should estimate the position based on a shoulder/arm rig once it knows which hand the user is holding the controller in? What are your thoughts on this?
@AdaRoseCannon: I agree that an arm model should implicitly be part of the matrix returned for 3DoF controllers. The only caveat is that we should ensure that the arm model doesn't alter the rotation of the controller, only its offset. That way developers who care exclusively about the rotation (driving games, etc.) can trivially zero out the translation components and be left with an accurate orientation. We've confirmed that this is how the Daydream arm models work, and I would imagine that GearVR is the same.
That said, we should probably indicate when the position is emulated rather than sensor-based. That was in the original explainer, but I took it out here for simplicity and because I hated the name. :wink: Maybe xrInputSource.positionEstimated?
As for which side the user's holding it on, I'd expect the platform to know what the preferred hand is, so we should be able to just use that.
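For example, a minimal sketch of stripping the arm-model offset back out, assuming a column-major 4x4 matrix in the WebGL convention (translation in elements 12-14):

function orientationOnly(gripMatrix) {
  // Copy the matrix and zero the translation, leaving the sensor-accurate rotation.
  const m = new Float32Array(gripMatrix);
  m[12] = 0;
  m[13] = 0;
  m[14] = 0;
  return m;
}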
Looks good! I think the getAction approach is the most extensible and should be the only way to receive interaction events from the inputs. The method could in the future return more than one type of action (interface), allowing a variety of input types with incompatible sets of attributes without requiring extensions to the core input source interface.
What I'm concerned about is how this proposal addresses the input mode where the input is continuous rather than a sequence of discrete events (actions, gestures).
@toji re: "+1 to the idea of communicating that the input was canceled somehow, though I'm curious if it applies in any situation other than focus lost?"
Some other situations:
Really like this. Big +1 to the arm model, and as long as v1 is extensible to all possible buttons on an input device, I think it's great
I like this proposal. I'm fine with adding it to the Gamepad API, though that API needs some work as well (which I've been digging into lately). I'm confused about how the developer can tell which sort of device the user currently has, and therefore if they should draw the gaze cursor and other interaction overlays. Does the developer look for the existence of the gripMatrix vs pointerMatrix objects?
I'm confused about how the developer can tell which sort of device the user currently has, and therefore if they should draw the gaze cursor and other interaction overlays.
This was definitely under-specified in my above text, and I'm not sure I'm totally satisfied with the implied proposal I made.
The code sample shows that if there are no tracked controllers then the device is implied to use a gaze cursor. That seems to fit for Cardboard-style devices and probably HoloLens use (even though it does provide limited hand tracking, it doesn't expose hand position frame-to-frame AFAIK). However, on further reflection this wouldn't really be appropriate for something like the GearVR's headset controls or the Oculus Remote. Those provide more than one-bit inputs, so presumably we'd like to expose them eventually, but they are not actually tracked independently from the headset. I guess there are a few different attitudes we could take regarding those:
- Include them in xrSession.getInputDevices(), adding data to indicate their tracking capabilities (or lack thereof).
- Reserve xrSession.getInputDevices() for tracked devices only; the select event could still fire.

I like the proposal and agree with some of @kearwood's comments to include the cancel event. Not sure about not using a different event handler, though; I think it makes it more consistent (although I understand not wanting to swamp the API with handlers).
Some notes:
If the type is InputSource (I prefer InputDevice, but I guess there is an argument that some types of inputs are not devices per se?), then the call should be getInputSources, right?
How will the app know what to render for the controller? For example, how does the app know that a Vive controller or an Oculus Touch controller should be rendered to match the type of device the user is holding? Or, since this is a high-level abstraction that does not provide access to other buttons, should all controllers be rendered using a generic 3D model?
This is looking really good, @toji ! Thanks for pulling the key parts of the original proposal forward.
@judax yeah the intent with the naming is to also be representative of hand input (such as on HoloLens). Also, we had an optional glTF blob off of InputSource that maybe got missed somewhere in the shuffle. @toji, was that on purpose, or can we just grab it and add it back in?
@kearwood As far as "cancel" goes, I agree that it's worth having!
It's great that we're taking a look at how we might integrate with Gamepad going forward to help ensure our plan is solid. At the same time, let's be careful not to go too far down the road of designing the future solution when the whole point of this new proposal is to address the fact that we aren't ready to do so yet ;)
As per the Implementers' Call:
We should ensure that content can know when it should draw a controller or not, so that controllers are not drawn on top of your physical hands/controllers in an AR environment.
I'd alter that statement a bit: We want to inform content about when it should draw controllers mostly to avoid having it draw a controller model for a gaze cursor. I'd argue that in an AR scenario (however it is that we identify that) we never want to draw the controller unless the intent is controller replacement. (And that's generally not something that can be done in a high quality way today). You would still likely want to draw the cursor/ray in an AR situation.
Following up with some more notes from yesterday's call: as Kip mentioned, one of the questions floating around is how we clearly communicate what the developer needs to draw in each situation. I get the feeling from the conversations we've had that an implicit guideline of "if A, B, and C are true, do this" may fall short of helping us provide consistent usage across sites.
Also, there's a question of what should show up in the array of returned devices. Tracked controllers seem like an obvious yes. I was under the impression that tracked hands might be a no, but Nell (who has more experience in that regard) says yes. There are differing opinions about whether or not a regular, non-tracked gamepad should show up in the list, with some suggesting that we'd want a way to explicitly associate them. And it seems most people (but not everyone) think that a touchscreen for a mobile-device magic window should not show up in the array, even though it would be generating events.
Input Variants
I think a useful exercise at this point is to list out as many different input examples as we can think of and propose how they would map to the system.
Scenarios I'm aware of (please respond with more if you can think of them):
For 3/6DoF tracked controllers you generally want to render controllers, pointers, and cursors (referring only to simple selection mode, obviously not to more specialized uses).
Untracked VR inputs across the board should only render a cursor, since they're all implicitly functioning as triggers for a gaze cursor AFAICT. Same for gamepads and vocal commands. We also want to be considerate of the fact that these may end up being more complex than just single buttons.
Touchscreens are unique in that there is no frame-to-frame data to be had, and no cursors should be rendered. The only time we know what the input ray is going to be is when the user's finger hits the canvas, and at that point rendering a cursor/pointer/controller is pointless because it would be obscured. Thus nothing should be rendered for touchscreen inputs.
Mouse input is interesting; Windows Mixed Reality has some affordances for this. In their usage the mouse cursor actually runs along the virtual surfaces, which is obviously not possible for the level of API we're talking about to handle opaquely. In previous conversations about this I believe we determined that if users wanted this behavior they could track mouse deltas themselves and cast a cursor into the world, probably while pointer locked (a rough sketch of that follows this list). Otherwise the mouse should act more or less like the touchscreen scenario above.
Honestly I'm not sure about tracked hands. HoloLens generally treats them as roughly equivalent to a Bluetooth clicker, while Leap Motion likes to think of them more like controllers. This is probably one of those situations where we really should be giving the user very direct guidance on platform conventions.
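Here's the rough sketch of the mouse-delta approach mentioned above, assuming pointer lock on the magic window canvas; the sensitivity value and the actual ray casting are left to the app:

let yaw = 0;
let pitch = 0;
const SENSITIVITY = 0.002; // arbitrary for this sketch

canvas.addEventListener('mousemove', (ev) => {
  if (document.pointerLockElement !== canvas) return;
  yaw -= ev.movementX * SENSITIVITY;
  pitch -= ev.movementY * SENSITIVITY;
  pitch = Math.max(-Math.PI / 2, Math.min(Math.PI / 2, pitch));
  // Each frame the app casts a ray from the view pose along (yaw, pitch)
  // and renders its own cursor where that ray intersects the scene.
});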
API representation
So, taking the above into account, in my opinion the question of what shows up in the input devices array boils down largely to "do I need to visually represent this input each frame?" (keeping in mind that the visual representation may simply be a gaze cursor). Which shakes out like this:
So now comes the really interesting question: if I'm on an Oculus Rift with Touch controllers AND I have an Oculus Remote kicking around, does it still show up as a third input? If so, does that mean that every WebXR page I ever visit has a persistent gaze cursor because the remote is paired (even though it's stuck in a drawer somewhere)?
Nell, Alex, and I had previously discussed this scenario and determined that the best course of action was to recommend that apps "mode switch" between gaze cursors and tracked pointers as necessary, depending on the last device you received input from (you can see this in the previous explainer). But I'd like to gather some more opinions on that, as it seems easy to get wrong. It does seem, however, like this may unavoidably be something that needs to be left up to the web app to sort out, and so it's our responsibility to give them enough info to make an informed decision. So what does that look like?
First off, I suggested earlier that we should indicate whether or not the positional element of the gripMatrix (if available) is emulated. I still think that's a decent idea, and a good way to denote the difference in capabilities between 6DoF and 3DoF without being overly reliant on those terms (because they may not always be strictly accurate).

Next, up-thread it was suggested that the presence of a non-null gripMatrix could imply which controllers should be rendered and which shouldn't. That seems sensible on the face of it, but I'm still a little concerned about the hand tracking scenario. Specifically: HoloLens can track hands, but the default behavior upon air tap is to select based on a gaze cursor (please correct me if I'm wrong, MS folks!). So differentiating between gaze/pointer input purely on the presence of a gripMatrix would either cause HoloLens to look like a more traditional controller up until the point the user air taps, or suppress a potentially useful piece of data (hand position).

So perhaps we should have an attribute that indicates where the pointerMatrix originates? I can see the HoloLens scenario giving both a grip and a pointer matrix, but having the pointer matrix originate at your head and follow your gaze, because that's what it will report when the select event is fired. It's not reasonable, in my opinion, to make developers constantly check the origin of the pointerMatrix to try and infer a relationship with the device pose, so a simple enum stating pointerOrigin: "head" or similar feels appropriate.
Revised IDL
So with all that, I'd offer that maybe the IDL for XRInputSource should look like this instead:
enum XRHandedness {
"",
"left",
"right"
};
enum XRPointerOrigin {
"head",
"grip",
"screen" // Input sources with this origin won't show in the array
};
interface XRInputSource {
readonly attribute XRHandedness handedness;
readonly attribute XRPointerOrigin pointerOrigin;
readonly attribute boolean emulatedPosition;
};
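To sketch how content might consume this (not part of the proposal itself; drawGazeCursor, drawController, and drawPointerRay are hypothetical app helpers, and the pose is assumed to carry the gripMatrix/pointerMatrix discussed above):

function representInputSource(inputSource, inputPose) {
  if (inputSource.pointerOrigin === 'head') {
    // Gaze-style input: draw only a gaze cursor, no controller model.
    drawGazeCursor(inputPose.pointerMatrix);
    return;
  }
  // "grip" origin: draw a controller (or hand) plus its pointer ray.
  // emulatedPosition distinguishes arm-model 3DoF from fully tracked 6DoF.
  drawController(inputPose.gripMatrix, inputSource.handedness,
                 inputSource.emulatedPosition);
  drawPointerRay(inputPose.pointerMatrix);
  // "screen" origin sources wouldn't appear in the array at all.
}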
TODO
I'm still not quite sure what the right way is to handle traditional gamepads, but if they do show up as XRInputSources I think all of the above applies cleanly to them.
I assume this is intentional, but are we thinking about haptic support for v1?
In order to keep input in WebVR accessible, we should consider creating a fallback to Gamepad API input mapping. 6DoF controllers like Oculus Touch "are inherently not accessible," said Josh Straub, Editor-in-Chief of D.A.G.E.R System (@DAGERSYSTEM), at OC4.
Suggesting basic standardized gestures exposed through the API will guide developers to consider platform independent interaction and will empower a broader range of users and gamepad/controller makers/modders to be part of the industry. Here are links to organizations that create custom controllers for special needs children and people with physical disabilities.
I presented an approach of creating a "lowest common denominator" for input controls in VR at the W3C Authoring in VR Workshop in Brussels two months ago - here is a link to my presentation
I am not seeing a fallback pattern across all the different input variants, but maybe I am missing something?
A question about 3DOF controllers, such as the Daydream or GearVR controller: should the elbow model be applied by the WebXR implementation (by the browser)? Do we need to distinguish between 6DOF controllers and 3DOF controllers somehow? Also, there is no way to get a name for the controller (probably due to fingerprinting issues), meaning we can't even render the proper mesh for it. Am I correct?
@Artyom17: I'm proposing that yes, we do provide an elbow model for 3DoF controllers, and that we indicate it with the emulatedPosition boolean on the XRInputSource. We would want to introduce a constraint that the elbow model must only affect the translation, not the rotation, so that it's easy to strip out of the pose if the developer needs to. That shouldn't be problematic based on what I know of the Daydream and GearVR models. (emulatedPosition would also become the way to differentiate between 3DoF and 6DoF.)
And yes, I haven't included a controller name here. I'm happy to discuss whether we need one or not, but I'm leaning towards avoiding them if we can. Fingerprinting is a concern, yes. (And it gets a bit silly if we say "I won't tell you the name of the headset, but its controller is called an 'Oculus Touch'.") More than that, though, I saw LOTS of WebVR apps that checked for controller names containing, for example, the string "Vive" and ignored everything else. That's definitely not a pattern we want to encourage. It does prevent developers from looking up the right mesh, though.
I think we could get by at first by having our samples use some sort of generic VR controller mesh that's not obviously representative of any particular hardware, and encouraging developers to use something that's contextually relevant (like a remote control for video players, a gun or sword for action games, or a paintbrush for art apps). Extending the API to include a way to return meshes would be really good in the future, but I no longer see it as critical (or practical) as part of v1.
@toji I am still not sure how that's supposed to work without a way to identify the controller(s). All of them are different, with different shapes, sets of buttons, triggers, joysticks, etc. Even just to make an instruction screen for a hypothetical game, where you'd try to explain which buttons to press: how is that supposed to work? List all the possible controllers?
And yet, the Gamepad API has the 'id' of the controller, and if we continue to use it then our worries about fingerprinting are not justified...
Speaking of handling buttons, triggers, etc.: it seems like the current proposal limits the choice to a single trigger ('select'), am I right? Isn't that going to be a downgrade from the current state?
Some thoughts:
- The arm model would ideally be taken from the system (e.g. GearVR or Daydream) for consistency with non-Web apps, and then passed through the browser.
- A-Frame started with a generic controller, and IMO it was quickly replaced by specific controller models both for consistency with non-Web apps and for user comprehension, as @Artyom17 alludes to above.
- Supporting real apps needs more than a single select action; the systems teach users conventions and combinations, IMO. For current 3DOF one might argue for at least select and back/"menu"; for current 6DOF one might argue for select (trigger), grip, and back/"menu", but there is usually at least a fourth, which is the stick or pad, especially for teleporting.
- Maybe per-site user permissions would be needed before providing the equivalent of Gamepad API information to avoid fingerprinting concerns, but as a practical matter users would likely be forced to accept, and that may represent an undesirable point of friction entering VR experiences... and if detail such as the grip matrix is freely provided, that may be just as problematic for fingerprinting anyway.
@Artyom17: Yes, this would be a deliberate downgrade from the amount of raw data we exposed previously (though making use of that data was hard to do in a generalized way.) The intent would not be to leave it at that, though. We'd want to extend the API soon to expose as much of the full input state as possible, but it would be really good to know how some of the backing native APIs are actually going to work before committing to anything.
@machenmusik: Agreed the arm model should be lifted directly from the native API whenever possible. As for your concerns about more robust input needs I do agree, but it's difficult to do that in a way that's not immediately exclusionary to new hardware. The A-Frame issue you referenced this one in is a perfect example. Oculus Go should be able to be treated as a drop-in replacement for GearVR, with maybe a different controller model. But because A-Frame relies on the controller name to get accurate input mapping it can only support new hardware when the developers specifically add support for them. This isn't a huge deal for something with the expected popularity of a new Oculus headset, but for the long tail of hardware that the A-Frame devs never get their hands on it's not realistic. We'd like to establish a universal baseline first, then add more in-depth capabilities.
(Totally agree that the need for accurate controller models is a big deal, BTW. I don't want to make it sound like I don't care. I just don't see a good way to do it as part of the version 1 API unless we expose controller name strings, and that immediately leads to the hardcoding behavior we're trying so hard to avoid.)
Thanks @toji.
w.r.t. more robust input needs, I do understand the balance with works-by-default, but I am not sure that having only one prescribed button is enough expressiveness.
Looking at the various controllers out there, the only combination that can be said to have one "button" (when it works) is Cardboard, and that isn't actually a controller. From Rift to Vive to WinMR to GearVR to Daydream, every single 3DOF/6DOF controller has at least three buttons, one of which is reserved for the system menu and one of which is intended for the app menu and/or back.
Perhaps the browser should allow the app menu button to be used by the Web app, while reserving some behavior to ensure its menus can be invoked when needed, and the generic description can expose two actions rather than one.
If we were trying to define a basic generic set for XR controllers, off the top of my head I'd propose something like this:
This would allow one to support minimal state, but allow a little more expressiveness from capable controllers without diving into customizations.
I suspect that pointerOrigin (gripMatrix) and emulatedPosition may prove to be enough to crudely distinguish models if fully implemented, although that is probably bad news from a fingerprinting perspective.
FYI, for anyone that's been participating in this conversation: A related pull request for the explainer has been available for a week and a half over at #325, so please take a look if you haven't already.
The basic input PR has now been merged, so closing this. If you have more specific issues with the input system please file them as new issues.
Background
A good chunk of this is a re-focusing of the previously produced Input Explainer, so nothing here should really be surprising.
For those who don't already know, we are unable to continue with the design put forward in that document because of concerns about compatibility with upcoming (and, as of this writing, unreleased) VR/AR standards. I won't dive into a comprehensive evaluation of those incompatibilities, due to the unreleased status of the standard in question, but will broadly note that it's not known at this time whether it will allow the full input device state to be queried in the manner the previous explainer would require.
Given that, and given that we would prefer to have users begin using the WebXR Device API as soon as is reasonable without being blocked on third parties, I propose that we re-focus on exposing a minimal but broadly compatible subset of the previously discussed functionality and have some clear ideas of how it could evolve to fit a variety of underlying input systems in the future.
Requirements
The "simple" proposal from the previous explainer was one that just allowed developers to listen for basic point-and-click events from an source of VR input, which is enough to enable basic button-based UIs. This is "good enough" for video players, galleries, some simple games, etc. It is insufficent for more complex uses like A-Painter style art apps, complex games, or really anything that involves direct manipulation of objects.
That's regrettable, but a limitation that I feel is worth accepting for the moment in order to enable the significant percentage of more simplistic content that we see on the web today.
So, what we need to enable that level of input is:
I would also propose that, since this would be all we offer initially, we make this just a teensy bit more useful and future-proof by adding:
This would allow a bit more nuance in the interactions allowed by the system, giving the option to drag items around, for example.
Proposal
I find it easier to talk about these things when looking at an interface, so I'll start with a proposed IDL:
Tracking and rendering
Let's dive into tracking first, since it's relatively straightforward.
- xrSession.getInputDevices() returns a list of any tracked controllers. This does not include the user's head in the case of gaze-tracking devices like Cardboard. By themselves these objects do basically nothing useful.
- Each frame the developer can iterate through the list and call xrFrame.getInputPose(inputDevice[i], frameOfReference); to get the pose of the input in the given coordinate system, synced to the head pose delivered by the same frame. This can be used to render some sort of input representation frame-to-frame. (Note: I'm not including anything that describes a controller mesh, for practicality reasons. We can investigate that later. In the meantime apps will just have to use app-specific or generic resources.)
- The input device would be rendered using the gripMatrix, as that's what should be used to render things that are held in the hand.
- Pointers are a little more subtle. We want to render a ray coming off the controllers in many cases, but not off the user's head. However, if the device is gaze-based we do still want to draw a gaze cursor, and if the session is a magic window context we don't want to draw any cursor at all. So a bit of logic is needed to handle that. When pointers are drawn they should be drawn using the pointerMatrix, which may differ from the gripMatrix for ergonomic reasons.

The basic pattern ends up looking like:
I'd expect that we'll get a Three.js library real quick that adds simple controller visualization to your scene and does all the right things in this regard.
Primary input events
Handling primary input events is the other half of this proposal. A quick recap of what that means, copy-pasted from the previous explainer:
To listen for any of the above, the developer adds listeners for the "select", "selectstart", or "selectend" events. When any of them fire, the event will supply an XRPresentationFrame that's used to query input and head poses. The frame will not contain any views, so it can't be used for rendering. It also provides an XRInputSource that represents the input device that generated the event. This may be one of the devices returned by xrSession.getInputDevices() (in the case of a tracked controller) or one that's not exposed anywhere else (in the case of a headset button, air tap, or magic window touch).

The exact interpretation of the pointer is dependent on the source that generates the event:
Use cases
The above capabilities give developers enough to handle the following (non-comprehensive) scenarios:
Obviously we'd like to enable more robust usage, but this does allow a pretty wide range of apps in the most broadly compatible way we can manage.
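As a concrete (if crude) sketch of one such scenario, dragging an object around with selectstart/selectend; the event attribute names (frame, inputSource) and the hitTest/moveTo helpers here are assumptions, not settled API:

let draggedObject = null;

xrSession.addEventListener('selectstart', (ev) => {
  const pose = ev.frame.getInputPose(ev.inputSource, frameOfReference);
  if (pose) {
    // Grab whatever the pointer ray hits (app-side hit test).
    draggedObject = hitTest(pose.pointerMatrix);
  }
});

xrSession.addEventListener('selectend', (ev) => {
  const pose = ev.frame.getInputPose(ev.inputSource, frameOfReference);
  if (draggedObject && pose) {
    // Drop the object wherever the pointer ray ends up.
    moveTo(draggedObject, pose.pointerMatrix);
  }
  draggedObject = null;
});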
Future directions
So that's the extent of the current proposal, but it's good to have an idea of how we could extend it in the future. A few thoughts on that:
The current Gamepad API maintainers would like us to continue using it in conjunction with VR, and have expressed a willingness to refactor the API if necessary to make it more generally useful. If we wanted to go that direction (and were confident we could map it to all relevant native APIs), I would propose that we either expose Gamepad objects on the XRInputSource or make XRInputSource inherit from Gamepad (we would drop the pose extensions and displayId).

But if that's not practical, which is a very real possibility, my general line of thinking is to add a way to query inputs by name or alias and receive back an object that can be used both for state polling of that element and for input event listening. Something like this:
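(Purely illustrative — getAction(), the 'select' action name, and the pressed attribute below are placeholders, not settled API:)

// Query an action by name/alias, then poll its state each frame.
const selectAction = xrInputSource.getAction('select');
if (selectAction && selectAction.pressed) {
  // ...respond to the primary action being held down...
}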
Which could then be used like so to get the same effect as the "select" event documented earlier.
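(Again just a sketch under the same assumptions; handleSelect stands in for app logic:)

const selectAction = xrInputSource.getAction('select');
selectAction.addEventListener('press', (ev) => {
  const pose = ev.frame.getInputPose(xrInputSource, frameOfReference);
  if (pose) {
    handleSelect(pose.pointerMatrix);
  }
});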