Moving the interpretation of common higher-level semantic events down into the user agent is clearly necessary to support the creation of XR apps that will work across the many existing platforms and, going forward, on new ones.
One thing that's not clear to me is how we pick the small set of initial semantic events and how we extend that set in the future. How do we draw the line between the initial set of events and the events that also feel somewhat inevitable, like secondary selection?
This is looking really good -- I feel it captures well most of the concepts from our earlier iterations and meetings.
I have just a couple small suggestions:
In the case where a user begins a gesture (i.e. presses a button down) and the content loses VR focus or positional tracking before the button is released, the user likely does not want the gesture to be acted upon. Rather than emitting an onselectend event, the browser could emit an onselectcancel event. Alternately, we could still fire onselectend but include a "reason" attribute.
"out", "back", "undo", "exit", "home", "cancel" seem to be the second-most common interactions after "select" and could be handled by this event. Unlike select, this one would not have a raycast but would have onbackstart, onbackend, and onback events. Having start/end for this would enable long presses to go "home" while short pressed to go "back" for example.
It looks good to me. @toji, how do you see the proposed XRInputSource API evolving to accommodate input data beyond buttons and axes, like, for instance, a fully tracked hand or body suit?
If the controller does not have any spare physical buttons to represent the "onback" event, it could recognize this in other ways, enabling consistency across all WebVR experiences. Examples:
@TrevorFSmith: This is probably the question that's caused me more stress than anything else lately. The most straightforward answer I've come up with is that if we're concerned about the divide we can jump directly to the getAction approach I outlined in the "Future directions" section and simply start with only one action. In order to feel confident about that approach, though, I think we'd want to be confident that it was going to be the solution long term, and I'm simply not yet. That could change with some API iteration.
I brought this up with @bfgeek prior to posting this issue, and his suggestion was just to do the simple, straightforward thing for the moment and not overcomplicate things. "Primary action" is a fundamental enough concept that even if it ends up as the lone input holdover alongside a more robust future system, it won't look too weird.
@dmarcos: My gut tells me that body suits are not applicable here. For one, there's no hardware in wide use that does it yet, and thus the interfaces to it are undefined. You could speculate that it looks something like a Kinect skeleton, but that may not be the end product for a variety of reasons. Two, even if you have full body tracking, it's unlikely that developers want to treat a full body as equatable to existing controllers, which are almost always tracking hands in one form or another. I can envision a system where XRSkeleton and XRInputSource live alongside each other, feeding off of the same underlying data: one for tracking the user's full pose and one that abstracts their hands into a more traditional pointing device for compatibility.
As for hand tracking, which I view as a more specialized case: we did talk about that in the context of the previous explainer, and the hope was that we could attach a nullable hand skeleton directly to the XRInputSource when applicable, which could feasibly represent either Leap Motion-style visual tracking or Oculus Touch-style pose estimation. Doesn't seem appropriate for a v1 feature, but I do like having an idea of where that data could be attached.
@kearwood: +1 to the idea of communicating that the input was canceled somehow, though I'm curious if it applies in any situation other than focus loss? Can't think of any at the moment. Also, I'm hesitant to create a bunch of separate events. I kind of like your suggestion of adding a reason, though I'd maybe take it a step further and have a more general selectchange event that includes the current state. (That's edging closer to the "Future directions" suggestion; not sure how far down that rabbit hole I want to fall.)
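Purely as a sketch of that direction (neither the event name nor its fields are settled, and abortCurrentGesture is a hypothetical app handler), it might be consumed like:

xrSession.addEventListener('selectchange', (ev) => {
  // Hypothetical fields: ev.state could be "start", "end", or "cancel",
  // and ev.reason could explain a cancellation (focus loss, tracking loss, etc.).
  if (ev.state === 'cancel') {
    abortCurrentGesture(ev.reason);
  }
});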
As for things like back/home/exit/etc., I'm more conflicted. They absolutely make sense for browsing mode, but in that case the UA should handle it entirely. In WebXR content it seems like a lot of them are going to be context-sensitive, and we don't have a reasonable way of knowing the context at this time. Certainly we can't be guaranteed that there will be enough discrete buttons to cover all of the desired functions, and if we enforce certain gestures we end up limiting the app's input possibilities. For example: I'd hate to rule out games that simulate swordplay because all of that rapid slashing was interpreted as a back gesture.
So I think that until we start dipping our toes into declarative land we should ensure that the developer is fully in control of how they interpret inputs, and that we simply surface them in as consistent a manner as possible. That said, making it easier to recognize certain common gestures (touchpad swipes, double clicks, etc.) seems like an unquestionably good idea. I'm not sure the browser can provide anything there that a well-built library can't, though. (This is, of course, assuming we make more of the input state available to the page.)
This looks really good.
For controllers you get a matrix, but if the controller only has rotation capabilities, should the browser return a matrix with only a rotation component? Or perhaps it should estimate the position based on a shoulder/arm rig once it knows which hand the user is holding the controller in? What are your thoughts on this?
@AdaRoseCannon: I agree that an arm model should implicitly be part of the matrix returned for 3DoF controllers. The only caveat is that we should ensure that the arm model doesn't alter the rotation of the controller, only its offset. That way developers who care exclusively about the rotation (driving games, etc.) can trivially zero out the translation components and be left with an accurate orientation. We've confirmed that this is how the Daydream arm models work, and I would imagine that GearVR is the same.
That said, we should probably indicate when the position is emulated rather than sensor-based. That was in the original explainer, but I took it out here for simplicity and because I hated the name. :wink: Maybe xrInputSource.positionEstimated?
As for which side the user's holding it on, I'd expect the platform to know what the preferred hand is, so we should be able to just use that.
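For example, a minimal sketch of stripping the arm-model offset back out, assuming a column-major 4x4 matrix in the WebGL convention (translation in elements 12-14):

function orientationOnly(gripMatrix) {
  // Copy the matrix and zero the translation, leaving the sensor-accurate rotation.
  const m = new Float32Array(gripMatrix);
  m[12] = 0;
  m[13] = 0;
  m[14] = 0;
  return m;
}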
Looks good! I think the getAction approach is the most extensible and should be the only way to receive interaction events from the inputs. The method could in the future return more than one type of action (interface), allowing a variety of input types with incompatible sets of attributes without requiring extensions to the core input source interface.
What I'm concerned about is how this proposal addresses the input mode where the input is continuous rather than a sequence of discrete events (actions, gestures).
@toji re: "+1 to the idea of communicating that the input was canceled somehow, though I'm curious if it applies in any situation other than focus lost?"
Some other situations:
Really like this. Big +1 to the arm model, and as long as v1 is extensible to all possible buttons on an input device, I think it's great
I like this proposal. I'm fine with adding it to the Gamepad API, though that API needs some work as well (which I've been digging into lately). I'm confused about how the developer can tell which sort of device the user currently has, and therefore if they should draw the gaze cursor and other interaction overlays. Does the developer look for the existence of the gripMatrix vs pointerMatrix objects?
I'm confused about how the developer can tell which sort of device the user currently has, and therefore if they should draw the gaze cursor and other interaction overlays.
This was definitely under-specified in my above text, and I'm not sure I'm totally satisfied with the implied proposal I made.
The code sample shows that if there are no tracked controllers then the device is implied to use a gaze cursor. That seems to fit for Cardboard-style devices and probably HoloLens use (even though it does provide limited hand tracking, it doesn't expose hand position frame-to-frame AFAIK). However, on further reflection this wouldn't really be appropriate for something like the GearVR's headset controls or the Oculus Remote. Those provide more than one-bit inputs, so presumably we'd like to expose them eventually, but they are not actually tracked independently from the headset. I guess there are a few different attitudes we could take regarding those:
- Include them in xrSession.getInputDevices(), adding data to indicate their tracking capabilities (or lack thereof).
- Reserve xrSession.getInputDevices() for tracked devices only; the select event could still fire.

I like the proposal and agree with some of @kearwood's comments to include the cancel event. Not sure about not using a different event handler, though; I think it makes it more consistent (although I understand not wanting to swamp the API with handlers).
Some notes:
If the type is InputSource (I prefer InputDevice, but I guess there is an argument that some types of inputs are not devices per se?), then the call should be getInputSources, right?
How will the app know what to render for the controller? For example, how does the app know that a Vive controller or an Oculus Touch controller should be rendered to match the type of device the user is holding? Or, since this is a high-level abstraction that does not provide access to other buttons, should all controllers be rendered using a generic 3D model?
This is looking really good, @toji ! Thanks for pulling the key parts of the original proposal forward.
@judax yeah the intent with the naming is to also be representative of hand input (such as on HoloLens). Also, we had an optional glTF blob off of InputSource that maybe got missed somewhere in the shuffle. @toji, was that on purpose, or can we just grab it and add it back in?
@kearwood As far as "cancel" goes, I agree that it's worth having!
It's great that we're taking a look at how we might integrate with Gamepad going forward to help ensure our plan is solid. At the same time, let's be careful not to go too far down the road of designing the future solution when the whole point of this new proposal is to address the fact that we aren't ready to do so yet ;)
As per the Implementers' Call:
We should ensure that content can know when it should draw a controller or not, so that controllers are not drawn on top of your physical hands/controllers in an AR environment.
I'd alter that statement a bit: We want to inform content about when it should draw controllers mostly to avoid having it draw a controller model for a gaze cursor. I'd argue that in an AR scenario (however it is that we identify that) we never want to draw the controller unless the intent is controller replacement. (And that's generally not something that can be done in a high quality way today). You would still likely want to draw the cursor/ray in an AR situation.
Following up with some more notes from yesterday's call: as Kip mentioned, one of the questions floating around is how we clearly communicate what the developer needs to draw in each situation. I get the feeling from the conversations we've had that an implicit guideline of "if A, B, and C are true, do this" may fall short of helping us provide consistent usage across sites.
Also, there's a question of what should show up in the array of returned devices. Tracked controllers seem like an obvious yes. I was under the impression that tracked hands might be a no, but Nell (who has more experience in that regard) says yes. There are differing opinions about whether or not a regular, non-tracked gamepad should show up in the list, with some suggesting that we'd want a way to explicitly associate them. And it seems most people (but not everyone) think that a touchscreen for a mobile-device magic window should not show up in the array, even though it would be generating events.
Input Variants
I think a useful exercise at this point is to list out as many different input examples as we can think of and propose how they would map to the system.
Scenarios I'm aware of (please respond with more if you can think of them):
For 3/6DoF tracked controllers you generally want to render controllers, pointers, and cursors (referring only to simple selection mode, obviously not to more specialized uses).
Untracked VR inputs across the board should only render a cursor, since they're all implicitly functioning as triggers for a gaze cursor AFAICT. Same for gamepads and vocal commands. We also want to be considerate of the fact that these may end up being more complex than just single buttons.
Touchscreens are unique in that there is no frame-to-frame data to be had, and no cursors should be rendered. The only time we know what the input ray is going to be is when the user's finger hits the canvas, and at that point rendering a cursor/pointer/controller is pointless because it would be obscured. Thus nothing should be rendered for touchscreen inputs.
Mouse input is interesting; Windows Mixed Reality has some affordances for this. In their usage the mouse cursor actually runs along the virtual surfaces, which is obviously not possible for the level of API we're talking about to handle opaquely. In previous conversations about this I believe we determined that if users wanted this behavior they could track mouse deltas themselves and cast a cursor into the world, probably while pointer locked (a rough sketch of that follows this list). Otherwise the mouse should act more or less like the touchscreen scenario above.
Honestly I'm not sure about tracked hands. HoloLens generally treats them as roughly equivalent to a Bluetooth clicker, while Leap Motion likes to think of them more like controllers. This is probably one of those situations where we really should be giving the user very direct guidance on platform conventions.
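Here's the rough sketch of the mouse-delta approach mentioned above, assuming pointer lock on the magic window canvas; the sensitivity value and the actual ray casting are left to the app:

let yaw = 0;
let pitch = 0;
const SENSITIVITY = 0.002; // arbitrary for this sketch

canvas.addEventListener('mousemove', (ev) => {
  if (document.pointerLockElement !== canvas) return;
  yaw -= ev.movementX * SENSITIVITY;
  pitch -= ev.movementY * SENSITIVITY;
  pitch = Math.max(-Math.PI / 2, Math.min(Math.PI / 2, pitch));
  // Each frame the app casts a ray from the view pose along (yaw, pitch)
  // and renders its own cursor where that ray intersects the scene.
});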
API representation
So, taking the above into account, in my opinion the question of what shows up in the input devices array boils down largely to "do I need to visually represent this input each frame?" (keeping in mind that the visual representation may simply be a gaze cursor). Which shakes out like this:
So now comes the really interesting question: if I'm on an Oculus Rift with Touch controllers AND I have an Oculus Remote kicking around, does it still show up as a third input? If so, does that mean that every WebXR page I ever visit has a persistent gaze cursor because the remote is paired (even though it's stuck in a drawer somewhere)?
Nell, Alex, and I had previously discussed this scenario and determined that the best course of action was to recommend that apps "mode switch" between gaze cursors and tracked pointers as necessary, depending on the last device you received input from (you can see this in the previous explainer). But I'd like to gather some more opinions on that, as it seems easy to get wrong. It does seem, however, like this may unavoidably be something that needs to be left up to the web app to sort out, and so it's our responsibility to give them enough info to make an informed decision. So what does that look like?
First off, I suggested earlier that we should indicate whether or not the positional element of the gripMatrix (if available) is emulated. I still think that's a decent idea, and a good way to denote the difference in capabilities between 6DoF and 3DoF without being overly reliant on those terms (because they may not always be strictly accurate).

Next, up-thread it was suggested that the presence of a non-null gripMatrix could imply which controllers should be rendered and which shouldn't. That seems sensible on the face of it, but I'm still a little concerned about the hand tracking scenario. Specifically: HoloLens can track hands, but the default behavior upon air tap is to select based on a gaze cursor (please correct me if I'm wrong, MS folks!). So differentiating between gaze/pointer input purely on the presence of a gripMatrix would either cause HoloLens to look like a more traditional controller up until the point the user air taps, or suppress a potentially useful piece of data (hand position).

So perhaps we should have an attribute that indicates where the pointerMatrix originates? I can see the HoloLens scenario giving both a grip and a pointer matrix, but having the pointer matrix originate at your head and follow your gaze, because that's what it will report when the select event is fired. It's not reasonable, in my opinion, to make developers constantly check the origin of the pointerMatrix to try and infer a relationship with the device pose, so a simple enum stating pointerOrigin: "head" or similar feels appropriate.
Revised IDL
So with all that, I'd offer that maybe the IDL for XRInputSource should look like this instead:
enum XRHandedness {
"",
"left",
"right"
};
enum XRPointerOrigin {
"head",
"grip",
"screen" // Input sources with this origin won't show in the array
};
interface XRInputSource {
readonly attribute XRHandedness handedness;
readonly attribute XRPointerOrigin pointerOrigin;
readonly attribute boolean emulatedPosition;
};
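To sketch how content might consume this (not part of the proposal itself; drawGazeCursor, drawController, and drawPointerRay are hypothetical app helpers, and the pose is assumed to carry the gripMatrix/pointerMatrix discussed above):

function representInputSource(inputSource, inputPose) {
  if (inputSource.pointerOrigin === 'head') {
    // Gaze-style input: draw only a gaze cursor, no controller model.
    drawGazeCursor(inputPose.pointerMatrix);
    return;
  }
  // "grip" origin: draw a controller (or hand) plus its pointer ray.
  // emulatedPosition distinguishes arm-model 3DoF from fully tracked 6DoF.
  drawController(inputPose.gripMatrix, inputSource.handedness,
                 inputSource.emulatedPosition);
  drawPointerRay(inputPose.pointerMatrix);
  // "screen" origin sources wouldn't appear in the array at all.
}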
TODO
I'm still not quite sure what the right way is to handle traditional gamepads, but if they do show up as XRInputSources I think all of the above applies cleanly to them.
I assume this is intentional, but are we thinking about haptic support for v1?
In order to keep input in WebVR accessible, we should consider creating a fallback to Gamepad API input mapping. 6DoF controllers like Oculus Touch "are inherently not accessible," said Josh Straub, Editor-in-Chief of D.A.G.E.R System (@DAGERSYSTEM), at OC4.
Suggesting basic standardized gestures exposed through the API will guide developers to consider platform independent interaction and will empower a broader range of users and gamepad/controller makers/modders to be part of the industry. Here are links to organizations that create custom controllers for special needs children and people with physical disabilities.
I presented an approach of creating a "lowest common denominator" for input controls in VR at the W3C Authoring in VR Workshop in Brussels two months ago - here is a link to my presentation
I am not seeing a fallback pattern across all the different input variants, but maybe I am missing something?
A question about 3DOF controllers, such as the Daydream or GearVR controller: should the elbow model be applied by the WebXR implementation (by the browser)? Do we need to distinguish between 6DOF controllers and 3DOF controllers somehow? Also, there is no way to get a name for the controller (probably due to fingerprinting issues), meaning we can't even render the proper mesh for it. Am I correct?
@Artyom17: I'm proposing that yes, we do provide an elbow model for 3DoF controllers, and that we indicate it with the emulatedPosition boolean on the XRInputSource. We would want to introduce a constraint that the elbow model must only affect the translation, not the rotation, so that it's easy to strip out of the pose if the developer needs to. That shouldn't be problematic based on what I know of the Daydream and GearVR models. (emulatedPosition would also become the way to differentiate between 3DoF and 6DoF.)
And yes, I haven't included a controller name here. I'm happy to discuss whether we need one or not, but I'm leaning towards avoiding them if we can. Fingerprinting is a concern, yes. (And it gets a bit silly if we say "I won't tell you the name of the headset, but its controller is called an 'Oculus Touch'.") More than that, though, I saw LOTS of WebVR apps that checked for controller names containing, for example, the string "Vive" and ignored everything else. That's definitely not a pattern we want to encourage. It does prevent developers from looking up the right mesh, though.
I think we could get by at first by having our samples use some sort of generic VR controller mesh that's not obviously representative of any particular hardware, and encouraging developers to use something that's contextually relevant (like a remote control for video players, a gun or sword for action games, or a paintbrush for art apps). Extending the API to include a way to return meshes would be really good in the future, but I no longer see it as critical (or practical) as part of v1.
@toji I am still not sure how that's supposed to work without a way to identify the controller(s). All of them are different, with different shapes, sets of buttons, triggers, joysticks, etc. Even just to make an instruction screen for a hypothetical game, where you'd try to explain which buttons to press: how is that supposed to work? List all the possible controllers?
And yet, the Gamepad API has the 'id' of the controller, and if we continue to use it then our worries about fingerprinting are not justified...
Speaking of handling buttons, triggers, etc.: it seems like the current proposal limits the choice to a single trigger ('select'), am I right? Isn't that going to be a downgrade from the current state?
Some thoughts:
- The arm model would ideally be taken from the system (e.g. GearVR or Daydream) for consistency with non-Web apps, and then passed through the browser.
- A-Frame started with a generic controller, and IMO it was quickly replaced by specific controller models both for consistency with non-Web apps and for user comprehension, as @Artyom17 alludes to above.
- Supporting real apps needs more than a single select action; the systems teach users conventions and combinations, IMO. For current 3DOF one might argue for at least select and back/"menu"; for current 6DOF one might argue for select (trigger), grip, and back/"menu", but there is usually at least a fourth, which is the stick or pad, especially for teleporting.
- Maybe per-site user permissions would be needed before providing the equivalent of Gamepad API information to avoid fingerprinting concerns, but as a practical matter users would likely be forced to accept, and that may represent an undesirable point of friction entering VR experiences... and if detail such as the grip matrix is freely provided, that may be just as problematic for fingerprinting anyway.
@Artyom17: Yes, this would be a deliberate downgrade from the amount of raw data we exposed previously (though making use of that data was hard to do in a generalized way.) The intent would not be to leave it at that, though. We'd want to extend the API soon to expose as much of the full input state as possible, but it would be really good to know how some of the backing native APIs are actually going to work before committing to anything.
@machenmusik: Agreed the arm model should be lifted directly from the native API whenever possible. As for your concerns about more robust input needs I do agree, but it's difficult to do that in a way that's not immediately exclusionary to new hardware. The A-Frame issue you referenced this one in is a perfect example. Oculus Go should be able to be treated as a drop-in replacement for GearVR, with maybe a different controller model. But because A-Frame relies on the controller name to get accurate input mapping it can only support new hardware when the developers specifically add support for them. This isn't a huge deal for something with the expected popularity of a new Oculus headset, but for the long tail of hardware that the A-Frame devs never get their hands on it's not realistic. We'd like to establish a universal baseline first, then add more in-depth capabilities.
(Totally agree that the need for accurate controller models is a big deal, BTW. I don't want to make it sound like I don't care. I just don't see a good way to do it as part of the version 1 API unless we expose controller name strings, and that immediately leads to the hardcoding behavior we're trying so hard to avoid.)
Thanks @toji.
w.r.t. more robust input needs, I do understand the balance with works-by-default, but I am not sure that having only one prescribed button is enough expressiveness.
Looking at the various controllers out there, the only combination that can be said to have one "button" (when it works) is Cardboard, and that isn't actually a controller. From Rift to Vive to WinMR to GearVR to Daydream, every single 3DOF/6DOF controller has at least three buttons, one of which is reserved for the system menu and one of which is intended for the app menu and/or back.
Perhaps the browser should allow the app menu button to be used by the Web app, while reserving some behavior to ensure its menus can be invoked when needed, and the generic description can expose two actions rather than one.
If we were trying to define a basic generic set for XR controllers, off the top of my head I'd propose something like this:
This would allow one to support minimal state, but allow a little more expressiveness from capable controllers without diving into customizations.
I suspect that pointerOrigin (gripMatrix) and emulatedPosition may prove to be enough to crudely distinguish models if fully implemented, although that is probably bad news from a fingerprinting perspective.
FYI, for anyone that's been participating in this conversation: A related pull request for the explainer has been available for a week and a half over at #325, so please take a look if you haven't already.
The basic input PR has now been merged, so closing this. If you have more specific issues with the input system please file them as new issues.
Background
A good chunk of this is a re-focusing of the previously produced Input Explainer, so nothing here should really be surprising.
For those who don't already know, we are unable to continue with the design put forward in that document because of concerns about compatibility with upcoming (and, as of this writing, unreleased) VR/AR standards. I won't dive into a comprehensive evaluation of those incompatibilities, due to the unreleased status of the standard in question, but will broadly note that it's not known at this time whether it will allow the full input device state to be queried in the manner the previous explainer would require.
Given that, and given that we would prefer to have users begin using the WebXR Device API as soon as is reasonable without being blocked on third parties, I propose that we re-focus on exposing a minimal but broadly compatible subset of the previously discussed functionality and have some clear ideas of how it could evolve to fit a variety of underlying input systems in the future.
Requirements
The "simple" proposal from the previous explainer was one that just allowed developers to listen for basic point-and-click events from an source of VR input, which is enough to enable basic button-based UIs. This is "good enough" for video players, galleries, some simple games, etc. It is insufficent for more complex uses like A-Painter style art apps, complex games, or really anything that involves direct manipulation of objects.
That's regrettable, but a limitation that I feel is worth accepting for the moment in order to enable the significant percentage of more simplistic content that we see on the web today.
So, what we need to enable that level of input is:
I would also propose that, since this would be all we offer initially, we make this just a teensy bit more useful and future-proof by adding:
This would allow a bit more nuance in the interactions allowed by the system, giving the option to drag items around, for example.
Proposal
I find it easier to talk about these things when looking at an interface, so I'll start with a proposed IDL:
Tracking and rendering
Let's dive into tracking first, since it's relatively straightforward.
- xrSession.getInputDevices() returns a list of any tracked controllers. This does not include the user's head in the case of gaze-tracking devices like Cardboard. By themselves these objects do basically nothing useful.
- Each frame the developer can iterate through the list and call xrFrame.getInputPose(inputDevice[i], frameOfReference); to get the pose of the input in the given coordinate system, synced to the head pose delivered by the same frame. This can be used to render some sort of input representation frame-to-frame. (Note: I'm not including anything that describes a controller mesh, for practicality reasons. We can investigate that later. In the meantime apps will just have to use app-specific or generic resources.)
- The input device would be rendered using the gripMatrix, as that's what should be used to render things that are held in the hand.
- Pointers are a little more subtle. We want to render a ray coming off the controllers in many cases, but not off the user's head. However, if the device is gaze-based we do still want to draw a gaze cursor, and if the session is a magic window context we don't want to draw any cursor at all. So a bit of logic is needed to handle that. When pointers are drawn they should be drawn using the pointerMatrix, which may differ from the gripMatrix for ergonomic reasons.

The basic pattern ends up looking like:
I'd expect that we'll get a Three.js library real quick that adds simple controller visualization to your scene and does all the right things in this regard.
Primary input events
Handling primary input events is the other half of this proposal. A quick recap of what that means, copy-pasted from the previous explainer:
To listen for any of the above, the developer adds listeners for the "select", "selectstart", or "selectend" events. When any of them fire, the event will supply an XRPresentationFrame that's used to query input and head poses. The frame will not contain any views, so it can't be used for rendering. It also provides an XRInputSource that represents the input device that generated the event. This may be one of the devices returned by xrSession.getInputDevices() (in the case of a tracked controller) or one that's not exposed anywhere else (in the case of a headset button, air tap, or magic window touch).

The exact interpretation of the pointer is dependent on the source that generates the event:
Use cases
The above capabilities give developers enough to handle the following (non-comprehensive) scenarios:
Obviously we'd like to enable more robust usage, but this does allow a pretty wide range of apps in the most broadly compatible way we can manage.
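As a concrete (if crude) sketch of one such scenario, dragging an object around with selectstart/selectend; the event attribute names (frame, inputSource) and the hitTest/moveTo helpers here are assumptions, not settled API:

let draggedObject = null;

xrSession.addEventListener('selectstart', (ev) => {
  const pose = ev.frame.getInputPose(ev.inputSource, frameOfReference);
  if (pose) {
    // Grab whatever the pointer ray hits (app-side hit test).
    draggedObject = hitTest(pose.pointerMatrix);
  }
});

xrSession.addEventListener('selectend', (ev) => {
  const pose = ev.frame.getInputPose(ev.inputSource, frameOfReference);
  if (draggedObject && pose) {
    // Drop the object wherever the pointer ray ends up.
    moveTo(draggedObject, pose.pointerMatrix);
  }
  draggedObject = null;
});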
Future directions
So that's the extent of the current proposal, but it's good to have an idea of how we could extend it in the future. A few thoughts on that:
The current Gamepad API maintainers would like us to continue using it in conjunction with VR, and have expressed a willingness to refactor the API if necessary to make it more generally useful. If we wanted to go that direction (and were confident we could map it to all relevant native APIs), I would propose that we either expose Gamepad objects on the XRInputSource or make XRInputSource inherit from Gamepad (we would drop the pose extensions and displayId).

But if that's not practical, which is a very real possibility, my general line of thinking is to add a way to query inputs by name or alias and receive back an object that can be used both for state polling of that element and for input event listening. Something like this:
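(Purely illustrative — getAction(), the 'select' action name, and the pressed attribute below are placeholders, not settled API:)

// Query an action by name/alias, then poll its state each frame.
const selectAction = xrInputSource.getAction('select');
if (selectAction && selectAction.pressed) {
  // ...respond to the primary action being held down...
}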
Which could then be used like so to get the same effect as the "select" event documented earlier.
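(Again just a sketch under the same assumptions; handleSelect stands in for app logic:)

const selectAction = xrInputSource.getAction('select');
selectAction.addEventListener('press', (ev) => {
  const pose = ev.frame.getInputPose(xrInputSource, frameOfReference);
  if (pose) {
    handleSelect(pose.pointerMatrix);
  }
});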