Advanced WebXR Controller Input

Given feedback from multiple developers citing concerns about our initial plans for a limited input system, the WebXR community group wanted to re-open a discussion about how to handle more advanced input. We agreed on a recent call to put together a proposal for how such a system might work so that we can iterate on the design in public and gather feedback from relevant parties (such as the OpenXR working group and the W3C Gamepad CG).

With WebVR we had exposed VR controller state with extensions to the gamepad API. We feel that building this new advanced input on top of the Gamepad API as it exists today is problematic, though, for a couple of reasons:

Developers frequently complained about the difficulty mapping multiple controllers using that method, and as a result we saw a lot of experiences that excluded all controllers that didn't have a specific name string.
OpenXR's proposed input model makes it effectively impossible to enumerate all available inputs on a controller (or, for that matter, all available controllers)
XR controllers appear to be converging on a small set of common layouts relatively quickly, which should allow us to expose something more limited in scope than the generic gamepad API with higher confidence than usual that it'll be relevant for multiple years to come.

That said, we don't want to re-invent wheels that we don't have to, so we're open to further discussing this proposal with the individuals maintaining the Gamepad API at the moment to see if there's a common ground that can be reached that isn't deterimental to this use case.

Proposed IDL

interface XRInputState {
  readonly attribute boolean pressed;
  readonly attribute boolean touched;
  readonly attribute FrozenArray<double> value;
};

interface XRControllerState {
  readonly attribute XRInputState? trigger;
  readonly attribute XRInputState? joystick;
  readonly attribute XRInputState? touchpad;
  readonly attribute XRInputState? grip;
  readonly attribute FrozenArray<XRInputState> buttons;
};

partial interface XRInputSource {
  readonly attribute XRControllerState? controllerState;
};

And some really brief sample code:

let inputSource = xrSession.getInputSources()[0];

if (inputSource.controllerState) {
  // Is a controller with buttons and stuff!

  let joystick = inputSource.controllerState.joystick;
  if (joystick && joystick.value.length == 2) {
    // Has a 2-axis joystick!
    PlayerMove(joystick.value[0], joystick.value[1]);
  }

  if (inputSource.controllerState.buttons.length > 0) {
    let button = inputSource.controllerState.buttons[0];
    if (button.pressed) {
      PlayerJump();
    }
  }

  // etc.
}

These snippets are not intended to be taken verbatim, but are meant to serve as a concrete starting point for further conversation.

Mapping to native APIs

One of our primary concerns when structuring this API is ensuring that it can work successfully on top of OpenXR, which we expect to power a non-trivial amount of the XR ecosystem at some future date. As explained in the Khronos group's GDC session, that API is currently planning on exposing an input system that revolves around binding actions to recommended paths (ie: "/user/hand/left/input/trigger/click"), which the system can then re-map as needed.

We would expect the above interface to be implented on top of an OpenXR-like system by creating actions that are a fairly literal description of the expected input type and map it to the "default" paths for that input type when available. (ie: controllerState.trigger.value[0] is backed by an action binding called "triggerValue" with a default path of "/user/hand//input/trigger/value") If the binding fails that particular input is set to null to indicate it's not present. This is a fairly ridgid use of the binding system, but does allow users to do some basic, browser-wide remapping when needed.

For any other native API, the inputs are usually delivered as a simple struct of values, which are trivial to map to this type of interface.

Random thoughts/comments

This does not replace the existing "select" events and it's variants, but instead is additive on top of that. We still expect many apps will want to use the simple "select"-based input method for the broadest possible compatibility, whereas thie proposal is only applicable to systems that use controller devices.
The value array maps well to OpenXR's idea of actions as vectors. I like that the length makes it easy to test what kind of value you're dealing with: 0 == boolean, 1 == scalar, 2+ == vector
value elements should all be normalized to a [-1, 1] range where 0 is the neutral value
You could feasibly map something like Knuckles finger tracking as a vector of 5 values in the grip input (1st value gives overall grip value for backwards compat.)
controllerState name is very intentionally exclusive of things that are not controller-shaped. I would expect alternative tracked inputs like hands to ommit the controller state altogether and rely solely on select events for now. I think if we want more than that we'll need hand-centric input state.
Following the previous item, and as we've discussed before, I think explicit hand skeletal poses are something that could exist as a separate entity on the XRInputSource, I like the idea of keeping those kinds of states bundled together in discreet, easily testable interfaces that live side-by-side under the input source.
For OpenXR action mappings in relation to the button array we'd probably just end up speculatively binding to a primary and secondary button (or A and B or whatever the path ends up being for the known controllers) and see if it takes. Not great, but functional. We would want to declare a way to determine the ordering in the spec, by the way. (Closest to the user's palm first or something)
I'm not advocating adding state change events to this model at this time, but we may want to do so in the future.

Questions

Can you detect that a joystick is clickable? Do you need to?
Some devices can detect how far a finger is off the input rather than just touched/not. Is that something we care about exposing here?
Do we need to declare if a value has a range of [-1, 1] or [0, 1]? Is it implied by the number of values? By the input name?
Do we care about haptics in this first pass? I'm leaning towards no for simplicity, but could easily be convinced otherwise.

It's looking great Brandon, thanks for sharing it. Just few comments:

readonly attribute XRInputState? trigger; readonly attribute XRInputState? joystick; readonly attribute XRInputState? touchpad;

I know that all the controllers I have seen so far has just one of these, but do you think it could be useful to define them as an array as you do with buttons so we could support new fancy controllers in the future? triggers[0] will still be the default trigger, but who knows maybe you could have an additional trigger for the middle finger or so.

Can you detect that a joystick is clickable? Do you need to?

I believe it's valuable to have that feature, both for joystick and touchpad. I expect that it will works as it does right now am I correct? I mean If you are moving the joystick or using the touchpad without pressing, you'll get the axis values on the array of values and touched will be true, but pressed false, and once you click on it, you'll also set pressed to true. Regarding detecting if it's clickable or not, what about adding a third value to the array. If we assume all the joysticks and touchpad has 2 axis (For 1 axis buttons like LT or RT on the xbox controller we will use the buttons array anyway), we could add a third one indicating that it's clickable and it will contains 0 or 1 depending on the state.

Do we need to declare if a value has a range of [-1, 1] or [0, 1]? Is it implied by the number of values? By the input name?

Currently we're using touchpads and joysticks values as [-1, 1] and analog buttons like triggers with [0, 1]. If we have boolean buttons with values 0 or 1, I'd expect analog buttons to be on the same range too. Otherwise we should provide a way to detect if the button is analog or digital so you could map the [-1,1] range back to [0,1]. Otherwise you could end up mapping digital 0 & 1 to 0.5 and 1.

Do we care about haptics in this first pass? I'm leaning towards no for simplicity, but could easily be convinced otherwise.

I'm happy to skip that too on the first iteration.

The value array maps well to OpenXR's idea of actions as vectors. I like that the length makes it easy to test what kind of value you're dealing with: 0 == boolean, 1 == scalar, 2+ == vector

I know that we always try to define the APIs as simple as possible and remove any extra fancy feature that we can infer from it in order to simplify it. But I would like to hear opinions about adding explicit definitions for things like this when detecting the kind of value, or as my previous proposal for the "is clickable" value. As if we end up realizing that everyone using the API will need to do that type of tests by themselves maybe it could be worth to include an specific param for that?

Thanks again for kicking off this!

Link for the lazy : In the video @toji linked, Nick Whiting talks about input handling in OpenXR at https://youtu.be/U-CpA5d9MjI?t=28m15s until ~36:40.

We would expect the above interface to be implented on top of an OpenXR-like system by creating actions that are a fairly literal description of the expected input type and map it to the "default" paths for that input type when available.

I am concerned that this approach removes the benefits of the OpenXR-like input system. As an application developer, the device-abstraction provided by OpenXR allows me to write an application against named, typed actions. I want to replace code written like this:

let button = inputSource.controllerState.buttons[0];
if (button.pressed) {
  PlayerJump();
}

with code written like this:

// In initialization code
const jumpActionHandle = Input.idForAction("/foo_platforming/in/jump");

// In game loop
if (Input.getBoolean(jumpActionHandle)){
  PlayerJump();
}

I would love to write my web application as if it were an OpenXR application. However, there are problems with this, stemming from the fact that the browser is the OpenXR application, not the web app. Here are some of those problems and potential solutions. I'd love to hear feedback on these.

The user wants to customize bindings on a per-(web)application basis.
The web application developer wants to define action sets and write logic against actions.
The browser must interface with OpenXR to define all possible actions.

One way I imagine these all to be satisfied is as follows:

When the browser starts, it gives the runtime an action manifest for all of its own actions:

{
  name: "/browser_default/in/pointer_position",
  type: "vector3"
},
{
  name: "/browser_default/in/pointer_orientation",
  type: "quaternion"
},
{
  name: "/browser_default/in/pointer_click",
  type: "boolean"
},
{
  name: "/browser_menu/in/up",
  type: "boolean"
},
{
  name: "/browser_menu/in/down",
  type: "boolean"
},
{
  name: "/browser_menu/in/select",
  type: "boolean"
},
{
  name: "/browser_menu/in/cancel",
  type: "boolean"
},

This action manifest would allow the browser (being an OpenXR application), to allow the user to click on things in the browser chrome and open menus and such. A real example would include more than this. Here I wanted to show that the browser would specify the actions it needs to operate in two different action sets ("browser_default" and "browser_menu").

When the browser navigates to a web application ("Foo"), the web application signals to the browser that it wants to use OpenXR-like input. The web app supplies its action manifest, which the browser composes with its own. The browser then reregisters its actions with OpenXR, potentially appearing as a different application (It goes from being called "Browser" to "Browser-Foo" in OpenXR). I don't know if this is possible, or if there's some way to make this possible! The browser submits a new action manifest to OpenXR:

{
  name: "/browser_default/in/pointer_position",
  type: "vector3"
},
...
{
  name: "/browser_menu/in/cancel",
  type: "boolean"
},
{
  name: "/foo_menu/in/up",
  type: "boolean"
},
{
  name: "/foo_menu/in/down",
  type: "boolean"
},
{
  name: "/foo_menu/in/select",
  type: "boolean"
},
{
  name: "/foo_menu/in/cancel",
  type: "boolean"
},
{
  name: "/foo_platforming/in/jump",
  type: "boolean"
}
...

What I don't know is whether the browser can appear as a different application to the runtime for each web application that is running inside / on top of it. If there's a way to do that, then users can store/share their favorite bindings for web applications the same way they do for native applications. There are some obvious hurdles to this approach, assuming it's possible. For example, if switching tabs back and forth unregisters and reregisters the browser with OpenXR, then OpenXR handles to actions held by the web application may be invalidated, and there may be significant overhead in doing the switch.

If there's not a way to do that, that is unfortunate. Perhaps the browser can to store this information on behalf of the user and suggest bindings to OpenXR whenever the user navigates to a new page, but this seems like a MUCH more difficult arrangement to pull off.

In summary, I'm concerned that writing against inputSource.controllerState.buttons[0] will 1) leave users in a place where they store N custom bindings for their browser, one for each web application they use. 2) not allow developers to write "input agnostic" code and leaves them to figure out and implement a pattern of using actions / action sets on their own.

If we can find a way to share the benefits of the OpenXR runtime input system (_or an OpenXR-like input system), that would be ideal.

From that perspective, potential answers to your questions:

Can you detect that a joystick is clickable? Do you need to?

The application specifies the need for a boolean action. The user can bind that action to the joystick click via the runtime.

Some devices can detect how far a finger is off the input rather than just touched/not. Is that something we care about exposing here?

Handled by the runtime as an analog value the user can bind to an analog action (or some filter).

Do we need to declare if a value has a range of [-1, 1] or [0, 1]? Is it implied by the number of values? By the input name?

The application declares an analog action with range [-1, 1] or [0, 1]. The runtime fulfills this need.

Do we care about haptics in this first pass? I'm leaning towards no for simplicity, but could easily be convinced otherwise.

It would be nice to have. Replace "/in/" with "/out/" in those action manifests above to have a named haptic action the application can use to signal to the runtime when a vibration should occur.

Thank you @toji for opening up this discussion. I think this is a difficult problem and it would benefit a lot of people (myself included) if solved well.

This is excellent feedback, @johnshaughnessy! Within the CG we've discussed a few of the items you've brought up, so I want to surface some of the discussion here to give more context. Please don't take it as a dismissal of your points, though!

First off, I completely agree that a browser which you interact with in AR/VR would provide it's own OpenXR bindings for interactions such as clicking, scrolling, back, etc. The fact that OpenXR would make those remappable for the user is great for accessibility and forward compatibility, so wins all around!

As for the page mappings themselves, OpenXR provides what it calls "Action sets" (See Page 29 of the GDC slides), which would typically be used in something like a game to provide a different set of mappings between, say, gameplay and menu modes. Here we could create different action sets for traditional browsing and WebXR content, which generally covers the concerns you had about making the browser appear as a different application.

Given that, there's two approaches to how those action mappings could be applied: One is that we provide a single, static mapping for all WebXR content, which is basically the approach that my first post advocated for. We'd get a mapping that says something like:

[{
 actionName: "triggerClick",
 defaultBinding: "/user/hand/left/input/trigger/click"
 valueType: "boolean",
},
{
 actionName: "triggerValue",
 defaultBinding: "/user/hand/left/input/trigger/value"
 valueType: "scalar",
},
/* etc... */
]

And every WebXR page would use it. This does allow some remapping, in that you can set browser-wide changes for the various inputs, but you couldn't set a custom mapping for just, say, threejs.org.

The second approach, which is what you've implicitly suggested, is to have a separate action mapping per page. This feels attractive at first glance, but it leads to some tough technical and privacy issues.

Is the action set managed per-origin or per-url?
- Allowing pages to set their own native action set name is a non-starter because of namespace conflicts, so it has to be tied to the domain somehow.
- If it's per-origin then you're going to make life very hard on portal sites that collect a lot of experiences in a single location.
- If it's per-url, you have an issue now differentiating between pages that use query args. is example.com/xr-stuff.html?state=foo the same app as example.com/xr-stuff.html?state=bar? Usually yes, sometimes no.
In both cases you have an issue of privacy, in that the remapping utility has now become a secondary browser history that captures every WebXR-enabled page you visit. (Or at least every domain you visit, which isn't a whole lot better)
Or, alternately, we somehow prevent those action sets from being captured by the remapping utility, and now you've lost one of the big benefits of using action mapping in the first place.
Or, lets say that you obfuscate the URLs. Now there's new problems: You've made it significantly harder to figure out what action set you want to remap, and you've filled your remapping utility with lots of junk entries from sites you only intended to visit once. Plus you still have an issue with the action names themselves being potentially privacy-sensitive. "Teleport" is reasonably generic, but "igniteLightsaber" is highly suggestive that you're playing a Star Wars game. I'll allow you to extrapolate how this applies to adult sites.

Beyond those issues, however, we'd also run into the problem that in order to expose mapping of this type to Javascript you'd likely have to either just expose the OpenXR symantic path syntax directly or do something that is trivially and directly transformable into it. That carries an implication that we view OpenXR as the canonical backend for WebXR which, name similarities aside, is simply not the case. In order to avoid that and be more platform agnostic we'd probably end up creating a syntax that would require browser developer intervention to enable new devices/paths as they became available, and that's not an improvement in any meaningful way over simply having some static input types pre-mapped.

I haven't addressed every point you raised, but I'm going to have to leave it at that for now due to schedule constraints on my end. I'll try to leave additional comments soon to respond to anything I missed. Still, hopefully that gives some insight into some of the reasons we've been reluctant to fully embrace the OpenXR input model on the web so far. (I'm overall pretty positive about that model for more traditional apps. This is just one of those areas, as with so many other areas of computing, where the web is weird).

Thanks for kicking this off @toji. I like the initial proposal and I like how a hand api would go side by side with this and controllers like Touch and Knuckles could have both.

I do think it's important to have access to all possible information such as how far off the fingers are from the controller and haptics, which I think should at least have the same support as the current Gamepad API does as some WebVR sites are using it like Space Rocks.

I also think it's important to not use the same API that OpenXR offers, even though it's very versatile and a nice abstraction. It doesn't feel very webby to me and hard for people to get into it. Maybe in the future it becomes the common way to interact with all possible XR input sources and at the point we can create an API like that, and as you're already looking into how this API works on top of the OpenXR one, it could be brought in more easily.

Thanks for writing this up, @toji! While it would be nice to expose some action-like system in the future, I believe this more literal mapping is the right path for now, given the issues you discussed above.

Some comments on the details of the proposal:

readonly attribute FrozenArray<XRInputState> buttons;

One of the key advantages of this approach vs. the current Gamepad API is to give strong names to the axes: trigger, joystick, etc. We should explore doing the same for buttons such as menu as well, especially since what is an "axis" and what is a "button" is a squishy line given that controls like the "grip" can be touch-sensitive or not or have analog triggers or not depending on the particular controller. If we just define a flat array, I worry that various UAs and XR platforms will diverge in the index they give to equivalent controls, which will get us back to where are today with the Gamepad API in WebVR.

If we believe the buttons array was giving us an escape hatch for more exotic XR controllers (e.g. an XR belt with 10 buttons), we should define that escape hatch more explicitly. For example, perhaps next to the well-known attributes like trigger, we define a controls dictionary, which has the same well-known attributes as keys, along with an ability for UAs to expose other keys with some UA-specific prefix to avoid collisions. If we don't like this due to the possibility for divergence across UAs, we should then think carefully about exposing unspecified buttons values, which will likely lead to the same divergence, but with numbers instead of string keys.

For example, the sample code currently checks for a 2-axis touchpad by seeing if the touchpad's value array has two axes. However, if a touchpad had only 1 axis, it's not clear if that would imply a horizontal x touchpad or a vertical y touchpad. By testing for well-known attributes, we can make this explicit and make the code easier to read:

interface XRInputStateDouble {
  readonly attribute boolean pressed;
  readonly attribute boolean touched;
  readonly attribute double value;
};

interface XRInputStateVector2 {
  readonly attribute boolean pressed;
  readonly attribute boolean touched;
  readonly attribute double x;
  readonly attribute double y;
};

interface XRControllerState {
  readonly attribute XRInputStateDouble? trigger;
  readonly attribute XRInputStateVector2? joystick;
  readonly attribute XRInputStateVector2? touchpad;
  readonly attribute XRInputStateDouble? grip;
  readonly attribute XRInputStateDouble? menu;
  readonly attribute XRInputStateDouble? a;
  readonly attribute XRInputStateDouble? b;
};

partial interface XRInputSource {
  readonly attribute XRControllerState? controllerState;
};

Combining explicit button attributes and the stronger state interfaces, this results in more readable WebXR input code:

let inputSource = xrSession.getInputSources()[0];

if (inputSource.controllerState) {
  // Is a controller with buttons and stuff!

  let joystick = inputSource.controllerState.joystick;
  if (joystick && joystick.value.x && joystick.value.y) {
    // Has a 2-axis joystick!
    PlayerMove(joystick.value.x, joystick.value.y);
  }

  let jumpButton;
  if (inputSource.controllerState.a) {
    jumpButton = inputSource.controllerState.a;
  } else if (inputSource.controllerState.menu) {
    jumpButton = inputSource.controllerState.menu;
  }

  if (jumpButton && jumpButton.pressed) {
    PlayerJump();
  }

  // etc.
}

Also, while doing the feature tests inline with the values feels very webby, we should decide what it means for inputSource.controllerState to be available or not. If a controller is paired but is not sending data at the moment, can I still inspect the truthiness of its attributes to see what controls will be available? Or should a UA just simplify things for apps and give some default values for any controller that is enumerating?

I think we will need hand centric options for things like leap motion. That said, it would have to track a hand skeleton, and I'm not sure how to handle things like the steamvr knuckles.

That said, the possibility of other biometric data being read such as heartbeat tracking should be considered as well, maybe that should be it's own thing? If that were to be implemented, that would need a privacy/security perspective included and permissions for obvious reasons.

This issue was moved to immersive-web/webxr#392

immersive-web / proposals