immersive-web / proposals

Initial proposals for future Immersive Web work (see README)

Eye Tracking #79

Open msub2 opened 1 year ago

msub2 commented 1 year ago

With the release of the Quest Pro and eye tracking slowly becoming available to more and more users, perhaps it'd be worth looking into implementing support for eye tracking in WebXR. There's been some discussion in the past, such as in https://github.com/immersive-web/proposals/issues/25 and the latter comments of https://github.com/immersive-web/proposals/issues/70, but eye tracking was still restricted to a very small subset of headsets back then. There's an existing OpenXR Extension that appears to already expose what I would expect from an eye tracking API (namely, where the user's eyes are looking), so I imagine this could be integrated into an existing browser's OpenXR implementation. As mentioned in the past it would also be prudent to have this be a separate permission, similar to how hand tracking is treated currently.
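For illustration only, here's a rough sketch of what a hand-tracking-style opt-in could look like from the page's side; the 'eye-tracking' feature string and the enabledFeatures check are assumptions, not part of any current spec:

// Hypothetical sketch: request eye tracking as a separate, permission-gated
// WebXR feature, mirroring how hand tracking is exposed today.
// The 'eye-tracking' feature string is an assumption, not a specified value.
const session = await navigator.xr!.requestSession('immersive-vr', {
  requiredFeatures: ['local-floor'],
  // Optional, so the experience still works if the user declines the prompt.
  optionalFeatures: ['hand-tracking', 'eye-tracking'],
});

// The feature is only enabled if the user granted permission, so the app can
// feature-detect and fall back to controller or hand input.
const hasEyeTracking = session.enabledFeatures?.includes('eye-tracking') ?? false;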

cabanier commented 1 year ago

What are the use cases you have in mind for eye tracking?

msub2 commented 1 year ago

Off the top of my head:

  • Gaze-based interaction and selection within the scene
  • Eye-tracked foveated rendering (though that might be better handled by the UA)
  • Social presence: animating avatar eyes in multiplayer experiences
  • Research into where users are looking

klausw commented 1 year ago

We've had previous discussions about eye tracking in the group, for example at the Feb 2020 f2f, and I think there were serious concerns about the privacy impact of exposing eye tracking data through a web API.

Wherever possible, I think it would be good to investigate alternatives where only the user agent gets access to eye tracking data. For example, in the interaction example, the UA could synthesize an XR (or DOM) input event once the user has activated an element, without exposing all the remaining eye movement data to the JS application. You've already mentioned that foveated rendering might be better handled by the UA, though even that may expose some information through a timing channel.
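A purely illustrative sketch of this pattern using the existing dom-overlay feature — the gaze-based dispatch is assumed UA behaviour, and the #ui root, buy-button element, and startCheckout function are placeholders:

// Sketch of the strictest version of this idea: interactive elements live in
// a DOM overlay, the UA decides internally (from gaze plus activation) which
// element was chosen, and the page only ever sees an ordinary click — no
// eye-movement data at all.
const session = await navigator.xr!.requestSession('immersive-ar', {
  optionalFeatures: ['dom-overlay'],
  domOverlay: { root: document.getElementById('ui')! },
});

document.getElementById('buy-button')!.addEventListener('click', () => {
  // The page learns only that this element was activated, not where the user
  // was looking before or after.
  startCheckout(); // hypothetical app function
});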

For the social aspect, it would be neat if the UA could apply a rotation to an eyeball object based on eye tracking data without exposing that data to the app, but that doesn't seem feasible for a multiuser application where such data needs to be shared between instances. (I guess we could pass around encrypted pose matrices that can only be decoded by the UA via a WebGL/WebGPU "encrypted uniform" extension, but that seems rather complicated.)

msub2 commented 1 year ago

I can definitely understand the concerns about security and the desire to separate the raw movement data from actual input to the application. In your hypothetical though, I'm not sure how one would go about determining when to fire an input event if the framework (let's say something like three.js) doesn't know which object you're looking at to activate in the first place. It seems to me as though meaningful interaction with an arbitrary scene would require you to give it your actual gaze direction.
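To make that concrete, here is a sketch of what a three.js-based framework would have to do if it were handed per-frame gaze data; the gazePose input is hypothetical, since no WebXR API currently provides it:

import * as THREE from 'three';

// Hypothetical: pick the scene object under the user's gaze. This only works
// if the app receives an actual gaze ray, which is the point made above.
const raycaster = new THREE.Raycaster();

function pickWithGaze(gazePose: XRPose, scene: THREE.Scene): THREE.Intersection | undefined {
  const { position, orientation } = gazePose.transform;
  const origin = new THREE.Vector3(position.x, position.y, position.z);
  // Like other XR target rays, assume the gaze ray points down -Z of the pose.
  const direction = new THREE.Vector3(0, 0, -1).applyQuaternion(
    new THREE.Quaternion(orientation.x, orientation.y, orientation.z, orientation.w),
  );
  raycaster.set(origin, direction);
  return raycaster.intersectObjects(scene.children, true)[0]; // closest hit, if any
}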

cabanier commented 1 year ago

There are two types of eye tracking. The first returns the pose of the location the user is focusing on. I'm unsure how private that information is, since it doesn't reveal more than what you already get from controllers or hands. This falls into @msub2's use cases for research and interaction.

The second returns the position and orientation of the eyes themselves. This one seems much more sensitive (e.g. it reveals the user's unique IPD), so it requires more mitigations. For instance, do we really need to know the exact orientation of the eyes, or can we apply generous rounding? Do we need the position of the eyes at all? This would be for the social use case.
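One possible shape of that mitigation, purely as a sketch (the three-degree step is an arbitrary illustrative choice, not something from any proposal):

// Quantize the reported gaze direction into coarse buckets before exposing it,
// and omit per-eye positions entirely. Coarse angles still support avatar eye
// animation and rough interaction, but carry far less identifying detail.
const GAZE_STEP_RADIANS = (3 * Math.PI) / 180; // ~3 degree buckets (assumed value)

function quantizeGazeDirection(yaw: number, pitch: number): { yaw: number; pitch: number } {
  const q = (angle: number) => Math.round(angle / GAZE_STEP_RADIANS) * GAZE_STEP_RADIANS;
  return { yaw: q(yaw), pitch: q(pitch) };
}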

I agree that eye-tracked foveated rendering must be handled by the UA. I'm unsure how much information could be learned by timing the rendering.

AlbertoElias commented 1 year ago

If this is possible to use in native apps where it is already a privacy concern, and users need to accept a permission, what's the difference between supporting it on the Web?

cabanier commented 1 year ago

If this is possible to use in native apps where it is already a privacy concern, and users need to accept a permission, what's the difference between supporting it on the Web?

The web has a higher bar than apps because apps have to comply with a more rigorous process before they are deployed to stores. You can reach any website through your browser, but you can't install just any app. Meta's Quest Pro supports eye and face tracking and we want to find a privacy-preserving way to expose this on the web. We need consensus in the group that this is a desired feature and then start drafting an explainer. We likely need separate repos for eye and face tracking.

AlbertoElias commented 1 year ago

You can reach any website through your browser but you can't install just any app.

I see, thanks for the explanation. As PWAs are now allowed on the store, would this be a possible compromise for Web-based applications to access these sensors?

cabanier commented 1 year ago

You can reach any website through your browser but you can't install just any app.

I see, thanks for the explanation. As PWAs are now allowed on the store, would this be a possible compromise for Web-based applications to access these sensors?

As far as I know, PWAs have to declare their permissions in the manifest, but they still have to request them from the user. I suspect the requirements will be the same as for regular websites.

AlbertoElias commented 1 year ago

Yup, I agree they should definitely request permission. I'm looking for ways for PWAs to use these APIs so they're on par with native apps, and if the limiting factor is being an app accepted via a store, then maybe PWAs available on the store could be granted access to these APIs?

Sorry if I'm misunderstanding something. I think eye and facial tracking are very interesting social features and as a Web developer, I would love to make use of them if possible

cabanier commented 1 year ago

I wrote down a very basic spec on how eye and face tracking can be implemented: https://cabanier.github.io/webxr-face-tracking-1/ Here's the README. Comments welcome :-)

josephrocca commented 1 year ago

@cabanier From the readme:

This technology will NOT:

  • [...]
  • give precise information where the user is looking

[...] This API will define an extensive set of expressions and will, on a per-frame basis, report which ones were detected and how strong they are [...]

enum XRExpression {
  [...]
  "eyes_closed_left",
  "eyes_closed_right",
  "eyes_look_down_left",
  "eyes_look_down_right",
  "eyes_look_left_left",
  "eyes_look_left_right",
  "eyes_look_right_left",
  "eyes_look_right_right",
  "eyes_look_up_left",
  "eyes_look_up_right",
  [...]
}

[...] the user agent must ask the user for their permission when a session is asking for this feature (much like WebXR and WebXR hand tracking). In addition, sensitive values such as eye position must be rounded.


Some questions that I think might be relevant to the design here:

  • How common would it be for an app to ask for low-precision face tracking, but not ask for body tracking?
  • Similar question for voice/microphone and high-precision hand tracking. I'd guess that voice will almost immediately ~uniquely identify you, but even with a voice changer, my guess is that a few minutes of talking would give away a huge amount of information based on the words you say and the particular way you say them.

Multiplayer WebXR experiences are, I think, going to be increasingly "high-bandwidth-between-users" applications (kind of like a full-body video call with a mask), so I think it'll be inherently quite hard to make the 'average' multiplayer WebXR experience 'private' - at least if you're up against an organisation that professionally tracks users.

So in general I wonder how useful, and how widely used by devs, a low-precision API will be, given that it still requires a permission request, and given all the other info (voice, body, etc.) that the user is likely already giving (which would pretty easily ~uniquely identify them, and so reduces the reluctance to grant further permissions).

It seems like the aim here is to have a commonly-used low-precision API, and then (I'm guessing) a higher-precision API for certain situations, but it seems like most applications will want the higher-precision API. I'm trying to think of a common use case where the dev would request the low-precision API.

All of that said, it is obviously a good idea to give devs/users the ability to only request/give exactly the amount of information they need, and no more, so I think a low-precision API is a good idea in that sense. I'm just wondering how things will actually play out here if there's a high-precision API, and whether high precision is often needed, or whether there is other info (like voice) that already makes privacy preservation futile in almost all cases where face tracking is requested by the dev.

cabanier commented 1 year ago

How common would it be for an app to ask for low-precision face tracking, but not ask for body tracking?

I don't have data on that. Body tracking gives a lot more information away because it returns the positions of the user's body. This makes it a lot more sensitive, which is why I have not proposed it.

Similar question for voice/microphone and high-precision hand tracking. I'd guess that voice will almost immediately ~uniquely identify you, but even with a voice changer, my guess is that a few minutes of talking would give away a huge amount of information based on the words you say and the particular way you say them.

I agree that microphone already gives up a lot of privacy. I guess the browser implementors felt that they had to add it because it was such a strong use case. (Same for camera access)

All of that said, it is obviously a good idea to give devs/users the ability to only request/give exactly the amount of information they need, and no more, so I think a low-precision API is a good idea in that sense. I'm just wondering how things will actually play out here if there's a high-precision API, and whether high precision is often needed, or whether there is other info (like voice) that already makes privacy preservation futile in almost all cases where face tracking is requested by the dev.

I think you answered your own question: a web API should only report the minimum needed for what it is designed for. This API is designed to animate a person's avatar, so things like eye tracking don't need to be super precise in space or time. Making those optionally high precision and giving the user a choice would be very confusing.
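As a sketch of that avatar use case: the per-frame weights map and the ARKit-style blendshape names below are assumptions, since the draft only defines the XRExpression names.

import * as THREE from 'three';

// Map a few of the draft's XRExpression names onto avatar morph targets.
// The morph target names follow a common ARKit-style convention and are assumed.
const EXPRESSION_TO_MORPH: Record<string, string> = {
  eyes_closed_left: 'eyeBlinkLeft',
  eyes_closed_right: 'eyeBlinkRight',
  eyes_look_left_left: 'eyeLookOutLeft',
  eyes_look_right_left: 'eyeLookInLeft',
};

function applyExpressions(weights: Map<string, number>, head: THREE.Mesh) {
  for (const [expression, morphName] of Object.entries(EXPRESSION_TO_MORPH)) {
    const index = head.morphTargetDictionary?.[morphName];
    const strength = weights.get(expression);
    if (index !== undefined && strength !== undefined && head.morphTargetInfluences) {
      // Low spatial and temporal precision is fine here: the values only drive
      // avatar blendshapes, not a precise gaze reconstruction.
      head.morphTargetInfluences[index] = strength;
    }
  }
}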

dmarcos commented 6 months ago

It might be a good time to revisit this. Today, Safari for Apple Vision Pro is shipping the transient-pointer API, which allows retrieving gaze information on a pinch gesture. WebXR / WebGL applications can implement selection by gaze, but not hover effects on UI elements, since gaze info is only available on pinch.
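For reference, a minimal sketch of that selection-by-gaze pattern with today's transient-pointer input; refSpace and selectObjectUnderRay are assumed app-side pieces:

// An input source with targetRayMode 'transient-pointer' appears when the user
// pinches; its target ray is derived from gaze at that instant and it goes away
// afterwards, which is why hover effects between pinches aren't possible.
function handleTransientPointer(session: XRSession, refSpace: XRReferenceSpace) {
  session.addEventListener('selectstart', (event: XRInputSourceEvent) => {
    if ((event.inputSource.targetRayMode as string) !== 'transient-pointer') return;
    const pose = event.frame.getPose(event.inputSource.targetRaySpace, refSpace);
    if (pose) selectObjectUnderRay(pose.transform); // app-specific raycast, e.g. with three.js
  });
}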

There's an interactive-regions proposal to let applications define regions that the browser / OS is in charge of highlighting when the user's gaze hovers over them. This way, the gaze information is not exposed to the page. One could implement 2D UIs, but at the expense of how flexibly they can be integrated visually into the application, since the OS, not the application, is in charge of rendering them in a separate layer. An additional challenge would be interacting with objects / geometries, since highlighting those requires custom shaders that are application specific.

A different route, described in this issue, would be exposing eye tracking information to the page. This would be the simplest and most flexible API, but it poses additional privacy concerns. I wonder if those can be mitigated.

It's likely that in the next 1-2 years eye tracking + gaze will be the common input across consumer headsets. We will need to converge on a solution that enables cross-platform UIs for WebXR applications.

cabanier commented 6 months ago

@dmarcos do you want to discuss this at the face to face next week? (March 25-26)

dmarcos commented 6 months ago

@cabanier thanks. maybe. where is it?

cabanier commented 6 months ago

@cabanier thanks. maybe. where is it?

Meta offices in Bellevue. You can also call in if you don't want to fly

dmarcos commented 6 months ago

Tagging for /facetoface: Incorporating gaze into WebXR experiences

AdaRoseCannon commented 6 months ago

/facetoface This was missed last week; we can discuss it in the unconference time at the end of the day.