immersive-web / webxr-hand-input

A feature repo for working on hand input support in WebXR. Feature lead: Manish Goregaokar
https://immersive-web.github.io/webxr-hand-input/
Other
106 stars 17 forks source link

Have a joint set object that lets you fetch all joint poses at once #37

Closed Manishearth closed 4 years ago

Manishearth commented 4 years ago

During discussion of https://github.com/immersive-web/webxr-hand-input/pull/32 in today's call it seemed like while everyone wanted to solve the problem of ~200 objects being created per frame, folks didn't want to cache XRJointPoses because it would be unexpected.

OpenXR's API for this returns a single object containing all joint pose info. We could do something similar.

We wish to keep the joint spaces because it's useful to be able to relate them with each other via getPose(), but it would be nice if there were some "all hand joints" object that could be accessed for efficiency.

I'm imagining an API like:

interface XRAllJoints {
   constructor(XRHand hand, XRSpace baseSpace);
   void populatePoses(XRframe frame);
   DOMPointgetPosition(unsigned int id);
   DOMPoint getOrientation(unsigned int id);
   Float32Array getTransform(unsigned int id);
   float getRadius(unsigned int id);
}

cc @cabanier @thetuvix @toji

cabanier commented 4 years ago

Is transform needed? It seems a UA can calculate it from position and orientation

void populatePoses(XRframe frame);

We probably should also pass XRHand here.

Manishearth commented 4 years ago

Is transform needed? It seems a UA can calculate it from position and orientation

Right, we'll calculate it for the user, the same way we do for XRRigidTransform. It can be lazily populated.

We probably should also pass XRHand here.

Nah that should be in the constructor

cabanier commented 4 years ago

Is transform needed? It seems a UA can calculate it from position and orientation

Right, we'll calculate it for the user, the same way we do for XRRigidTransform. It can be lazily populated.

The conversion is so easy; we shouldn't add this to the API.

We probably should also pass XRHand here.

Nah that should be in the constructor

Yes, you're right :-)

Manishearth commented 4 years ago

I find it weird that we would return transforms in XRPose but not here, and also would like to avoid people calling new XRRigidTransform(pos, orient).transform

Manishearth commented 4 years ago

A thing to note: my current API does not reduce the JS-system traffic much. If we wish to do that we'd probably have to introduce a way to get large frozen arrays wholesale from the API.

Maksims commented 4 years ago

Low level solution would be providing something like Float32Array to API, which would populate it with interleaved data. But that of course would be something "non-weby", although highest performance and allocation-free:

interface XRAllJoints {
   constructor(XRHand hand, XRSpace baseSpace);
   void getPositions(XRframe frame, Float32Array data);
   void getOrientations(XRframe frame, Float32Array data);
   ...
}

Regarding transform, it is interesting to see if people use it, or actually use position and orientation instead. Especially how does this relates to engines with different coordinate systems, do they first modify position/orientation and then construct transform or they take transform and flip it by axes? In PlayCanvas we have same coordinate system as WebXR API, and used position + orientation in case of Hands and used transform because getPose returned transform only, with position and orientation inside of it. So seems like there are some inconsistencies already.

cabanier commented 4 years ago

Could we further simplify this with just a call on XRFrame:

partial interface XRFrame {
  bool getjointPoses(XRHand hand, XRSpace baseSpace, Float32Array positions, Float32Array orientations);
}
Manishearth commented 4 years ago

You'd want radii as well. But yes, we could do that. It feels less webby but we do have a webby API too.

I think frameworks are not going to be using the flat arrays as flat arrays, though.

cabanier commented 4 years ago

If you use DOMPoint each attribute access is a trip to c++ so that's a bit expensive as well.

we do have a webby API too.

yes.

toji commented 4 years ago

I'm not sure if I understand the motivation behind this returning position/orientation/transform separately instead of XRRigidTransforms? Then the question of whether or not matricies are included becomes moot. (Also, I'm not a fan of the idea that transform everywhere else means XRRigidTransform and here it means a matrix.)

Could we further simplify this with just a call on XRFrame..

Yeah, Rik's snippet is a bit more what I had in mind when I brought this up on the call. I was actually wondering about taking it a step further and allowing this variant to be used for ANY spaces, like so:

partial interface XRFrame {
  bool getPoses(sequence<XRSpace> spaces, XRSpace baseSpace, Float32Array positions, Float32Array orientations);
}

At which point I could use this for everything but the viewer pose if I wanted. Coincidentally, since the hand is just an enumerable list of joint spaces it would still work to simply do this:

frame.getPoses(inputSource.hand, referenceSpace, outPositions, outOrientations);

Regarding transform, it is interesting to see if people use it, or actually use position and orientation instead.

I know that at least Three.js uses transform matrices directly.

Manishearth commented 4 years ago

I'm not sure if I understand the motivation behind this returning position/orientation/transform separately instead of XRRigidTransforms

We can't! XRRigidTransforms contain DOMPointReadOnlys, so suddenly we have a way of modifying them, which is ... bad. What would happen if I took one of these transforms and slapped it into getOffsetReferenceSpace()? 😄

Yeah, Rik's snippet is a bit more what I had in mind when I brought this up on the call. I was actually wondering about taking it a step further and allowing this variant to be used for ANY spaces, like so:

I felt that having a single output object would be very useful here. We could even give it both APIs with a DomPoint getter and a Float32Array attribute for each thing, and lazy initialization would mean these only get updated when you need them.

cabanier commented 4 years ago
partial interface XRFrame {
  bool getPoses(sequence<XRSpace> spaces, XRSpace baseSpace, Float32Array positions, 
                Float32Array orientations);
}

I like that one! This way authors can efficiently get a subsection. We do want to pass XRJointSpaces though and return radii.

cabanier commented 4 years ago

Yeah, Rik's snippet is a bit more what I had in mind when I brought this up on the call. I was actually wondering about taking it a step further and allowing this variant to be used for ANY spaces, like so:

I felt that having a single output object would be very useful here. We could even give it both APIs with a DomPoint getter and a Float32Array attribute for each thing, and lazy initialization would mean these only get updated when you need them.

If we can avoid it, we should not introduce new objects. It's also an additional burden on developers because now they have to keep track of an additional helper object.

Manishearth commented 4 years ago

Yeah but it's neater than having to keep track of three different array objects IMO. I don't mind moving the API to the frame, but I still think we should have a single object being populated, ideally one which can expose multiple views over this data.

cabanier commented 4 years ago

What do framework authors think? @Maksims @mrdoob @dmarcos Should we have a single object XRAllJoints that manages the poses, or a single API call on XRFrame that returns separate arrays to describe the poses.

Manishearth commented 4 years ago

cc @raananw too

Manishearth commented 4 years ago

To be clear, my current proposal is

interface XRAllJoints {
   constructor(XRHand hand); // can also be a method on XRHand

   // Attributes for directly feeding to the GPU
   attribute readonly Float32Array positions;
   attribute readonly Float32Array orientations;
   attribute readonly Float32Array radii;

   // Getters for feeding into, e.g. ThreeJS object metadata
   DOMPoint getPosition(unsigned int id);
   DOMPoint getOrientation(unsigned int id);
   Float32Array getTransform(unsigned int id);
   float getRadius(unsigned int id);
}

partial interface XRFrame {
   void populatePoses(XRAllJoints joints, XRSpace baseSpace);
}

You would use the API like this:


// at the beginning of the session
let leftJoints = new XRAllJoints(leftInput.hand);

// each frame
frame.populatePoses(leftJoints, localSpace);

for (i = 0; i < number_of_joints; i++) {
   leftModel.joints[i].transform = leftJoints[i].transform;
   leftModel.joints[i].radius = leftJoints[i].radius;
}

// render leftModel

but you can also directly feed the Float32Arrays to the GPU. This would likely be done along with https://github.com/immersive-web/webxr-hand-input/issues/36 to ensure that the total number of joints is fixed.

We can design these to lazy-load so that by default if you are not using getPosition() those objects will not be updated.

dmarcos commented 4 years ago

No strong opinions on one function per joint vs. retrieve all joints at once. Whatever protects the most against allocations / GC pauses. Maybe I just expect consistency with other parts of the spec. A-Frame uses transform.matrix for both joints and head pose. Good work everybody.

Maksims commented 4 years ago

frame.getPoses - probably wont work, as it conflicts with Pose and JointPose. Although unified way of getting poses data into pre-allocated, reusable Float32Arrays - is a definitely good thing.

I believe accessing input source related data should be by input source, not for multiple of them in one go. From application logic, this is because input sources can dynamically be added and removed. Also tracking information can be lost on them. So case where tracking is lost on one hand but not the other - is a common case.

What about this:

boolean XRHand.getPoses(XRSpace baseSpace, Float32Array transforms, Float32Array positions, Float32Array orientations, Float32Array radii);

Where any of Float32Array arguments can be null/undefined. This provides ability to query only information needed. Also should it return true/false when it did populated information or not, based on tracking state?

Manishearth commented 4 years ago

@Maksims This is not for multiple input sources at one go, this is for one XRHand at a time, and each XRHand is tied to an input source. This is true in both my proposed API (where XRAllJoints is constructed from an XRHand) and @cabanier's (where getJointPoses() accepts an XRHand)

Also should it return true/false when it did populated information or not, based on tracking state?

Yeah this is also why I prefer the XRAllJoints approach, because each getter can return null if it wishes.

It's worth figuring out (separately, in https://github.com/immersive-web/webxr-hand-input/issues/36) if we want to make it so that either the entire hand is reported or none of it is. In that case the API you and @cabanier proposed (in https://github.com/immersive-web/webxr-hand-input/issues/37#issuecomment-665720921) would work, but I still prefer having a separate object.

cabanier commented 4 years ago

Also should it return true/false when it did populated information or not, based on tracking state?

Yeah this is also why I prefer the XRAllJoints approach, because each getter can return null if it wishes.

XRAllJoints feels a bit heavyweight. I'd much prefer if we could get by with a couple of function calls.

Thinking about this some more, we might be able to simplify the spec quite a bit. For instance, the radius is not relative to another XRSpace so there's no real reason to require a baseSpace. This means that XRJointPoseand thus getJointPose are not really needed and could be dropped and we could get by with the regular getPose.

The interface could look like this:

partial interface XRFrame {
 float? getJointRadius(XRJointSpace joint); // replaces getJointPose

  // for bulk operations
 bool getPoses(sequence<XRSpace> spaces, XRSpace baseSpace, Float32Array positions, Float32Array orientations);
 bool getJointRadii(sequence<XRJointSpace> spaces, Float32Array radii);
};

getPoses could move to the WebXR spec since it's not specific to hands. We might even be able to get rid of XRJointSpace. We might also consider using a sequence of DOMPointInit instead of Float32Array but then each access will cross the c++ boundary.

You would use the API like this:

  // each frame
  if (frame.getPoses(leftJoints, localSpace, leftModel.joints.positions, leftModel.joints.orientations) 
    && frame.getJointRadii(leftJoints, leftModel.joints.radii) {
    // render hand;
  } else {
    // remove hand
  }
Manishearth commented 4 years ago
 bool getPoses(sequence<XRSpace> spaces, XRSpace baseSpace, Float32Array positions, Float32Array orientations);

How does this work with tracking loss, though? Another array of bools?

cabanier commented 4 years ago
 bool getPoses(sequence<XRSpace> spaces, XRSpace baseSpace, Float32Array positions, Float32Array orientations);

How does this work with tracking loss, though? Another array of bools?

At least for Oculus, poses have an all or nothing success rate so a simple bool return suffices. For UA's that can do partial hand results, they could put NaN's as return values or we could define that they should return emulated values.

Manishearth commented 4 years ago

At least for Oculus, poses have an all or nothing success rate so a simple bool return suffices. For UA's that can do partial hand results, they could put NaN's as return values.

Well, the array is not just XRJointSpaces 😄 . This is kinda why I want to focus on joints and do this "for the entire hand", because OpenXR is also all-or-nothing here.

cabanier commented 4 years ago

Well, the array is not just XRJointSpaces 😄 . This is kinda why I want to focus on joints and do this "for the entire hand", because OpenXR is also all-or-nothing here.

Ok, then we could define it as "if there is tracking loss, try to return an emulated value. If an emulated value can't be returned, return NaN" Or alternatively, combine the two calls into getJointPoses and only make it work for hands.

Manishearth commented 4 years ago

Yeah, but this is kinda why I prefer the AllJoints model, because we can do both: consumers who wish to pass data to the GPU can use the arrays, consumers who want DOMPoints can check the DOMPoints. There's some JS-native traffic there but the main goal is fixing allocations IMO.

I'm not against getPoses(), though. I just find it potentially unnecessary. An advantage of that model is that you can pick and choose what you want: if you want positions, you can ask for positions, if you want transforms you can ask for transforms (and null out the others).

Perhaps we should have something like this instead?

dictionary FillPoses {
   Float32Array? positions;
   Float32Array? orientations;
   Float32Array? transforms;
}

interface XRFrame {
   void fillPoses(sequence<XRSpace> spaces, XRSpace base, FillPoses poses)
}

The advantage here is that it can be called as fillPoses(spaces, base, {transforms: transforms}) or fillPoses(spaces, base, {positions: positions, orientations: orientations}) per preferences and you don't have to pass in nulls,

Idk, I'm not a fan of passing a million parameters to a function 😄

cabanier commented 4 years ago

Perhaps we should have something like this instead?

dictionary FillPoses {
   Float32Array? positions;
   Float32Array? orientations;
   Float32Array? transforms;
}

interface XRFrame {
   void fillPoses(sequence<XRSpace> spaces, XRSpace base, FillPoses poses)
}

That's a neat idea. Do you know of any APIs on the web platform that use this idiom? (= pass optional parameters in a dictionary that are filled in by the call)

Idk, I'm not a fan of passing a million parameters to a function 😄

I agree it's not as pretty but it does save on typing.

Manishearth commented 4 years ago

Not aware of any APIs that do this.

I'm not sure it saves on typing, it's quite likely that the frameworks will be stashing positions orientations and transforms somewhere anyway

cabanier commented 4 years ago

Not aware of any APIs that do this.

Well... then we probably shouldn't do it :-(

I'm not sure it saves on typing, it's quite likely that the frameworks will be stashing positions orientations and transforms somewhere anyway

fillPoses(spaces, base, positions, orientations);

is shorter than

fillPoses(spaces, base, {positions: positions, orientations: orientations});
Manishearth commented 4 years ago

I believe you can just write {positions, orientations} but more specifically I'd expect people to have these around in an object anyway

Either way, unsure if this is good

Maksims commented 4 years ago
fillPoses(spaces, base, positions, orientations);

is shorter than

fillPoses(spaces, base, {positions: positions, orientations: orientations});

First option does not allow extra arguments, like "radius" or "transforms", while second option gives you ability to extend object with keys, depending on available data.

This is actually very generic, gives you ability to query transforms for gripSpace:

let data = { transforms: transformsData };
frame.fillPoses([ inputSource.gripSpace ], referenceSpace, data);

As well as joints data:

let data = { orientations: orientationData, positions: positionsData, radii: radiiData };
frame.fillPoses(handJoints, referenceSpace, data); 

This suits engines which use transform or position+orientation, as well as any extra data, like radius of joints. No allocations, no null arguments, pretty clean and self explanatory.

The only thing that documentation has to clarify, is how many floats are used per item, like transformations are mat4x4, so 16 floats, positions are 3 floats, orientations are 4 floats, radius is 1 float. So developer has to create Float32Array with no less data than that.

RaananW commented 4 years ago

Looking at the current API (in other modules and WebXR itself), frame.getHandPose (or Poses) feels much more consistent. Same goes to sticking with XRRigidTransforms (which is used throughout the API) and not positions/rotation/matrices.

The simplest approach i could think of was this -

interface XRFrame {
  FrozenArray<XRJointPose> getJointPoses(XRHand hand, XRSpace referenceSpace);
  // while keeping the current API for specific joint queries
  XRJointPose getJointPose(XRJointSpace joint, XRSpace relativeTo);
}

This will return poses for a single hand (and not both at the same time), but will not require calling the getJointPose API 20 times.

A different approach - similar to getViewerPose, we could also get an already populated XRHand Set with the relevant information. something like this:

interface XRFrame {
  FrozenArray<XRHand> getHandPoses(XRSpace referenceSpace);
}

which will require two changes - a different reference in the XRInputSource (xrInput.hand is no longer XRHand, but an index or some form of a placeholder that can be associated with the XRHand object), and that the XRHand object will not contain XRSpace but XRPose objects.

Manishearth commented 4 years ago

This will return poses for a single hand (and not both at the same time), but will not require calling the getJointPose API 20 times.

@RaananW Unfortunately this does not satisfy the requirements, the primary issue at hand is not that you have to call getJointPose() 20 times, it is that each frame the user agent must allocate 25 x 2 XRJointPoses (which is also 50 XRRigidTransforms and 100 DOMPointReadOnly). It's not possible to reuse the XRRigidTransforms since they're assumed to be immutable, and furthermore we expect XRPose to be something the author may wish to hold on to at times.

RaananW commented 4 years ago

Got it. But if we don't use XRRigitTransformation (or introduce XRMutableRigidTransformation), this can still be a single reusable object. Both suggestions actually.

Manishearth commented 4 years ago

@RaananW Yeah that's what my original API proposes

Manishearth commented 4 years ago

/agenda to discuss this

Manishearth commented 4 years ago

So I discussed this more with @toji and a couple things occurred to me:

Firstly, if we implement https://github.com/immersive-web/webxr-hand-input/issues/36 and require devices report "all or none", we can turn XRHand into an iterable and can pass it in whenever sequence<XRSpace> is required.

Secondly, overloading is possible! We can have both these APIs at once:

interface XRFrame {
    void getAllPoses(sequence<XRSpace> spaces, XRSpace base, Float32Array positions, Float32Array orientations);
    void getAllPoses(sequence<XRSpace> spaces, XRSpace base, Float32Array transforms);
}

Note, however, while we can have APIs that overload sequence<> the WebIDL overload resolution algorithm does not seem to be able to distinguish between sequence<A> and sequence<B>, and furthermore will not be able to distinguish between those and Float32Array; so we can only have one method per number of arguments.

We can also have a getAllJointRadii(XRHand, Float32Array radii) API that gets all joint radii at once.

I would kinda prefer to have an API that works on DOMPoints but I understand the pushback.

Maksims commented 4 years ago

And how lost tracking will be handled? Fills Float32Arrays with NaN's? Or returns boolean (modified==true) and keeps arrays untouched?

cabanier commented 4 years ago

And how lost tracking will be handled? Fills Float32Arrays with NaN's? Or returns boolean (modified==true) and keeps arrays untouched?

The functions would have to return a bool. I think @Manishearth forgot to add that since he said:

require devices report "all or none"

cabanier commented 4 years ago

I would kinda prefer to have an API that works on DOMPoints but I understand the pushback.

Since these are low level APIs, I think it's ok for them to be less developer friendly in exchange for higher performance.

Manishearth commented 4 years ago

"all or none" works if this is a hand-specific API, but not if this is an API requesting multiple spaces in general IMO

But we can set NaN

Manishearth commented 4 years ago

TAG discussion at https://github.com/w3ctag/design-principles/issues/231

toji commented 4 years ago

I take it as a good sign that we're one of several groups asking about this, rather than the weird outlier. Interested to see what comes of it.

cabanier commented 4 years ago

@Manishearth Can you prepare a PR for this? If we wait too long, we might forget some details.

Manishearth commented 4 years ago

I think the issue is that there are two approaches. I can prepare a PR for both but I'm a bit busy today.

cabanier commented 4 years ago

I believe your proposal in https://github.com/immersive-web/webxr-hand-input/issues/37#issuecomment-667308534 is the least controversial one.

thetuvix commented 4 years ago

I'll likely need to join the call late today, though I'm very interested in this topic! This exact bulk joint update API discussion was the last thing to converge in the OpenXR hand tracking discussions.

I am generally in favor of pushing on the "fill-array" style patterns here that TAG is warming up to, as we will need them even more strongly when we are returning mesh vertex/index buffers to apps. I'm most in favor of @Manishearth's original XrAllJoints proposal over the generic getPoses approach, as I expect us to have related data to serve up here over time. I could warm to the getPoses approach, so long as we find a solution to reasonably handle emulatedPosition and tracking loss, though I expect that will ultimately feel messier than bool attributes on an XrAllJoints-style type.

Manishearth commented 4 years ago

@thetuvix we're discussing this now

thetuvix commented 4 years ago

Sorry I couldn't make it to the call on Tuesday!

My primary concern with seemingly simpler approaches like https://github.com/immersive-web/webxr-hand-input/issues/37#issuecomment-667308534 is that I suspect the getPoses overload approach will become more complex than the XrAllJoints proposal once all the data values we need to communicate are represented. (e.g. radius, isDetected, isEmulated)

Here are the per-hand and per-joint values that the hand module will need to communicate each frame in some manner:

The XRAllJoints approach gives us a natural place to hang both per-hand scalars and various per-joint arrays, including any additional arrays in the future for things like linearVelocity/angularVelocity.

Note that as part of #36, we're making the key simplification for developers that if they get any joints back (i.e. isDetected is true), they will get back a usable pose and radius for all joints in the joint set they requested, even if some joints have their isEmulated bool set to true this frame.

Manishearth commented 4 years ago

We did decide to land https://github.com/immersive-web/webxr-hand-input/issues/36 so the isDetected isn't needed anymore. isEmulated might be.