immersive-web / hit-test

https://immersive-web.github.io/hit-test/

Hit test results are fundamentally rays, not poses #86

Open Manishearth opened 4 years ago

Manishearth commented 4 years ago

A hit test result is fundamentally a ray with the origin being the point of intersection, and direction being perpendicular to the plane being intersected with. This is how, for example, Unity represents it.

We're representing it as an XRRigidTransform/XRPose that produces such a ray, i.e. an XRRigidTransform which is capable of transforming the y-axis ray to the desired ray. However, such a transform is not unique (it can put the x and z axes anywhere), and it seems weird to provide this information in such a form. It would be better if we directly produced a ray -- it's always possible to get a rigid transform out of a ray if desired.

[ARCore](https://developers.google.com/ar/reference/java/arcore/reference/com/google/ar/core/HitResult#getHitPose()) defines it more rigidly, and the tests follow this definition. Note that the only thing actually dependent on the planes is the intersection point and normal, everything else can be computed given the point, normal, and presupplied input ray.

To me the use of poses/rigid transforms here doesn't make sense, it's fundamentally not a pose. We can attach additional made-up axes to it (the way ARCore does) to make it into a pose, but I don't see the practical use of doing this. The user will likely wish to consume this information in ray form as well, I mostly see users using the position, and if they actually use the orientation it would be to obtain a direction vector.

I suppose using it as a pose can be useful for positioning little indicator objects on the place the hit test actually occurred, but as mentioned before the user has enough information to calculate this for themselves, and this doesn't convince me that an arbitrary choice of X and Z axes is useful.
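To illustrate "the user has enough information to calculate this for themselves": the ray form can be read straight out of the pose's column-major matrix. The helper name here is mine, and it assumes the convention from the spec text that the transform maps the +Y axis onto the surface normal.

```javascript
// Illustrative sketch: recover the "ray form" (hit point + surface normal)
// from a hit-test pose's 4x4 column-major matrix, assuming the transform
// maps the +Y axis onto the normal.
function rayFromHitPose(matrix) {
  return {
    // Translation column = the intersection point.
    origin: { x: matrix[12], y: matrix[13], z: matrix[14] },
    // Second basis column = where +Y ends up, i.e. the surface normal.
    normal: { x: matrix[4], y: matrix[5], z: matrix[6] },
  };
}
```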

The status quo is that the tests don't match the API as specced, and that must be fixed one way or the other at the very least.

I see three paths forward for this:

Thoughts? Is there a purpose for the result being a full pose?

cc @bialpio @toji

(also cc @thetuvix to inform OpenXR hit testing API choices)

bialpio commented 4 years ago

I think it's still better to present the result as a pose, even if the Z and X axes are unconstrained. My goal here was to make it familiar & consistent with frame.getPose(spaceA, spaceB). In addition, poses are very explicit about being expressed relative to reference spaces & carry additional information with them (emulatedPosition), whereas I thought the intent of the XRRay was to be a mathematical concept (similar to XRRigidTransform).

I think we can explicitly specify that X and Z axes are undefined on the returned pose. The WPTs will need to be relaxed in this case, but that should be ok.

blairmacintyre commented 4 years ago

I think the issue is really interesting, and I agree with both of you. 😄

I'd want getPose because that's such a common use, and if the Z and X axes are arbitrary, many folks will be just as happy letting the system choose them for them as doing the obvious math to do it themselves.

Perhaps simply doing what a physics collision system would do, and adding getHitPoint and getNormal (which are the two things you are encapsulating in your getRay), would make sense. There is the collision point, there is the surface normal at the point of collision, and there is the system's "best guess at a reasonable pose"...

Manishearth commented 4 years ago

Hmm, emulatedPosition is a good point.

I think XRRays are also relative to a space given the way we supply them. I think that while symmetry is valuable, it's not worth it to make the API overall worse to get that symmetry, since now the normal vector is not directly exposed. But this is kind of academic anyway since we can't make this breaking change.

Could you update the relevant tests? We basically need a hit_test_transform_equal function that checks if positions are the same and the orientations transform the y-axis the same way.
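A sketch of that helper, assuming transforms are plain `{position, orientation}` objects with quaternion orientations (the name `hit_test_transform_equal` is from the comment above, camel-cased here): positions must match within an epsilon, and the two orientations must rotate the unit Y vector to the same place, so transforms that differ only by a spin around the y-axis compare equal.

```javascript
// Rotate the unit Y vector (0, 1, 0) by a quaternion {x, y, z, w}.
// Expanded from v' = q v q* for v = (0, 1, 0).
function rotateY(q) {
  const { x, y, z, w } = q;
  return {
    x: 2 * (x * y - w * z),
    y: 1 - 2 * (x * x + z * z),
    z: 2 * (y * z + w * x),
  };
}

// Two hit-test transforms are "equal" if their positions match and their
// orientations map +Y (the normal direction) to the same vector.
function hitTestTransformEqual(a, b, epsilon = 1e-5) {
  const close = (p, q) =>
    Math.abs(p.x - q.x) < epsilon &&
    Math.abs(p.y - q.y) < epsilon &&
    Math.abs(p.z - q.z) < epsilon;
  return close(a.position, b.position) &&
         close(rotateY(a.orientation), rotateY(b.orientation));
}
```

A 180° spin about Y compares equal to the identity under this check, while a 90° tilt about X does not.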

Manishearth commented 4 years ago

Perhaps simply doing what a physics collision system would do, and adding getHitPoint and getNormal (which are the two things you are encapsulating in your getRay), would make sense. There is the collision point, there is the surface normal at the point of collision, and there is the system's "best guess at a reasonable pose"...

I like this, this is similar to the Unity API.

What if we created an XRHitTestPose: XRPose with these APIs (as attributes, not methods)? We continue to allow implementors to arbitrarily pick how X/Z get rotated in the transform field.

The tests can be updated to check these and we can have additional checks that ensure that the transform correctly matches the point/normal.

bialpio commented 4 years ago

I think that the result of a hit test is semantically closer to XRPose (position and orientation relative to some space) than to XRRay (an infinite half-line expressed in some space). There is a physical similarity between an XRPose with 2 unconstrained axes and an XRRay, but to me that doesn't seem like a valid enough reason to use XRRay to represent the results of a hit test. XRHitTestPose sounds like something worth considering, but my approach to the API was to only expose capabilities that are not otherwise available - the plane normal is already encapsulated in the pose and can be extracted in multiple ways. There are a lot of helpers we could add to the spec; I'm just not convinced that's where our current efforts should go.

blairmacintyre commented 4 years ago

I am not super-happy with XRRay over XRPose because (as you say @bialpio) it's not really "correct" either.

My preference would be to update XRHitTestResult with a getHitLocation and a getHitNormal (or similar names) to retrieve the collision point and normal, and define the pose relative to them (as discussed above).

I think many people will want the pose, because a common use is to just display some content at the location of the hit, and letting the API choose the undefined two axes is just as good as (and even simpler than) doing it yourself. But there will be other uses that really want the point and vector, so giving them directly seems right.

(pardon my crap IDL, I'm sure what I put below isn't exactly right, but you see what I'm saying -- perhaps I should be passing in a space, for example).

//
// Hit Test Results
//
[SecureContext, Exposed=Window]
interface XRHitTestResult {
  XRPose? getPose(XRSpace baseSpace);
  DOMPointReadOnly? getPosition();
  DOMPointReadOnly? getNormal();
};
Manishearth commented 4 years ago

Yeah, to be clear, I'm past suggesting XRRay; I prefer Blair's idea, though I think it should be on a subclass of XRPose, not on XRHitTestResult, since XRHitTestResult doesn't have a "default space" to relate to.

interface XRHitTestPose : XRPose {
  DOMPointReadOnly? getPosition();
  DOMPointReadOnly? getNormal();
};
blairmacintyre commented 4 years ago

I was suggesting the other way to avoid creating a pose (with "junk" in it) that we don't need.

I admit I was also trying to avoid wasting "work" computing (and allocating) poses that you don't want or need. By the time you create a pose, the normal is just the Y vector and the position is well defined too. This concern is probably not a big deal (since there won't be piles of hit tests per frame).

Manishearth commented 4 years ago

I was suggesting the other way to avoid creating a pose (with "junk" in it) that we don't need.

Attributes can be lazily evaluated. Implementation-wise, this would mean that the transform is stored as nullable and constructed when necessary based on the type.
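A minimal sketch of the lazy-evaluation idea (the class name and the stand-in construction are hypothetical, not part of any spec): the transform field starts out null and is only built the first time it is read.

```javascript
// Hypothetical lazily-evaluated hit-test pose: position and normal are
// stored directly; the (partially arbitrary) transform is built on demand.
class LazyHitTestPose {
  constructor(position, normal) {
    this.position = position;
    this.normal = normal;
    this._transform = null; // not built until requested
    this.buildCount = 0;    // instrumentation for this sketch only
  }
  get transform() {
    if (this._transform === null) {
      this.buildCount++;
      // Stand-in for the real construction: any rigid transform whose
      // +Y axis maps onto `normal` would do here.
      this._transform = { position: this.position, yAxis: this.normal };
    }
    return this._transform;
  }
}
```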

The thing you proposed can work if getPosition() and getNormal() have an XRSpace argument. In that case I think it makes more sense to me, and I have no strong preference.


@bialpio @toji which do you prefer:

interface XRHitTestResult {
  XRPose? getPose(XRSpace baseSpace);
  DOMPointReadOnly? getPosition(XRSpace baseSpace);
  DOMPointReadOnly? getNormal(XRSpace baseSpace);
};

vs

interface XRHitTestResult {
  XRHitTestPose? getPose(XRSpace baseSpace);
};

interface XRHitTestPose : XRPose {
  readonly attribute DOMPointReadOnly position;
  readonly attribute DOMPointReadOnly normal;
};
bialpio commented 4 years ago

I actually prefer another option, which is to refrain from adding helpers until we get strong signals that they are needed. I'm worried that this is a slippery slope and we'll turn this (and the main spec) into a bunch of mathematical utilities. Especially given that it seems to me that both of those methods could be written as one-liners in JS, something along the lines of:

function getPosition(pose) {
  return pose.transform.position;
  // alternatively:
  // return DOMMatrix.fromFloat32Array(pose.transform.matrix).transformPoint(new DOMPoint(0, 0, 0, 1));
}

function getNormal(pose) {
  return DOMMatrix.fromFloat32Array(pose.transform.matrix).transformPoint(new DOMPoint(0, 1, 0, 0));
}
Manishearth commented 4 years ago

I don't consider these helpers: I consider these to be the primary API we should have had in the first place, with .transform being a helper, if anything. We can't get rid of transform now because of backwards compatibility, but it's not the primary way we should be exposing this data to users as they will mostly care about positions and normals.

blairmacintyre commented 4 years ago

We can't get rid of transform now because of backwards compatibility, but it's not the primary way we should be exposing this data to users as they will mostly care about positions and normals.

The API hasn't been approved. We should focus on making it correct, since we aren't breaking anything approved.

In that light: the "correct" API does not return a matrix, since the underlying conceptual model does not return a pose.

I would support having a utility method that returns a pose, because that is a common use case (as a programmer, having to take the point+normal and generate a pose is "a bit of a PITA"), but framework and app programmers should really be thinking about the fact that this matrix is partially arbitrary, and they could instead pick the extra axes based on their application-specific needs.

Manishearth commented 4 years ago

/agenda to figure out the best path forward here

kearwood commented 4 years ago

Exposing a matrix in the hit test results, essentially, is requesting additional information from the platform -- orientation around the surface normal vector. Some reasonable algorithms could be explored to make this useful, such as ensuring that an object is rendered facing the device or in a way that is accessible given the user's preferences.

The problem with this, however, is that it assumes that hit testing is used primarily for placement of objects.

Many consider a hit test as a low-level concept, which can be used as a primitive composing a variety of higher-level algorithms, not just placement of an object on a surface. For example, hit tests can provide world understanding for AI actors that navigate autonomously with path-finding algorithms. They can also be used to help select appropriate impulse response samples for processing of audio reflections / reverb or for rendering large-scale-ambient-occlusion effects.

Perhaps we could say that if the intent is to have the API provide a ready-to-use matrix that positions an object in a useful way in space, such higher-level logic should be activated by requesting an XRAnchor with XRHitTestResult.createAnchor.

Unlike hit tests, anchors have the added responsibility of providing consistent orientation over multiple frames, and thus expose more information.

Manishearth commented 4 years ago

This was discussed in the call today. A major new point brought up was that hit testing is often used for orienting objects placed by the user. I'm not going to fully articulate that position for fear of accidentally misrepresenting it (but would appreciate if someone who is in that camp does so).

Currently, because a pose has additional degrees of freedom, implementations might differ on how these degrees of freedom are picked. Indeed, [ARCore and thus the Chromium implementation currently constrains this with "X+ is perpendicular to the cast ray and parallel to the physical surface centered around the hit test"](https://developers.google.com/ar/reference/java/arcore/reference/com/google/ar/core/HitResult#getHitPose()), which is not always unique and thus it's not possible to replicate the exact behavior across browsers.

The status quo of it being underspecified is not good. If content is blindly using it, it will break when used on a browser other than the one it was tested on. If content is converting it to a ray before recreating the transform it needs, there was little point to exposing the transform in the first place!

I feel like we are generally in agreement that this should at least be specified better. Correct me if I'm wrong.

We should definitely push forward on picking a specified behavior here.


Now, my unease with actually specifying it and having it be the primary entry point for the API is that I don't think there's a logical way to specify it that is actually generally useful. One logical way to specify it would be to align the X or Z axis with one of the reference space axes (basically giving you a notion of "lie flat"), except there's no indication as to which reference space to use -- the hit test was likely created with an input space and won't know whether to pick local, local-floor, unbounded, or bounded-floor.

Except, as @thetuvix brought up during the call, while this makes sense for placing objects on the floor, you probably want something different for placing objects on a wall! And if we pick something different, what do we do with planes that are at an angle?

Plus, X needs to be perpendicular to Y. You can take your desired rough z-direction vector and apply the cross product twice (X' = Y ⨯ X ⨯ Y) to extend it to be perpendicular, unless the two vectors are parallel. You can still "pick something" in that case, but ultimately the vector space is not a smooth one (it can't ever be, due to the singularity introduced by the cross product!). So you'll have very jumpy behavior for the user if they orient themselves such that the rough Z direction is parallel to the normal; or, if the rough Z direction is defined in such a way that it can never be parallel to the normal (as it is in ARCore), they can end up in a situation where they orient themselves such that the Z direction is ambiguous. Ultimately, it's mathematically impossible to pick a smooth vector space for this. (This might be a more general concern -- I don't see applications being able to fix this either, but they have some freedom to paper over it in the way they'd find smoothest for their experience.)
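The double cross product above can be sketched with plain vector objects (Y is assumed unit-length). Note how the result collapses to zero exactly when the rough direction is parallel to Y, which is the singularity being discussed:

```javascript
function cross(a, b) {
  return {
    x: a.y * b.z - a.z * b.y,
    y: a.z * b.x - a.x * b.z,
    z: a.x * b.y - a.y * b.x,
  };
}

// X' = Y x (X x Y): projects the rough X direction onto the plane
// perpendicular to Y (for unit Y this equals X - (X.Y)Y). Returns the
// zero vector when roughX is parallel to Y -- the singular case.
function perpendicularize(roughX, y) {
  return cross(y, cross(roughX, y));
}
```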

So I'm skeptical that we can pick a decent choice of transform here that does what the user will expect for all kinds of planes, and even if we do, it will be very jumpy around a particular kind of configuration.

Personally I think most applications will want either "roughly aligned to world-space, but flat on surface", or "roughly aligned to input ray, but flat on surface" for "placing on floor", and a similar option for walls except instead of being "flat on" (y parallel to normal), it's "flat against" (z parallel to normal). This is a four-way ambiguity, not counting the fact that we don't know which world space the application wishes to use here (it can also be "facing the user"!).

My impression is that the ideal system here for placing objects seems to be some kind of smooth transition based on normal angle that's essentially about "tip the object in the direction that causes one of its coordinate planes to stick to the hit tested plane with the least angular displacement", but this doesn't fix the two-way ambiguity involved in picking where the object's coordinate planes are locked to, and that's application choice.


Given the above, to me it seems likely for applications to not like the choices we're making, and this makes me uncomfortable with transform as the primary route for applications. I would prefer to advertise position and normal as a primary route, with the people using transform being ones that explicitly look for it and thus understand what it does (and we explicitly spec out a behavior for it). Currently the discovery path leads people straight to transform, which feels like we're creating a pit of failure with potentially unexpected behavior around walls, etc.

transform strongly feels like a higher-level utility function to me here, and the only reason I think we should have it at all is backwards compatibility. The "this is how you place objects" argument in favor of it seems to make it fall squarely in the space of it being a high-level utility; frameworks should be figuring that out.

bialpio commented 4 years ago

The problem with this, however, is that it assumes that hit testing is used primarily for placement of objects.

Many consider a hit test as a low-level concept, which can be used as a primitive composing a variety of higher-level algorithms, not just placement of an object on a surface. For example, hit tests can provide world understanding for AI actors that navigate autonomously with path-finding algorithms. They can also be used to help select appropriate impulse response samples for processing of audio reflections / reverb or for rendering large-scale-ambient-occlusion effects. (...)

Correct, the API (as per explainer) focuses mainly on object placement use case. Other use cases could potentially be solved by it (maybe a bit awkwardly given the subscription-based nature of the API, as with the agent navigating the environment), but I mainly looked at it through the lens of placing objects. Tying it into the anchors API is also a good side effect of the current approach, but I did not consider it when writing the spec.

I feel like we are generally in agreement that this should at least be specified better. Correct me if I'm wrong.

Agreed, this should be fully specified to enable cross-browser compat and it needs to happen irrespective of what we decide for additional properties on the hit test result, so let's start with that.

So I'm skeptical that we can pick a decent choice of transform here that does what the user will expect for all kinds of planes, and even if we do, it will be very jumpy around a particular kind of configuration.

It may not be possible in a general case (as mentioned during the call), but I think it'll be valuable to use something that works fairly well in the common case, which to me is placing an object on the floor.

I feel like we had a similar conversation when introducing a matrix property on the ray, and it's actually a good thing we have done that since now we can leverage it to specify the algorithm here as well, roughly:

  1. Let X, Y, Z be the axes of the coordinate system coord defined by the matrix attribute of the ray used for the hit test - from the definition of the matrix, Z should be pointing towards the ray origin. Let p be the intersection point, and normal be the surface normal at the intersection point.
  2. Calculate the angle alpha between plane defined by normal & Z and YZ plane. If normal = Z, set the alpha to 0 (normal already lies on YZ plane so next step is a no-op).
  3. Rotate the coordinate system coord by alpha around the Z axis, yielding coord' with axes X', Y' and Z. normal now lies on the Y'Z plane. Note: this step should probably ensure that normal lies above or on the ZX' plane - calculating alpha appropriately should take care of that.
  4. Calculate the angle beta between Y' and normal.
  5. Rotate the coordinate system coord' by beta around the X' axis, yielding coord" with axes X', Y" and Z'. Y" now is equal to normal.
  6. Construct a pose from coord" coordinate system, placing it at p.

I believe this will work (assuming I have not messed up the math, please proofread), is resilient against the edge case of normal being equal to initial Z axis, and will make the pose's Z axis point roughly towards the origin of the ray (matching Chrome's behavior) w/o relying on any data not already available to hit test.
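A sketch of the same end state (not a literal transcription of the numbered steps, and the function names are mine): a right-handed orthonormal basis whose Y axis equals the surface normal and whose Z axis stays as close to the ray frame's Z (pointing back toward the ray origin) as orthogonality allows, with the normal-parallel-to-Z edge case patched. Inputs are assumed to be unit vectors.

```javascript
function cross(a, b) {
  return {
    x: a.y * b.z - a.z * b.y,
    y: a.z * b.x - a.x * b.z,
    z: a.x * b.y - a.y * b.x,
  };
}

function hitTestBasis(normal, rayZ) {
  // Project rayZ onto the plane perpendicular to normal.
  const d = rayZ.x * normal.x + rayZ.y * normal.y + rayZ.z * normal.z;
  let z = {
    x: rayZ.x - d * normal.x,
    y: rayZ.y - d * normal.y,
    z: rayZ.z - d * normal.z,
  };
  let len = Math.hypot(z.x, z.y, z.z);
  if (len < 1e-6) {
    // normal is (nearly) parallel to rayZ -- the patched edge case from
    // step 2. Any direction perpendicular to normal will do.
    const helper = Math.abs(normal.x) < 0.9
      ? { x: 1, y: 0, z: 0 }
      : { x: 0, y: 0, z: 1 };
    z = cross(normal, helper);
    len = Math.hypot(z.x, z.y, z.z);
  }
  z = { x: z.x / len, y: z.y / len, z: z.z / len };
  const x = cross(normal, z); // right-handed: X = Y x Z
  return { x, y: normal, z };
}
```

For a floor hit (normal straight up) with a ray coming in at 45 degrees, this yields Z pointing horizontally back toward the viewer, matching the "Z roughly towards the ray origin" behavior described above.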

Manishearth commented 4 years ago

I think this will work conceptually as a choice of transform given that backwards compatibility forces us to make one.

It still has a patched singularity around the "normal = Z" case, which means that the placed object ghost will rotate erratically around that point -- look at what happens when a normal that is slightly offset from Z swings across normal=Z to be slightly offset from z on the other side: the rotation component will abruptly turn by 180 degrees.

As I said before it's impossible to make a choice that doesn't have this problem of erratic behavior (the problem we're trying to solve here can be easily reduced to one that is proven unsolvable by the hairy ball theorem!).

This isn't my primary concern, however: applications are going to have to deal with this too -- there are stateful solutions that are possible (which we can't use), but overall this is a hard problem which we provably cannot solve and which applications will find really hard to solve. Many will probably decide to live with it.


My primary concern is that we're not solving the general case here, and it's probably really hard to solve the general case well. We're exposing an attribute based on a known subset of use cases, which will behave quite badly the moment the user lifts their hand and has the ray hit a wall.

I'm not convinced that "objects on the floor" is so overwhelmingly in the majority of use cases for this as to justify a single transform attribute that is geared to that use case only. To me this feels very much like a utility, one that really belongs in a library alongside other utilities for different kinds of object placement (ones which handle walls! ones which somewhat handle both via the "tip object" scheme I described!), and the reason we can't remove it is backwards compatibility. I still feel uncomfortable about this being the primary path for surfacing the hit test normal.

So I still think that the best path forward is to spec transform the way you did (or some alternate way producing similar results), and then surface position/normal directly, with documentation pointing to those first, and transform very much being a "read the warning labels first" API.

Manishearth commented 4 years ago

@bialpio could you open a PR for standardizing your proposed choice of transform? I think this would at least fix the part of this issue that isn't contentious -- this API needs to be standardized.

One thing we should check: ARKit exposes a pose as well, does anyone know what the details behind that pose are? Is it the same choice as ARCore?

bialpio commented 4 years ago

@bialpio could you open a PR for standardizing your proposed choice of transform? I think this would at least fix the part of this issue that isn't contentious -- this API needs to be standardized.

On it!

One thing we should check: ARKit exposes a pose as well, does anyone know what the details behind that pose are? Is it the same choice as ARCore?

I took a quick look at ARKit's hit test result - there seem to be 3 different ways to obtain the information about the pose out of it (all matrix-based):

The documentation does not seem to explicitly state anything about the properties of the returned matrix, which is not ideal. I might be missing something here though.