microsoft / MixedReality-WorldLockingTools-Unity

Unity tools to provide a stable coordinate system anchored to the physical world.
https://microsoft.github.io/MixedReality-WorldLockingTools-Unity/README.html
MIT License

How does the WLT adjustment impact pose data shared across a network? #264

Open · genereddick opened this issue 2 years ago

genereddick commented 2 years ago

A couple questions, let me know if there is a better place to ask these:

  1. A device, let's say an iPhone, has WLT and its camera pose is being tracked and sent over a network to a computer (not an AR device, so no WLT). Should this be the world pose, thus including the WLT adjustment, the local pose, or some other property (e.g. FrozenFromSpongy)?

  2. Does this change if the second device is also running WLT?

  3. Does this change for objects other than the camera, i.e. objects not nested under the WLT_Adjustment transform? For example, if I want to update some other asset's pose over the network?

  4. With the MRTK and an AR experience that would not normally need it, does Teleport need to be enabled?

fast-slow-still commented 2 years ago

This is a fine place to post questions.

First some background. Unless some form of alignment/synchronization between devices is performed, the numerical values of a pose on one device are near meaningless on another device. I say "near meaningless", because relative poses (the transform between objects' spaces) can be useful.

WLT offers support for aligning/synchronizing coordinate spaces between devices. Here are three samples:

  1. Sharing spatial info via ASA.
  2. Synchronizing spaces by scanning common QR codes.
  3. Manual alignment of spaces.

Now, to your questions:

A device, let's say an iPhone, has WLT and its camera pose is being tracked and sent over a network to a computer (not an AR device, so no WLT). Should this be the world pose, thus including the WLT adjustment, the local pose, or some other property (e.g. FrozenFromSpongy)?

Think of WLT as applying a correction to the camera transform via the WLT_Adjustment object, to make the camera's pose more consistent and stable. Synchronization of spaces across devices also happens here. In general, then, this corrected camera pose is more useful than the uncorrected pose. While there are exceptions, unless you have some specific need, use the camera's global pose.

Does this change if the second device is also running WLT?

No, but again, in the ideal, you have the coordinate spaces between the two devices aligned, so that if the camera being at (0,0,0) on device A means that it is in the doorway, the camera being at (0,0,0) on device B also means that it is in the doorway.

Does this change for objects other than the camera, i.e. objects not nested under the WLT_Adjustment transform? For example, if I want to update some other asset's pose over the network?

Again, there are many scenarios where what your application needs to know about an object is its relative transform. But when you want to record where your hologram is in the world, the global pose is the right thing to record and transmit.

With the MRTK and an AR experience that would not normally need it, does Teleport need to be enabled?

No, you don't need to enable teleport.

By the way, when I say "global" pose, I think I'm meaning the same thing as when you say "world" pose. Just make the substitution in your head.
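To make that concrete, here is a minimal sketch of reading the camera's corrected global pose each frame and handing it to a networking layer. It assumes the usual WLT setup where the camera sits under the WLT_Adjustment hierarchy; SendPose is a hypothetical placeholder for whatever transport you actually use.

    using UnityEngine;

    // Minimal sketch: read the WLT-corrected camera pose each frame and hand it
    // to the networking layer. Because WLT applies its correction on a parent
    // object (WLT_Adjustment), the camera's global position/rotation already
    // includes that correction.
    public class CameraPoseSender : MonoBehaviour
    {
        void Update()
        {
            Transform cam = Camera.main.transform;
            Vector3 worldPosition = cam.position;       // global ("world") pose, WLT adjustment included
            Quaternion worldRotation = cam.rotation;

            SendPose(worldPosition, worldRotation);
        }

        void SendPose(Vector3 position, Quaternion rotation)
        {
            // Hypothetical placeholder: replace with your actual network transport.
            Debug.Log($"Camera world pose: {position} / {rotation.eulerAngles}");
        }
    }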

genereddick commented 2 years ago

Thanks! A few clarification questions:

Understood about aligning devices, but in a case where you need to align movement between a device using WLT and a device that is not world locked, the pose that includes the WLT adjustment is the pose we want to share?

Regarding aligning devices using QR codes: many of the examples seem to assume that the real-world placement and relationships (distances, etc.) of multiple QR codes are known in advance, with holographic assets placed at matching virtual positions.

How about a scenario where you can use QR Codes but you do not know their relationships in advance (say a QR code is placed on the ground and two are placed on stands in arbitrary positions / rotations in a room, or in a field).

fast-slow-still commented 2 years ago

A SpectatorView like scenario involves some alignment mechanism between the HoloLens and the video camera. The exact transform to transmit depends on that exact alignment mechanism.

On the dynamic QR code placement, I haven't done that, but I know that it has been done using WLT. Understand what the SpacePin feature is letting you do. It is letting you say, "I want to re-position (and orient) the global coordinate space so that the physical point here that currently has this pose, will instead have that pose." Here "this pose" is the FrozenPose (or LockedPose or SpongyPose), and "that pose" is the VirtualPose (or ModelingPose). So one way or another you have to come up with two corresponding poses (for each SpacePin), one saying what the coordinates are now, and the other saying what you want the coordinates to be.

WLT samples cover a number of ways of establishing these pose pairs, but ultimately it's up to the client application (that's you).

Does it matter if you scan a single QR code or multiple -- does using more than one QR code increase the alignment accuracy across devices?

This is a great question. Yes, multiple SpacePins increases the alignment accuracy, in several ways.

First off, the position data from scanning a QR code is much more accurate than the orientation data. But when you use the positions from two or more scanned QR codes, WLT can discard any scanned orientation data, and infer orientation from the relative positions. Because of the lever arm effect, the improved orientation at the SpacePins means improved positioning away from the SpacePins (assuming you're using SpacePinOrientable).

Secondly, with a single SpacePin, tracking error will accumulate as you move away from the SpacePin. With multiple SpacePins, you can bound the error as the system interpolates from one established position to another. See this article for more details.

genereddick commented 2 years ago

@fast-slow-still "A SpectatorView like scenario involves some alignment mechanism between the HoloLens and the video camera. The exact transform to transmit depends on that exact alignment mechanism."

In our case, similar to SpectatorView, we have Camera Extrinsics from a calibration step that captures the offset between the camera and HL. For the actual HL pose we are getting the pose from the Windows.MixedReality.SpatialLocator lib:

    unityCoordinateSystem = spatialLocator.CreateStationaryFrameOfReferenceAtCurrentLocation().CoordinateSystem;
    ...
    SpatialLocation headPose = spatialLocator.TryLocateAtTimestamp(timestamp, unityCoordinateSystem);

    rotation = headPose.Orientation;
    position = headPose.Position;
    ...
    // convert from the right-handed coordinate system to left-handed to get the hololensPose
    ...etc.
    return hololensPose;

My assumption was that the hololensPose output (+ extrinsics offset) from the spatialLocator would then also need to be updated with the pose from the WLT_Adjustment transform before sending it to the PC, but I'm not sure?

fast-slow-still commented 2 years ago

Ah, I see. Yes, I agree, the global camera pose (updated via WLT_Adjustment) is the one you want. It will be more consistent with the actual offset between the HL and the camera in physical space.
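In case it helps, here is a rough sketch of that conversion, using FrozenFromSpongy the same way it appears elsewhere in this thread. The hololensPose argument is assumed to be the head pose you already compute from the SpatialLocator (converted to Unity's left-handed conventions), and your camera extrinsics offset would still be applied on top, as you do today; namespace and property names are from memory.

    using UnityEngine;
    using Microsoft.MixedReality.WorldLocking.Core; // WorldLockingManager and Pose extensions

    public static class HeadPoseCorrection
    {
        // Sketch only: maps a head pose obtained outside Unity's scene graph
        // (raw tracker, i.e. "spongy", space) into WLT's corrected frozen space,
        // which is the same space the camera's global transform reports in.
        public static Pose ToFrozenSpace(Pose hololensPose)
        {
            Pose frozenFromSpongy = WorldLockingManager.GetInstance().FrozenFromSpongy;
            return frozenFromSpongy.Multiply(hololensPose);
        }
    }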

fast-slow-still commented 2 years ago

How about a scenario where you can use QR Codes but you do not know their relationships in advance (say a QR code is placed on the ground and two are placed on stands in arbitrary positions / rotations in a room, or in a field).

I've been thinking about this. As I said, there are ways to make this work, but if you don't know ahead of time the relationship between the QR code placements in physical space, it makes me wonder if QR codes are really the right synchronization mechanism?

But, sticking with QR codes for a minute, how could you make them work in your dynamic scenario? The following workflow might be instructive and give you some ideas. I've got to make some assumptions about the application's requirements, and those assumptions might be totally inappropriate for your case, but again, this workflow is meant more to be instructive than a recipe for success.

Assumptions:

  1. You'll have one person do the setup (the curator). They have to do extra work.
  2. The process is extremely easy for all other users (the clients). They don't have to do anything harder than scan QR codes.
  3. You have some way of passing data between devices, at least offline during setup.
  4. No communication between devices is available at runtime.

The workflow:

  1. Curator affixes a small number of printed QR codes in the physical environment at arbitrary locations.
  2. Curator scans the QR codes and gets a global space position for each.
  3. These QR code name/position pairs are stored in a file which the clients can read at runtime (or baked into the application, or whatever).
  4. (Note that these position coordinates are entirely arbitrary, depending only on where the curator was at startup. If desired, SpacePins could be used by the curator to manually align the coordinate space to the physical environment before scanning the QR codes.)
  5. Clients go through a calibration phase at startup (see the sketch below), which consists of:
     a. Scan some or all of the QR codes.
     b. For each scanned QR code, get a global space position.
     c. Look up the QR code name in the table of name/position pairs.
     d. The position just scanned is the FrozenPose (what the pose is now). The position from the table is the VirtualPose (what we want the pose to be).
     e. Create a SpacePinOrientable and feed it that FrozenPose/VirtualPose pair.

At the end of all this, all of the clients will have coordinate spaces aligned with each other and with the curator.
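Here is a rough sketch of what step 5 of the workflow could look like in code, under the assumptions above. The curatedPoses table stands in for the curator's name/pose data (however you choose to load it), the Orienter wiring that SpacePinOrientable needs in a real scene is omitted, and the namespaces are from memory -- so treat this as a shape, not a drop-in implementation.

    using System.Collections.Generic;
    using UnityEngine;
    using Microsoft.MixedReality.WorldLocking.Core; // SpacePinOrientable, Transform pose extensions

    public class QrClientCalibration : MonoBehaviour
    {
        // Name -> curated (desired) pose table produced by the curator; how it is
        // loaded (file, baked asset, network) is up to the application.
        public Dictionary<string, Pose> curatedPoses = new Dictionary<string, Pose>();

        // Call this when a QR code is scanned: qrName identifies the code,
        // scannedGlobalPose is where the scan says it is right now.
        public void OnQrCodeScanned(string qrName, Pose scannedGlobalPose)
        {
            if (!curatedPoses.TryGetValue(qrName, out Pose virtualPose))
            {
                return; // not one of the curated codes
            }

            // One SpacePinOrientable per QR code, positioned at the curated (desired) pose.
            var pinObject = new GameObject($"SpacePin_{qrName}");
            pinObject.transform.SetGlobalPose(virtualPose);   // VirtualPose / ModelingPose: what we want it to be
            var pin = pinObject.AddComponent<SpacePinOrientable>();
            pin.ResetModelingPose();                          // capture the desired pose
            pin.SetFrozenPose(scannedGlobalPose);             // FrozenPose: what it is now
        }
    }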

Let me know if I left out any important details, but I've probably gone on too long already.

genereddick commented 2 years ago

Thanks for all the above!

"I've been thinking about this. As I said, there are ways to make this work, but if you don't know ahead of time the relationship between the QR code placements in physical space, it makes me wonder if QR codes are really the right synchronization mechanism?"

What are the synchronization alternatives? A few parameters would be:

I couldn't see a way to place or raycast markers in the room (not possible with the camera-mounted HoloLens). It may be possible to pre-scan a space with a head-mounted HoloLens and upload it to the other camera-mounted HoloLenses, but I didn't think this would work for iOS devices -- though as long as they can be aligned with the HoloLenses, having different strategies for different devices would be OK.

It seemed like QR codes were the best option given the above (if accurate enough maybe even just a single QR code), but would love to find any better solution possible.

zantiu commented 2 years ago

An external tracking solution like VR-gaming solutions use?

fast-slow-still commented 2 years ago

@genereddick , it might help me understand your situation better if you point out the specifics where the dynamic QR code workflow I proposed above doesn't fit your requirements?

genereddick commented 2 years ago

@fast-slow-still First, I want to say that WLT is amazing, just a huge improvement overall on a single device with almost no effort. However, for more complex scenarios my understanding remains, well, spongy is probably an appropriate word.

I have implemented your suggestions, more or less, and I have a few additional clarification questions:

For this, I'm testing with 3 devices:

In this scenario, I want the devices to agree on a fixed world origin and orientation. We will instantiate an object at the pose of QR Code 1, then it will move to QR Code 2, then to QR Code 3. Each iPhone should see it at the same pose, and the same on the PC, relative to the origin.

Question 1: Is there a way to change the world origin on each device to a fixed pose? For example, I start iPhone 1 and its world origin is at position (0,0,0). I start iPhone 2 rotated 90 degrees from iPhone 1, and its world origin is also (0,0,0). I scan a QR code one meter in front of iPhone 1 in Z (0,0,1), but on iPhone 2 it is at (-1,0,0). Is there a way to make that new position the origin, so that the WLT global position on both devices reports it as (0,0,0), while iPhone 1 now says that its camera is at (0,0,-1) and iPhone 2 says its camera is at (1,0,0)? The intention here is that each device reports its WLT pose relative to the shared origin.

Step 1: To start, I place the physical QR code images in random places in a room, turn on iPhone 1, run a curation step, and scan each QR code. The image recognition sees the QR code image and places a visualization asset on top. I record this location, so for QR_Code_1:

var globalPose = trackedImage.transform.GetGlobalPose();

and save/send it to iPhone 2 and the PC, and set the matching gameObject on iPhone 1 in the scene to that pose:

QR_1_GO.transform.SetGlobalPose(globalPose);

Question 2: How do you know whether an asset's pose already includes the WLT adjustment versus being the raw pose from the platform APIs? For example, trackedImage transforms are instantiated and positioned by the AR Foundation ARTrackedImageManager. Does this imply that on each update the trackedImage global pose would need to be converted to the correct WLT frozen pose?

 Pose frozenPose = WorldLockingManager.GetInstance().FrozenFromSpongy.Multiply(trackedImage.transform.GetGlobalPose());

Question 3: Is there a way to test if a particular pose (from image tracking or other) is from the raw API or has already been adjusted by WLT, say by comparing a pair of poses?

Step 2: The other devices receive the curated pose positions (either on start or when updated) and set their own local instances of QR_n_GO to the same virtual positions.

Step 3: In parallel, iPhone 1 sets the SpacePin at each QR_n_GO location.

QR_1_GO.GetComponent<SpacePinOrientable>().SetFrozenPose(frozenPose);

Question 4: Is modeling pose the same as virtual pose? Does the SpacePin modeling pose need to be reset in some way prior to assigning it a new modeling / virtual pose? When would you need to use ResetModelingPose()?

Step 4: iPhone 2 now scans the QR Codes. Each matching gameObject already has its virtual pose set and now we just have to set the frozen pose to the pose from the scanned QR Code:

QR_1_GO.GetComponent<SpacePinOrientable>().SetFrozenPose(frozenPose);

Question 5: Does anything change in terms of the poses sent to the PC?

Additional Questions:

My mental model is that, for placement accuracy, I want SpacePins at positions where I will place assets or perform other actions I care about. And I want to scan the area so that there are anchors around where assets will be placed -- so with the visualizer tools I want to see anchor markers bounding all the positions where I have placed SpacePins (at a minimum).

Question 6: Is this the right way to think about anchors (created by default by WLT) / pins (are they actually the same thing, just one is automated and the other is user positioned)? Or should pins also bound the areas I care about (and, if so, how do they differ from anchors in usage)?

Persistence on Mobile devices. The docs say that automatic persistence only works on HoloLens or if you are using ASA.

Question 7: Is this some fundamental limitation due to the way the WLT creates and stores anchors internally? Just outside of the scope of the WLT? Assuming the latter, any high level thoughts on how you could persist the data on mobile?

Align Subtree:

Question 8: If the devices need to have a known shared world pose (rather than one calculated at run time from the positions of randomly placed SpacePins), does this imply that it is better to use the subtree method, so that one QR Code defines the world pose and all other QR Codes are defined in relation to that parent?

fast-slow-still commented 2 years ago

Hi @genereddick, I've been out of the office on vacation, and won't be back to work until March 21, but those are excellent questions, and I will try to get answers to them written up here today.

fast-slow-still commented 2 years ago

Question 1:

I'm not sure I understand this question. Your description seems exactly like what SpacePins do. Have you tried some of the SpacePin samples? Even ones that aren't appropriate for your scenario might give you a better feel for what SpacePins do.

It's important to understand that the only difference between the SpacePin examples is the UX.

The system needs to know what point in space you are talking about. That's the FrozenPose (or equivalently SpongyPose/LockedPose, since they all differ by known transforms). And it needs to know what coordinates you would like to assign to that point in space (position & orientation). That's the VirtualPose, which is the same as the ModelingPose.

So I might have better named them FrozenPose ==> CurrentPose, and VirtualPose ==> DesiredPose. But I didn't, so you'll have to translate in your head. Sorry about that.

So the only difference between the SpacePins sample and the RayPins sample is that in the SpacePins sample you grab the SpacePins and manipulate them by hand into the current location. In the RayPins sample, you click on the environment's spatial mesh and do a ray cast to find the current location. With the QRSpacePins sample, you scan a QR code to get the current location.

In all those samples, the desired pose is the pose the corresponding Transform is given in the Unity scene. Once the current pose we are talking about is established by one of the mechanisms above (or even some other mechanism you come up with), then all of the samples are the same: they pass the "current pose" == FrozenPose and the "desired pose" == VirtualPose == ModelingPose as a pair to the AlignmentManager.

The point is that even a sample that doesn't collect the "current pose" the same way you intend to (e.g. RayPins collects it by ray cast but you want to collect it from QR codes) might help you get a feel for what is going on, and give you a better sense of what to expect.

Because your example of iPhone 1 and iPhone 2 is, if I am understanding it correctly, exactly what you should expect.

fast-slow-still commented 2 years ago

Question 2:

Any API should be reporting things in Unity's current global coordinate space, or else document what coordinate space it is reporting in. Sadly, this is not the case. Any API that ignores the camera's global transform (including its parent's transform) is incorrect, and in a perfect world would be fixed.

However, we don't live in a perfect world, so there are APIs that report in the space relative to the camera's parent but think they are reporting in Unity's global coordinates. These APIs usually don't document what coordinate space they are reporting in. I have dealt with APIs that aren't even consistent about what space they report in (e.g. some functions return coordinates in Unity global space while other functions return coordinates relative to the camera).

The owners of APIs that are inconsistent are usually quite interested in learning that, and pretty quick to fix them. So if you see inconsistent behavior, please do report it.

But for the most part, I find it safest to test APIs that I will depend on to verify what space they are working in. The easiest way to do that is to set up the simplest possible project using the API to be tested (no WLT). Have the camera attached to a dummy parent object. Run your test app and use your API. Then offset the camera's dummy parent by some amount (e.g. (20,30,40)) and rerun your test with the same camera starting position/orientation. Compare the results from the first run against the offset run.

Take your ARTrackedImageManager, for example. If you get back the same pose in both tests, it is ignoring the camera's parent pose, and you need to use FrozenFromSpongy to transform its return values into Unity's global coordinate space. If on the second run the returned pose is offset by (20,30,40), then it is taking the camera's global pose into account, and no correction is necessary.

Sorry there isn't a more automated way.
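If it helps, here is a rough harness for the test described above, with no WLT in the scene. Attach it to any object, assign the ARTrackedImageManager and the camera's dummy parent in the inspector, run once with applyOffset off and once with it on, and compare the logged poses. This is just a sketch of the procedure, not a polished tool.

    using UnityEngine;
    using UnityEngine.XR.ARFoundation;

    public class ApiSpaceProbe : MonoBehaviour
    {
        public ARTrackedImageManager trackedImageManager;
        public Transform cameraDummyParent;
        public Vector3 parentOffset = new Vector3(20, 30, 40);
        public bool applyOffset = false;

        void OnEnable()
        {
            if (applyOffset)
            {
                cameraDummyParent.position += parentOffset;
            }
            trackedImageManager.trackedImagesChanged += OnTrackedImagesChanged;
        }

        void OnDisable()
        {
            trackedImageManager.trackedImagesChanged -= OnTrackedImagesChanged;
        }

        void OnTrackedImagesChanged(ARTrackedImagesChangedEventArgs eventArgs)
        {
            foreach (var trackedImage in eventArgs.added)
            {
                // If this position shifts by parentOffset between the two runs, the API is
                // honoring the camera's global pose. If it stays the same, it is reporting
                // relative to the camera's parent and needs the FrozenFromSpongy correction.
                Debug.Log($"trackedImage position: {trackedImage.transform.position}");
            }
        }
    }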

fast-slow-still commented 2 years ago

Question 3

I think I just answered this with Question 2. But to emphasize, WLT only adjusts the camera's transform. Other APIs can either take the camera's full global pose into account, or ignore it and operate relative to the camera's local pose. They should document which they are doing, but you should verify.

fast-slow-still commented 2 years ago

Question 4:

Yes, VirtualPose == ModelingPose; they are the same thing.

The mechanics changed slightly under the hood with release v1.5.8, to make changing the modeling pose at runtime more robust, so I do recommend v1.5.8.

The modeling pose is captured at Start() with a call to ResetModelingPose(). If you change the SpacePin object's local pose, you must call ResetModelingPose() for it to be captured, and then SetFrozenPose() (or SpongyPose or LockedPose) for it to take effect.

Pre-v1.5.8, if you changed the SpacePin's parent's pose, then you also needed ResetModelingPose() as above. But from v1.5.8 onward, this is no longer necessary.
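To spell out that order of operations, here is a small sketch of re-pinning a SpacePin at runtime (v1.5.8+ assumed, and using the SetGlobalPose extension the same way it appears elsewhere in this thread):

    using UnityEngine;
    using Microsoft.MixedReality.WorldLocking.Core;

    public static class SpacePinRepinning
    {
        // desiredGlobalPose: where you want the pin to be in virtual (modeling) coordinates.
        // currentGlobalPose: where that physical point currently is (frozen space).
        public static void Repin(SpacePin spacePin, Pose desiredGlobalPose, Pose currentGlobalPose)
        {
            spacePin.transform.SetGlobalPose(desiredGlobalPose); // move the pin's transform first
            spacePin.ResetModelingPose();                        // re-capture the modeling (virtual) pose
            spacePin.SetFrozenPose(currentGlobalPose);           // then apply, so the change takes effect
        }
    }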

fast-slow-still commented 2 years ago

Question 5:

I don't think so, but see answer to Question 2.

fast-slow-still commented 2 years ago

Question 6:

No, not really. The right way to think of it is that anchors are used to stabilize the global space relative to the physical world. They don't give you any control over what the coordinates are at a given location, they only promise that the coordinates there won't change.

SpacePins don't do anything to stabilize the coordinate space (because the underlying anchor graph has already done that). SpacePins just give you control over what the coordinates are at selected locations.

So, it's best to scan the area where the users are going to be. It's not important to scan the areas where the assets will be placed. Often, those are the same areas, but it's really just where the users are that's important.

The big difference between the anchor and SpacePin distribution through the area is that you want the anchor graph to be where the users will be, but you want the SpacePins to form a border around where the users will be.

This is because the system basically uses the anchor(s) closest to the user for reference, but interpolates between SpacePins.

fast-slow-still commented 2 years ago

Question 7:

Persistence on Mobile. That's a great question. Persistence on Android doesn't appear to be an option yet, except through APIs that would be redundant with ASA. Persistence on iPhone (ARKit) seems to be possible, but I haven't investigated it thoroughly. I'll add a work item to myself to at least scope the work so it can be prioritized onto the roadmap.

fast-slow-still commented 2 years ago

Question 8:

No, you are better off using all of your SpacePins to define the global space. The AlignSubtree should only be used where it's required that a sub-set of your scene have a different coordinate space than the global.

fast-slow-still commented 2 years ago

Feel free to ask for clarifications on any of this, but keep in mind that I am out on vacation until Monday. Have a great week!

fast-slow-still commented 2 years ago

#274

genereddick commented 2 years ago

Re Question 2. Per your guidance, to test, first I got the pose of the trackedImage with WLT active:

void OnTrackedImagesChanged(ARTrackedImagesChangedEventArgs eventArgs)
{
    foreach (var trackedImage in eventArgs.added)
    {
        // etc.
and got:

    trackedImage position: -0.75, -0.77, -0.50

Then, leaving most things in place, I deactivated the WLT Manager and put a position of 10, 10, 10 on the WLT_Adjustment transform. On getting a tracked image, I compared the GlobalPose of the trackedImage with the GlobalPose of the camera, the camera's parent (MixedRealityPlayspace), and the parent's parent (WLT_Adjustment). The results were:

    WLT_Adjustment position: 10, 10, 10
    MixedRealityPlayspace position: 10, 10, 10
    Camera position: 9.51, 10.06, 9.50 (having moved the phone a little bit from start to see the image)

but, the pose of the trackedImage was the same:

trackedImage position: -0.75, -0.77, -0.50

So, if I understand correctly, this means that AR Foundation is setting the trackedImage global pose based on an offset from the camera's local pose, rather than from the camera's global pose?

In which case, when I go to set the position of the QR Code, I need to do something like:

var spongyPose = trackedImageTransform.GetGlobalPose();
Pose frozenPose = WorldLockingManager.GetInstance().FrozenFromSpongy.Multiply(spongyPose);
// Set QR code position to frozenPose

And any transform poses associated with the trackedImage would need to be similarly adjusted.

Does this seem correct?

fast-slow-still commented 2 years ago

@genereddick , yes, your work looks correct to me.

genereddick commented 2 years ago

I still feel like I'm missing something, or perhaps my expectations are unrealistic.

I have a scene with 5 images placed randomly around the room. The scene has gameObjects -- Pin 0 - 5 -- each with a child axis and a SpacePin component. I start a device, detect an image position, and instantiate a new axis (and marker) to follow the image as it is updated by AR Foundation. I immediately pin it.

// I instantiate a displayAxis gameObject at the same position as the AR Foundation tracked image;
// its axis is centered over the image.
displayAxis.transform.SetGlobalPose(trackedImage.transform.GetGlobalPose());

// I set the SpacePin; at this point both axes (for the image and the space pin) should overlap.
spacePin.transform.SetGlobalPose(displayAxis.transform.GetGlobalPose());
spacePin.ResetModelingPose();
spacePin.SetFrozenPose(displayAxis.transform.GetGlobalPose());

// If the image tracking determines an image has moved -- which happens fairly frequently -- I repeat the process,
// including calling ResetModelingPose.

I repeat the process above with each image. However, as I add more images I start to get more divergence between paired axes than I would have expected. These are two images placed on walls about 5 meters apart; if perfectly accurate, both axes would completely overlap:

[screenshots]

A few things I have considered:

  1. I need to do a more thorough scan to get a better spatial map and accurate distances around the room, as well as scan close to the surface of the image to deal with the ARKit image tracking quality?

  2. I initially assumed that the trackedImage.transform was being created in spongy space and so converted it to frozen, but the code below seemed to create greater inaccuracy, so I went back to assuming the tracked image is in frozen space -- but maybe I'm not getting/setting the correct pose values?

    var frozenPose = wlt.FrozenFromSpongy.Multiply(trackedImage.transform.GetGlobalPose());
    displayAxis.transform.SetGlobalPose(frozenPose);

  3. Perhaps the flow is wrong: currently I'm setting the pin's global position, calling ResetModelingPose, then immediately pinning (all the pins start at the origin). Maybe I need to curate -- set all the global positions / resets -- before I pin any of them?

  4. Maybe it's something to do with the rotations; as you mentioned, the axis out of the image/QR code is less accurate, and perhaps using the rotation values is adding error?

Another example:

[screenshots]

Interestingly, with only a single image / pin the accuracy was actually higher -- in that case the difference was about the width of a single MacBook keyboard key.

[screenshots]