lagadic / visp

Open Source Visual Servoing Platform
https://visp.inria.fr/
GNU General Public License v2.0

vpMbGenericTracker InitFromPose on iOS / ARKit #1032

Closed arvinkx closed 2 years ago

arvinkx commented 2 years ago

I'm trying to create an implementation of the Markerless generic model-based tracking (https://visp-doc.inria.fr/doxygen/visp-daily/tutorial-tracking-mb-generic.html) on iOS.

I'm trying to implement this without the user interactively tapping points on the screen to determine the initial pose. The objects are texture-less, so I don't think I can do keypoint learning either. Would it be possible to achieve this if, using AR, I had the user move the live camera to match a specific pose? Would that pose be what I could feed into the tracker's initFromPose?

Currently, I set up everything like the tutorial except that I'm calling initFromPose instead of initClick, and when I call track I get an exception No data found to compute the interaction matrix.... The cMo that I'm passing in to initFromPose is just an empty vpHomogeneousMatrix; what should this matrix consist of when using initFromPose?
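For context, here is a minimal sketch of what I understand a non-identity cMo would need to look like (assuming `I` and the tracker are already set up as in the tutorial; the numeric values are placeholders, not a working pose):

```cpp
#include <visp3/core/vpHomogeneousMatrix.h>
#include <visp3/core/vpImage.h>
#include <visp3/mbt/vpMbGenericTracker.h>

vpImage<unsigned char> I;     // current camera frame, filled elsewhere
vpMbGenericTracker tracker;   // configured with config file, CAD model, camera parameters

// Rough guess of the object pose in the camera frame, e.g. the object
// roughly 0.5 m straight in front of the camera with no rotation.
vpHomogeneousMatrix cMo(vpTranslationVector(0.0, 0.0, 0.5),  // tx, ty, tz [m]
                        vpThetaUVector(0.0, 0.0, 0.0));      // theta-u rotation [rad]

tracker.initFromPose(I, cMo); // instead of tracker.initClick(...)
```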

s-trinh commented 2 years ago

> Would it be possible to achieve this if, using AR, I had the user move the live camera to match a specific pose?

Yes, it should be doable. What you are trying to achieve is similar to https://github.com/lagadic/visp/issues/670. But you have to provide an initial cMo that is sufficiently close to let the tracking converge.

About the vpHomogeneousMatrix, you can see these links:

arvinkx commented 2 years ago

@s-trinh Thanks for the information; it was helpful for understanding what I need to achieve. I also hadn't found #670 when searching.

I am still struggling to get tracking to converge; I keep getting either No data found to compute the interaction matrix... or a message to adjust tracker.setGoodMovingEdgesRatioThreshold. I've tried changing the good moving edges ratio with no luck. Would these messages result from an inaccurate initial pose, or could there be another issue?

The initial pose that I am attempting to use is a translation that places the object directly in front of the camera frame (0, -0.08, 0.4) with no rotation. I'm able to visualize that transform to ensure it is correct, but tracking isn't able to converge from that initial pose.
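For reference, the ViSP camera frame has x pointing right, y pointing down and z pointing forward along the optical axis, so "no rotation" assumes the CAD model's axes coincide with that convention. Here is a hedged sketch of the same guess with an explicit rotation term, in case the model frame is oriented differently; the angle is purely illustrative:

```cpp
#include <cmath>
#include <visp3/core/vpHomogeneousMatrix.h>

// Same translation as above, plus a rotation in case the CAD model frame is
// not aligned with the ViSP camera frame (x right, y down, z forward).
// The pi/2 rotation about x is only an example value.
vpHomogeneousMatrix cMo(vpTranslationVector(0.0, -0.08, 0.4),
                        vpThetaUVector(M_PI / 2.0, 0.0, 0.0));
```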

fspindle commented 2 years ago

It means that your initial pose may be too far off to initialize the tracker. It could also be an issue with the CAD model, whose faces may not have their normals pointing outward from the object.

To validate the config files, I suggest that you acquire a video with your device and try to get the tracker working on that video on a classical computer. To this end you can follow the use cases given in that tutorial and adapt the command line options to use your video. At this point, the tutorial-mb-generic-tracker-full tutorial allows initializing the tracking with a user click. Once it is working with your video and object, you can replace tracker.initClick() with tracker.initFromPose() to see what's going on.
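Concretely, once the tutorial runs on your recorded video, the change would look something like this (the pose values are placeholders you would have to measure for your own sequence):

```cpp
// Initialization by click, roughly as done in the tutorial:
// tracker.initClick(I, objectname + ".init", true);

// Pose-based initialization to test instead:
vpHomogeneousMatrix cMo_init(vpTranslationVector(0.0, -0.08, 0.4),
                             vpThetaUVector(0.0, 0.0, 0.0));
tracker.initFromPose(I, cMo_init);
```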

arvinkx commented 2 years ago

@fspindle Thanks, I tried using the teabox.cao that comes with the sample on a real teabox and got the same result, so I'm thinking it's the initial pose, but I'm not sure. I will try to get a video working with the tutorial as you mentioned and see if I can tell what's going on.

Also, when I receive the error messages, do I need to call initFromPose or setPose until the tracker converges? Essentially, should I be calling initFromPose over and over until it converges, or do I call it only once and then use setPose?
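To frame the question, this is the kind of loop I have in mind (grabFrame() and poseGuess are placeholders for my own capture code and current rough estimate): call initFromPose once, then track() every frame, and only re-initialize when track() throws.

```cpp
bool initialized = false;
vpHomogeneousMatrix cMo;

while (grabFrame(I)) {                     // grabFrame() is a placeholder
  try {
    if (!initialized) {
      tracker.initFromPose(I, poseGuess);  // poseGuess: current rough pose estimate
      initialized = true;
    }
    tracker.track(I);
    tracker.getPose(cMo);                  // cMo used to place virtual content
  }
  catch (const vpException &e) {
    std::cout << "Tracking lost: " << e.what() << std::endl;
    initialized = false;                   // try to re-initialize on the next frame
  }
}
```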

arvinkx commented 2 years ago

@fspindle @s-trinh Thanks for the help getting tracking functioning. I have one last issue I'm struggling with: converting between the ARKit/SceneKit coordinate system and ViSP's in order to use the estimated pose I receive. I'm struggling to figure out the correct transformation to apply. I'm converting an SCNMatrix4, which is column-major, and SceneKit (the iOS render engine) uses X right, Y up, Z inward. I've looked at the docs for vpHomogeneousMatrix, and what I've attempted is to swap the rows/columns (transpose) and then multiply by a matrix with -1 at (1,1) and (2,2) to flip the Y and Z axes. Does that seem correct for converting to that coordinate system?
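To make what I'm attempting concrete, this is roughly the conversion I have in mind, written against ViSP types (the helper name and the raw column-major array hand-off are just for illustration):

```cpp
#include <visp3/core/vpHomogeneousMatrix.h>

// Hypothetical helper: 'm' is a 4x4 ARKit/SceneKit pose stored column-major
// (m[col * 4 + row]), e.g. copied out of a simd_float4x4.
vpHomogeneousMatrix arkitToVisp(const float m[16])
{
  vpHomogeneousMatrix M;
  for (unsigned int r = 0; r < 4; r++)
    for (unsigned int c = 0; c < 4; c++)
      M[r][c] = m[c * 4 + r];   // transpose: column-major -> row-major

  // Flip Y and Z: ARKit/SceneKit camera is x right, y up, z toward the viewer;
  // ViSP/OpenCV camera is x right, y down, z forward.
  vpHomogeneousMatrix flipYZ;   // starts as identity
  flipYZ[1][1] = -1;
  flipYZ[2][2] = -1;

  return flipYZ * M;
}
```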

s-trinh commented 2 years ago

There are two different things for me:

- About the 4x4 pose returned by the vpMbGenericTracker: [image]

- About the row-major to column-major conversion, transposing the matrix should work: [image]

arvinkx commented 2 years ago

@s-trinh Thanks for the information. I believe I am now converting between the coordinate systems correctly, but I'm still having problems aligning virtual content using the estimated pose that is returned. My virtual content ends up further in ARKit's -Z direction than the actual object (as seen in the attached image; the origin of the teabox model is the bottom back corner of the box).

[image IMG_3311]

The amount by which the result is offset seems to be roughly the same every time I try, so I'm not sure if this could be an issue with camera calibration or if I'm incorrectly applying the estimated pose. To generate the world transform, I'm multiplying cMo by the ARKit camera transform (aMo = cMo * ARKitCameraTransform), which results in the incorrect pose. The projection error is around 5 - 10 (calling tracker.getProjectionError()).

The conversions between ARKit and the camera coordinate system should be correct, as converting back and forth returns the same pose. Do you have any idea what I could be doing wrong?

fspindle commented 2 years ago

The error could be due to wrong camera parameters, or to parameters not being taken into account. A good way to debug is to print the values of the cMo translation part to see if the Z value is correct.
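For example, something like this (assuming the tracker variable from earlier):

```cpp
vpHomogeneousMatrix cMo;
tracker.getPose(cMo);

// tz should be positive and roughly equal to the measured
// camera-to-object distance in meters.
vpTranslationVector t = cMo.getTranslationVector();
std::cout << "cMo translation (tx ty tz): " << t.t() << std::endl;
```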

arvinkx commented 2 years ago

The cMo value for Z seems off. It is returning:

Camera translation: 0.0014916211, -0.03319591, 0.0149277635
cMo translation: 0.0002749004, 4.1352218e-05, -0.69968015
Final translation after multiply: 0.0017171719, -0.03318469, -0.6848147

but the actual measured distance is closer to -0.45.

For camera parameters I am using the ARKit intrinsics (https://developer.apple.com/documentation/arkit/arcamera/2875730-intrinsics): I am passing in fx and fy as px and py, and I'm using ox and oy for u0 and v0. The ARKit docs state that fx and fy are the pixel focal lengths and that ox and oy are "the offsets of the principal point from the top-left corner of the image frame." Do the camera parameters I'm using seem correct?
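For reference, this is roughly how I'm mapping them onto ViSP (using the no-distortion model, which is my own assumption; the numbers are the ones from my device):

```cpp
#include <visp3/core/vpCameraParameters.h>

// Values copied from ARCamera.intrinsics: fx, fy (pixel focal lengths) and
// ox, oy (principal point offset from the top-left corner).
double fx = 1396.25, fy = 1396.25;
double ox = 716.742, oy = 959.121;

// ViSP perspective projection without distortion: px, py, u0, v0.
vpCameraParameters cam(fx, fy, ox, oy);
tracker.setCameraParameters(cam);
```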

s-trinh commented 2 years ago

Can you print the values of fx, fy, ox, oy and cMo? You should have ox and oy roughly equal to half your image size.

Can you also print ARKitCameraTransform? If it is the pose used to convert to the ARKit frame, you should premultiply.

arvinkx commented 2 years ago

The values for fx, fy, ox, oy seem right. ARKitCameraTransform is the transform matrix representing the position/rotation of the camera in the ARKit world.

Right now, I'm multiplying the estimated pose (converted from the camera coordinate system to ARKit) by the camera matrix to calculate the transform in ARKit space: Final transform = ARKitCameraTransform * (cMo from the tracker, after conversion).

arvinkx commented 2 years ago

@s-trinh Here are the values:

Image Size: 1440px x 1920px

Intrinsics: px: 1396.25 py: 1396.25 u0: 716.742 v0: 959.121

cMo (from cMo.print()): -0.0008518936692 -0.0004623166191 -0.7077294253 -3.076831363 0.009373553497 0.03675042757

ARKitCameraTransform (showing just tx, ty, tz) i.e. the ARKit camera's position: -0.030184915, 0.002802756, 0.049276568

Once I receive cMo, I apply the conversion we discussed earlier to go from the ViSP camera coordinate system to the ARKit coordinate system, which gives aMo.

aMo (tx, ty, tz): -0.0008518937, -0.00046231662, -0.7077294

Final translation for virtual content (calculated using ARKitCameraTransform * aMo): -0.1090904, -0.24696982, -0.6081954

The actual Z value should be around -0.4 or -0.45, which is the measured distance. I'm thinking the camera calibration is off. Do the camera parameters above seem reasonable? If so, does the transformation I'm applying to cMo / aMo seem correct? Thanks again for all the help with this.

s-trinh commented 2 years ago

Your intrinsic values look coherent.

The cMo values seem odd to me.

The aMo values also seem odd to me.

To display the frame correctly, you should not need the camera position ARKitCameraTransform (the position of the camera with respect to what?). For OpenGL (or whatever rendering library is used), you should just need to transform the estimated cMo to the OpenGL coordinate system and lay the values out in memory in column-major order.

For sure, you would need ARKitCameraTransform if you want to transform this pose into another coordinate system (e.g. some IMU coordinate system).


A long time ago I coded a model-based tracking and augmented reality demo. If it can help, here is some code (the toGl() function should be what you need):

```cpp
static void drawFrame(ImVec2 p, float scaleX, float scaleY, const vpCameraParameters &cam, const vpHomogeneousMatrix &cMo,
                      float size, ImU32 color, float thickness) {
    vpPoint o(0.0, 0.0, 0.0);
    vpPoint x(size, 0.0, 0.0);
    vpPoint y(0.0, size, 0.0);
    vpPoint z(0.0, 0.0, size);

    o.project(cMo);
    x.project(cMo);
    y.project(cMo);
    z.project(cMo);

    vpImagePoint ip0, ip1;
    ImColor axesColor[3];
    if (color == 0) {
        axesColor[0] = ImColor(255, 0, 0);
        axesColor[1] = ImColor(0, 255, 0);
        axesColor[2] = ImColor(0, 0, 255);
    } else {
        axesColor[0] = color;
        axesColor[1] = color;
        axesColor[2] = color;
    }

    vpMeterPixelConversion::convertPoint(cam, o.p[0], o.p[1], ip0);

    vpMeterPixelConversion::convertPoint(cam, x.p[0], x.p[1], ip1);
    {
        ImVec2 start(p.x + static_cast<float>(ip0.get_u()*scaleX), p.y + static_cast<float>(ip0.get_v()*scaleY));
        ImVec2 end(p.x + static_cast<float>(ip1.get_u()*scaleX), p.y + static_cast<float>(ip1.get_v()*scaleY));
        ImGui::GetWindowDrawList()->AddLine(start, end, axesColor[0], thickness);
    }

    vpMeterPixelConversion::convertPoint(cam, y.p[0], y.p[1], ip1);
    {
        ImVec2 start(p.x + static_cast<float>(ip0.get_u()*scaleX), p.y + static_cast<float>(ip0.get_v()*scaleY));
        ImVec2 end(p.x + static_cast<float>(ip1.get_u()*scaleX), p.y + static_cast<float>(ip1.get_v()*scaleY));
        ImGui::GetWindowDrawList()->AddLine(start, end, axesColor[1], thickness);
    }

    vpMeterPixelConversion::convertPoint(cam, z.p[0], z.p[1], ip1);
    {
        ImVec2 start(p.x + static_cast<float>(ip0.get_u()*scaleX), p.y + static_cast<float>(ip0.get_v()*scaleY));
        ImVec2 end(p.x + static_cast<float>(ip1.get_u()*scaleX), p.y + static_cast<float>(ip1.get_v()*scaleY));
        ImGui::GetWindowDrawList()->AddLine(start, end, axesColor[2], thickness);
    }
}

static glm::mat4 getPerspectiveView(const vpCameraParameters &cam, unsigned int width, unsigned int height)
{
    double fx = cam.get_px();
    double fy = cam.get_py();
    double fovy = 2*atan(0.5*height/fy);
    double aspect = (width*fy)/(height*fx);

    // define the near and far clipping planes
    float zNear = 0.1f;
    float zFar = 100.0f;

//    glm::mat4 proj = glm::perspective((float)fovy, (float)width/(float)height, zNear, zFar);
    glm::mat4 proj = glm::perspective((float)fovy, (float)aspect, zNear, zFar);

    return proj;
}

static glm::mat4 toGl(const vpHomogeneousMatrix &cMo)
{
    // Flip the Y and Z axes to go from the ViSP/OpenCV camera frame
    // (x right, y down, z forward) to the OpenGL camera frame (x right, y up, z backward).
    vpHomogeneousMatrix cvToGl;
    cvToGl[1][1] = -1;
    cvToGl[2][2] = -1;
    vpHomogeneousMatrix gMo = cvToGl * cMo;

    // Copy the row-major ViSP matrix into a glm::mat4 and transpose it,
    // since glm stores matrices in column-major order.
    glm::mat4 pose;
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            pose[i][j] = static_cast<float>(gMo[i][j]);
        }
        pose[i][3] = static_cast<float>(gMo[i][3]);
    }

    return glm::transpose(pose);
}

static void drawAxes()
{
    double size = 0.1;

    glPushAttrib(GL_CURRENT_BIT);

    //X = red
    glPushMatrix();
    glColor4d(1,0,0,0.5);
    glBegin(GL_LINES);
    glVertex3d(0, 0, 0); glVertex3d(size, 0, 0);
    glEnd();
    glPopMatrix();

    //Y = green
    glPushMatrix();
    glColor4d(0,1,0,0.5);
    glBegin(GL_LINES);
    glVertex3d(0, 0, 0); glVertex3d(0, size, 0);
    glEnd();
    glPopMatrix();

    //Z = blue
    glPushMatrix();
    glColor4d(0,0,1,0.5);
    glBegin(GL_LINES);
    glVertex3d(0, 0, 0); glVertex3d(0, 0, size);
    glEnd();
    glPopMatrix();

    glPopAttrib();
}
```

I don't remember exactly what this part of the code does:

```cpp
glm::mat4 ProjectionMatrix = getPerspectiveView(cam, I_color.getWidth(), I_color.getHeight());
glm::mat4 ViewMatrix = toGl(cMo*oMmodel*modelMmodelOri);
glm::mat4 ModelMatrix;
glm::mat4 scaling = glm::scale(glm::vec3(0.04f, 0.04f, 0.04f));
glm::mat4 MVP = ProjectionMatrix * ViewMatrix * scaling * ModelMatrix;

// Send our transformation to the currently bound shader,
// in the "MVP" uniform
glUniformMatrix4fv(MatrixID, 1, GL_FALSE, &MVP[0][0]);
glUniformMatrix4fv(ModelMatrixID, 1, GL_FALSE, &ModelMatrix[0][0]);
glUniformMatrix4fv(ViewMatrixID, 1, GL_FALSE, &ViewMatrix[0][0]);
```

arvinkx commented 2 years ago

@s-trinh

> cMo.print()=-0.0008518936692 -0.0004623166191 -0.7077294253 -3.076831363 0.009373553497 0.03675042757: why only 6 values? You should have 4x4=16 values, else is it formatted using translation + rotation vectors convention? If so, one of these values should be tz

This is what cMo.print() gave me and I believe that 0.7077294253 is tz
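For reference, as far as I can tell print() outputs the pose as [tx ty tz tux tuy tuz] (translation plus theta-u rotation), while streaming the matrix shows the full 4x4:

```cpp
vpHomogeneousMatrix cMo;
tracker.getPose(cMo);

cMo.print();                   // one line: tx ty tz tux tuy tuz
std::cout << std::endl;
std::cout << cMo << std::endl; // full 4x4 homogeneous matrix
```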

> To display the frame correctly, you should not need the camera position ARKitCameraTransform (the position of the camera with respect to what?).

The position/rotation that ARKitCameraTransform represents is the transform from the rendering engine's world space origin. My understanding (which might not be correct) is that I need to multiply by the camera transform because, in ARKit, when the AR session starts, the point where the camera is located becomes the origin of the world, so (0,0,0) is where the camera started. Now, when I get the estimated pose from ViSP, I assume that the pose is relative to the camera, so if I want to know the world space transform in ARKit, I need to multiply cMo by the camera transform to get the world space position from the camera-relative one. If I understand correctly, vpMbGenericTracker returns the transform relative to the camera, and I'm using that to determine the world space transform to apply to my virtual content. Does that seem correct?

I see that in the code example you added, in the section where you're not sure what the code does, there is some multiplication happening beyond the toGl function, which could be similar to what I'm referring to. There are references to variables (i.e. oMmodel, modelMmodelOri) that I'm not sure what they represent, but they might be relevant.

s-trinh commented 2 years ago

Your printed tz is negative, which is not possible. And if the focal length were off by a factor, tx and ty should still be coherent, which they are not.

This is old code of mine. The extra transformations were for handling some fancy display of another 3D model; it is not related to your issue.

From what I understand of the ARKit coordinate system, you should have something like this: ARKitWorld_M_ARKitCam x ARKitCam_M_c x c_M_o.
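In ViSP types, that chain would look like the sketch below (the Y/Z flip is the ARKit-camera-to-ViSP-camera change of convention discussed above; the other transforms have to come from your own code):

```cpp
vpHomogeneousMatrix ARKitWorld_M_ARKitCam;  // ARKit camera pose in the ARKit world (from ARKit)
vpHomogeneousMatrix ARKitCam_M_c;           // ARKit camera frame -> ViSP camera frame (y/z flip)
ARKitCam_M_c[1][1] = -1;
ARKitCam_M_c[2][2] = -1;
vpHomogeneousMatrix c_M_o;                  // pose estimated by vpMbGenericTracker (tracker.getPose())

// Pose of the object in the ARKit world frame:
vpHomogeneousMatrix ARKitWorld_M_o = ARKitWorld_M_ARKitCam * ARKitCam_M_c * c_M_o;
```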