Would this be possible to achieve if, using AR, I facilitated the ability for the user to move the live camera to match a specific pose?
Yes, it should be doable. What you are trying to achieve is similar to https://github.com/lagadic/visp/issues/670. But you have to provide an initial cMo sufficiently close to the real pose to let the tracking converge.
About the vpHomogeneousMatrix, you can see these links:
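A minimal sketch of how such an initial cMo could be built from a translation and a theta-u rotation (the values below are placeholders, not ones tuned for your scene):
#include <visp3/core/vpHomogeneousMatrix.h>
// Placeholder initial pose: object about 0.4 m in front of the camera, no rotation.
vpTranslationVector t(0.0, 0.0, 0.4);   // tx, ty, tz in meters
vpThetaUVector tu(0.0, 0.0, 0.0);       // rotation as theta-u (axis-angle), in radians
vpHomogeneousMatrix cMo(t, tu);         // candidate initial pose for the tracker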
@s-trinh Thanks for the information, it was helpful to understand what I need to achieve. I didn't see #670 when searching either.
I am still struggling to get tracking convergence to occur, I keep either getting No data found to compute the interaction matrix...
or a message to adjust the tracker.setGoodMovingEdgesRatioThreshold
. I've tried changing the good moving edges ratio with no luck. Would these messages result from an inaccurate initial pose or could there be another issue?
The initial pose that I am attempting to use is a translation that places the object directly in front of the camera frame (0, -0.08, 0.4) with no rotation - I'm able to visualize said transform to ensure it is correct but tracking isn't able to converge with that initial pose.
It means that your initial pose is maybe too far from the real one to initialize the tracker. It could also be an issue with the CAD model that doesn't have faces with normals going outside the object.
To validate the config files, I suggest that you acquire a video with your device and try to make the tracker work on that video on a classical computer. To this end you can follow the use cases given in that tutorial and adapt the command line options to use your video. At this point, the tutorial-mb-generic-tracker-full example allows initializing the tracking with a user click. When this tutorial is working with your video and object, you can replace the tracker.initClick() by tracker.initFromPose() to see what's going on.
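As an illustration only (the pose values below are the placeholders mentioned earlier in this thread, and init_file stands for whatever .init file the tutorial uses), the swap could look like this:
// Interactive initialization (click-based), as in the tutorial:
// tracker.initClick(I, init_file, true);
// Replaced by initialization from a known pose:
vpHomogeneousMatrix cMo_init(vpTranslationVector(0.0, -0.08, 0.4), vpThetaUVector(0.0, 0.0, 0.0));
tracker.initFromPose(I, cMo_init);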
@fspindle Thanks, I tried using the teabox.cao that comes with the sample with a teabox and got the same result, so I'm thinking it's the initial pose, but I'm not sure. I will try to get a video working with the tutorial like you mentioned and see if I can see what's going on.
Also, when I receive the error messages, do I need to call initWithPose or setPose until the tracker can converge? Essentially, should I be calling initWithPose over and over until it converges, or do I only call that once and then use setPose?
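For what it's worth, a common pattern (just a sketch, assuming a configured vpMbGenericTracker named tracker, a grayscale image I refreshed by a hypothetical grabFrame() capture helper, and an initial pose cMo_init) is to initialize once and then let track() update the pose every frame, re-initializing only when tracking is lost:
bool initialized = false;
vpHomogeneousMatrix cMo;
while (grabFrame(I)) {                   // grabFrame(): hypothetical capture helper
  try {
    if (!initialized) {
      tracker.initFromPose(I, cMo_init); // set the starting pose once
      initialized = true;
    }
    tracker.track(I);                    // updates the pose internally each frame
    tracker.getPose(cMo);                // read back the current estimate
  } catch (const vpException &e) {
    // e.g. "No data found to compute the interaction matrix..."
    initialized = false;                 // try to re-initialize on the next frame
  }
}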
@fspindle @s-trinh Thanks for the help on getting tracking functioning. I have one last issue I'm struggling with. I'm trying to convert to/from the ARKit/SceneKit coordinate system and ViSP to utilize the estimated pose I've received, and I'm struggling to figure out the correct transformation to apply. I'm starting from a SCNMatrix4, which is column-major, and SceneKit (the iOS render engine) is X right, Y up, Z inward. I've looked at the docs for vpHomogeneousMatrix, and what I've attempted is to transpose the matrix (swap the rows and columns) and then multiply by a matrix containing -1 at (1,1) and (2,2) to invert the y and z axes. Does that seem correct to convert to that coordinate system?
There are two different things for me: converting the 4x4 matrix from row-major to column-major, and converting the 4x4 pose returned by the vpMbGenericTracker to the ARKit coordinate system.
About the pose: if you draw the different coordinate systems, to convert the cMo to the ARKit coordinate system you have to premultiply it: aMo = T x cMo
About the row-major to column-major conversion, transposing the matrix should work:
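A minimal sketch of both steps with ViSP types (assuming T is the usual flip of the y and z axes, i.e. a 180-degree rotation about x; verify that this matches your ARKit/SceneKit setup):
vpHomogeneousMatrix T;                 // identity by default
T[1][1] = -1;                          // flip y
T[2][2] = -1;                          // flip z
vpHomogeneousMatrix aMo = T * cMo;     // cMo: pose from the tracker, e.g. tracker.getPose(cMo)

// vpHomogeneousMatrix is stored row-major; write it out column-major,
// which amounts to a transpose when filling a flat 16-float buffer.
float colMajor[16];
for (unsigned int r = 0; r < 4; ++r) {
  for (unsigned int c = 0; c < 4; ++c) {
    colMajor[c * 4 + r] = static_cast<float>(aMo[r][c]);
  }
}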
@s-trinh Thanks for the information, I believe I am able to convert between the coordinate systems correctly, but I'm still having problems aligning virtual content using the estimated pose that was returned. My virtual content ends up further in ARKit's -Z direction than the actual object (as seen in the attached image - the origin of the teabox/model is the bottom back corner of the box).
The amount the result is offset seems to be roughly the same every time I try, so I'm not sure if that could be an issue with camera calibration or if I'm incorrectly applying the estimated pose. To generate the world transform, I'm multiplying cMo by the ARKit camera transform (aMo = cMo * ARKitCameraTransform), which results in the incorrect pose. The projection error is around 5 - 10 (calling tracker.getProjectionError()).
The conversions between ARKit and the camera coordinate system should be correct, as converting back and forth returns the same pose. Do you have any ideas what I could be doing wrong?
The error could be due to wrong camera parameters or parameters not taken into account. A good way to debug is to print the values of the cMo translation part to see if the Z value is correct.
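For example, a small sketch (assuming the vpMbGenericTracker object from the tutorial):
vpHomogeneousMatrix cMo;
tracker.getPose(cMo);
vpTranslationVector t = cMo.getTranslationVector();
std::cout << "cMo translation: " << t[0] << " " << t[1] << " " << t[2] << std::endl;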
The cMo value for Z seems off; it is returning:
Camera translation: 0.0014916211, -0.03319591, 0.0149277635
cMo translation: 0.0002749004, 4.1352218e-05, -0.69968015
Final translation after multiply: 0.0017171719, -0.03318469, -0.6848147
but the actual distance that is measured is closer to -0.45.
For camera parameters I am using the ARKit intrinsics (https://developer.apple.com/documentation/arkit/arcamera/2875730-intrinsics): I am passing in fx and fy as px and py, and I'm using ox and oy for u0 and v0. The ARKit docs state that fx and fy are the pixel focal lengths and ox and oy are "the offsets of the principal point from the top-left corner of the image frame." Do the camera parameters I'm using seem correct?
Can you print the values of fx, fy, ox, oy, cMo?
You should have roughly ox, oy equal to half your image size.
Can you also print ARKitCameraTransform? If it is the pose to convert to the ARKit frame, you should premultiply.
The values for fx, fy, ox, oy seem right. The ARKitCameraTransform is the transform matrix representing the position/rotation of the camera in the ARKit world.
Right now, I'm multiplying the estimated pose that I converted from the camera coordinate system to ARKit by the camera matrix to calculate the transform in ARKit space: Final transform = ARKitCameraTransform * (cMo from tracker after conversion)
@s-trinh Here are the values:
Image size: 1440 px x 1920 px
Intrinsics: px: 1396.25, py: 1396.25, u0: 716.742, v0: 959.121
cMo (from cMo.print()): -0.0008518936692 -0.0004623166191 -0.7077294253 -3.076831363 0.009373553497 0.03675042757
ARKitCameraTransform (showing just tx, ty, tz, i.e. the ARKit camera's position): -0.030184915, 0.002802756, 0.049276568
Once I receive cMo, I apply the conversion we discussed earlier to go from the ViSP camera coordinate system to the ARKit coordinate system to get aMo.
aMo (tx, ty, tz): -0.0008518937, -0.00046231662, -0.7077294
Final translation for virtual content (calculated using ARKitCameraTransform * aMo): -0.1090904, -0.24696982, -0.6081954
The actual Z-value should be around -0.4 or -0.45, which is the measured distance. I'm thinking the camera calibration is off. Do the camera parameters above seem reasonable? If so, does the transformation I'm applying to cMo / aMo seem correct? Thanks again for all the help with this.
Your intrinsic values look coherent: u0 and v0 are approximately half the image size, and px and py are similar.
The cMo values seem odd to me:
cMo.print() = -0.0008518936692 -0.0004623166191 -0.7077294253 -3.076831363 0.009373553497 0.03675042757: why only 6 values? You should have 4x4=16 values; or is it formatted using the translation + rotation vectors convention? If so, one of these values should be tz.
If the issue were only wrong focal lengths (px/py), you should have tz_real = some_factor x tz_estimated, since it should be just a matter of a scaling factor.
The aMo values are also odd to me: assuming you converted cMo using the correct rotation matrix, if the focal length is off (e.g. the camera uses zoom but the intrinsics are without zoom), you should have values off by the same factor for tx, ty, tz.
To display the frame correctly, you should not need the position of the camera ARKitCameraTransform (position of the camera with respect to what?). For OpenGL (or whatever rendering library is used), you should just need to transform the estimated cMo to the OpenGL coordinate system and format the values in memory to respect the column-major order.
For sure, you would need ARKitCameraTransform if you want to transform this pose to another coordinate system (e.g. some IMU coordinate system).
A long time ago, I coded a model-based tracking and augmented reality demo. If it can help, here is some code (the toGl() function should be what you need):
static void drawFrame(ImVec2 p, float scaleX, float scaleY, const vpCameraParameters &cam, const vpHomogeneousMatrix &cMo,
                      float size, ImU32 color, float thickness) {
  vpPoint o(0.0, 0.0, 0.0);
  vpPoint x(size, 0.0, 0.0);
  vpPoint y(0.0, size, 0.0);
  vpPoint z(0.0, 0.0, size);
  o.project(cMo);
  x.project(cMo);
  y.project(cMo);
  z.project(cMo);
  vpImagePoint ip0, ip1;
  ImColor axesColor[3];
  if (color == 0) {
    axesColor[0] = ImColor(255, 0, 0);
    axesColor[1] = ImColor(0, 255, 0);
    axesColor[2] = ImColor(0, 0, 255);
  } else {
    axesColor[0] = color;
    axesColor[1] = color;
    axesColor[2] = color;
  }
  vpMeterPixelConversion::convertPoint(cam, o.p[0], o.p[1], ip0);
  vpMeterPixelConversion::convertPoint(cam, x.p[0], x.p[1], ip1);
  {
    ImVec2 start(p.x + static_cast<float>(ip0.get_u()*scaleX), p.y + static_cast<float>(ip0.get_v()*scaleY));
    ImVec2 end(p.x + static_cast<float>(ip1.get_u()*scaleX), p.y + static_cast<float>(ip1.get_v()*scaleY));
    ImGui::GetWindowDrawList()->AddLine(start, end, axesColor[0], thickness);
  }
  vpMeterPixelConversion::convertPoint(cam, y.p[0], y.p[1], ip1);
  {
    ImVec2 start(p.x + static_cast<float>(ip0.get_u()*scaleX), p.y + static_cast<float>(ip0.get_v()*scaleY));
    ImVec2 end(p.x + static_cast<float>(ip1.get_u()*scaleX), p.y + static_cast<float>(ip1.get_v()*scaleY));
    ImGui::GetWindowDrawList()->AddLine(start, end, axesColor[1], thickness);
  }
  vpMeterPixelConversion::convertPoint(cam, z.p[0], z.p[1], ip1);
  {
    ImVec2 start(p.x + static_cast<float>(ip0.get_u()*scaleX), p.y + static_cast<float>(ip0.get_v()*scaleY));
    ImVec2 end(p.x + static_cast<float>(ip1.get_u()*scaleX), p.y + static_cast<float>(ip1.get_v()*scaleY));
    ImGui::GetWindowDrawList()->AddLine(start, end, axesColor[2], thickness);
  }
}
static glm::mat4 getPerspectiveView(const vpCameraParameters &cam, unsigned int width, unsigned int height)
{
  double fx = cam.get_px();
  double fy = cam.get_py();
  double fovy = 2*atan(0.5*height/fy);
  double aspect = (width*fy)/(height*fx);
  // define the near and far clipping planes
  float zNear = 0.1f;
  float zFar = 100.0f;
  // glm::mat4 proj = glm::perspective((float)fovy, (float)width/(float)height, zNear, zFar);
  glm::mat4 proj = glm::perspective((float)fovy, (float)aspect, zNear, zFar);
  return proj;
}
static glm::mat4 toGl(const vpHomogeneousMatrix &cMo)
{
  // Flip the y and z axes (180-degree rotation about x) to go from the
  // ViSP/OpenCV camera frame to the OpenGL camera frame.
  vpHomogeneousMatrix cvToGl;
  cvToGl[1][1] = -1;
  cvToGl[2][2] = -1;
  vpHomogeneousMatrix gMo = cvToGl * cMo;
  // Copy the rotation and translation, then transpose to get glm's column-major layout.
  glm::mat4 pose;
  for (int i = 0; i < 3; i++) {
    for (int j = 0; j < 3; j++) {
      pose[i][j] = static_cast<float>(gMo[i][j]);
    }
    pose[i][3] = static_cast<float>(gMo[i][3]);
  }
  return glm::transpose(pose);
}
static void drawAxes()
{
  double size = 0.1;
  glPushAttrib(GL_CURRENT_BIT);
  //X = red
  glPushMatrix();
  glColor4d(1,0,0,0.5);
  glBegin(GL_LINES);
  glVertex3d(0, 0, 0); glVertex3d(size, 0, 0);
  glEnd();
  glPopMatrix();
  //Y = green
  glPushMatrix();
  glColor4d(0,1,0,0.5);
  glBegin(GL_LINES);
  glVertex3d(0, 0, 0); glVertex3d(0, size, 0);
  glEnd();
  glPopMatrix();
  //Z = blue
  glPushMatrix();
  glColor4d(0,0,1,0.5);
  glBegin(GL_LINES);
  glVertex3d(0, 0, 0); glVertex3d(0, 0, size);
  glEnd();
  glPopMatrix();
  glPopAttrib();
}
No idea what this code does:
glm::mat4 ProjectionMatrix = getPerspectiveView(cam, I_color.getWidth(), I_color.getHeight());
glm::mat4 ViewMatrix = toGl(cMo*oMmodel*modelMmodelOri);
glm::mat4 ModelMatrix;
glm::mat4 scaling = glm::scale(glm::vec3(0.04f, 0.04f, 0.04f));
glm::mat4 MVP = ProjectionMatrix * ViewMatrix * scaling * ModelMatrix;
// Send our transformation to the currently bound shader,
// in the "MVP" uniform
glUniformMatrix4fv(MatrixID, 1, GL_FALSE, &MVP[0][0]);
glUniformMatrix4fv(ModelMatrixID, 1, GL_FALSE, &ModelMatrix[0][0]);
glUniformMatrix4fv(ViewMatrixID, 1, GL_FALSE, &ViewMatrix[0][0]);
@s-trinh
"cMo.print()=-0.0008518936692 -0.0004623166191 -0.7077294253 -3.076831363 0.009373553497 0.03675042757: why only 6 values? You should have 4x4=16 values, else is it formatted using translation + rotation vectors convention? If so, one of these values should be tz"
This is what cMo.print() gave me, and I believe that -0.7077294253 is tz.
"To display correctly the frame, you should not need the position of the camera ARKitCameraTransform (position of the camera with respect to what?)."
The position/rotation that ARKitCameraTransform represents is the transform from the rendering engine's world space origin. My understanding (which might not be correct) is that I need to multiply by the camera transform because in ARKit, when the AR session starts, the point at which the camera is located in world space becomes the origin of the world, so (0,0,0) is where the camera started. Now, when I get the estimated pose from ViSP, I assume that the pose is relative to the camera, so if I want to know the world space transform in ARKit, I need to multiply cMo by the camera transform to get the position in world space using the relative position from the camera. If I understand correctly, vpMbGenericTracker is returning the transform relative to the camera, and I'm using that to determine the world space transform to apply to my virtual content. Does that seem correct?
I see in the code example you added, in the section where you're not sure what the code does, that there is some multiplication happening other than the toGl function, which could be similar to what I'm referring to. There are some references to variables (i.e. oMmodel, modelMmodelOri) which I'm not sure what they represent but might be helpful.
Your printed tz is negative, which is not possible. And if the focal length were off by a factor, tx, ty should still be coherent, which is not the case.
This is old code of mine. The extra transformations were for handling some fancy display of another 3D model. It is not related to your issue.
From what I understand of the ARKit coordinate system, you should have something like this: ARKitWorld_M_ARKitCam x ARKitCam_M_c x c_M_o.
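A sketch of that chain with ViSP types (the names are illustrative, and the axis flip used for ARKitCam_M_c is an assumption to check against your setup):
// wMa: ARKitWorld_M_ARKitCam, the ARKit camera pose in the ARKit world
//      (from ARKitCameraTransform, converted to ViSP's row-major storage).
// aMc: ARKitCam_M_c, the fixed change of frame between the ARKit camera
//      and the ViSP camera (assumed here to flip the y and z axes).
// cMo: the pose estimated by vpMbGenericTracker.
vpHomogeneousMatrix wMa, aMc, cMo;
aMc[1][1] = -1;
aMc[2][2] = -1;

// Object pose expressed in the ARKit world frame:
vpHomogeneousMatrix wMo = wMa * aMc * cMo;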
I'm trying to create an implementation of the Markerless generic model-based tracking (https://visp-doc.inria.fr/doxygen/visp-daily/tutorial-tracking-mb-generic.html) on iOS.
I'm trying to implement this without the user interactively tapping points on the screen to determine the initial pose. The objects are texture-less, so I don't think I can do keypoint learning either. Would this be possible to achieve if, using AR, I facilitated the ability for the user to move the live camera to match a specific pose? Would this pose be what I could feed into the tracker's initWithPose?
Currently, I set everything up like the tutorial except that I'm calling initWithPose instead of initClick, and when I call track I get an exception: No data found to compute the interaction matrix... The cMo that I'm passing in to initWithPose is just an empty vpHomogeneousMatrix; what should this matrix consist of when using initWithPose?