Adapting to a different camera model

dprov commented 2 years ago

Hi, thanks for making this repo available! I've been trying to get this to work for a different camera model (cv.omnidir in OpenCV contrib), so I've spent quite a bit of time trying to understand what's going on overall and generating the vector fields. I'm nearly there, but I have a few questions that are still puzzling me. If you can help clear that up a bit, that would be very appreciated.

1) In https://gist.github.com/menandro/b829667f616e72aded373479aca61770, when generating vectorxforward and vectorxbakward, a mix of K0 and D1 is used and that confuses me. I don't understand the paper 100%, but I would have thought the second camera's distortion was already included in the calibration field. Note however, that using any combination of D0,D1,K0,K1 has relatively little impact on the disparity results (aka I've got bigger problems :P)

thetad = theta.*(1 + D1(1)*theta.^2 + D1(2)*theta.^4 + D1(3)*theta.^6 + D1(4)*theta.^8);
...
u1 = K0(1,1) * xprime1 + K0(1,3); % Should be K0 because surface is already corrected

2) In https://gist.github.com/menandro/cd5f4b5309f16f1a0f1987fcb2baf057, the translateScale is not the same in the forward and backward direction. I don't understand the logic behind that. In my case, I used the same scale in forward/backward direction, but using either a scale of 1 or 0.01 also has little impact on disparity maps.

%% Forward direction
translateScale = 1.0;
...
%% Backward direction
translateScale = 0.01;

3) Just to make sure, does the disparity relate to the raw (i.e. distorted) images? (e.g. a given point in the first image at position (u,v) will be at (u+disp.x, v+disp.y) in the second image?). Asking because for the equidistant dataset, the disparities I get are coherent with that. However, both for the t265 "robot" image (Kannala-Brandt model) and for my own data (Mei model), the disparity maps look pretty good in general, but the values themselves seem off. E.g. for a large patch of an image where I expect a x disparity of ~ -40, the results are smooth and coherent, but the disparity is ~ -20. I've tried quite a few variations of parameters (nLevel, fScale, nWarpIters, nSolverIters, lambda, limitRange) and this remains.

4) In https://github.com/menandro/vfs/blob/master/stereotgv/conversion.cu, I don't quite understand how the triangulation is done, but it feels like

    float xprime0 = (u0 - focalx) / cx;
    float yprime0 = (v0 - focaly) / cy;
...
    float xprime1 = (u1 - focalx) / cx;
    float yprime1 = (v1 - focaly) / cy;

Should rather be xprime0 = (u0 - cx) / focalx and so on. Is that correct?

dprov commented 2 years ago

Ok so having gone through the code more in depth, it would seem clear to me that the answer to point 3 above is "No".

The second image is pre-warped using the calibration vector, and only then do we iteratively solve for the disparity. Since the calibration field is never added back to the disparity, I would have expected the following correspondence between the first and second raw images: (u,v) -> (u + disp.x + calibField(u,v).x, v + disp.y + calibField(u,v).y)

To test this, I've found a few ground truth correspondences manually between the provided robot1.png/robot2.png images as well as for 5 pairs of images I took (and used the appropriate calibration fields). What I find very puzzling, is that for each image, results are significantly closer when skipping the y component of the calibration field, i.e.: (u,v) -> (u + disp.x + calibField(u,v).x, v + disp.y)

This makes me uncertain regarding stereo reconstruction (e.g. would I get better results by taking out the y component of the calibration field form the disparity?)

Any insight on this issue would be helpful.

menandro commented 2 years ago

1) K0,D0,K1,D1 are all used because of the way I solved the trajectory field. It involves projecting a point from image0 to an arbitrary surface then reproject to image1. To project out of image0, we need K0,D0 and to reproject to image1, we need K1,D1. Maybe this video will help. https://www.youtube.com/watch?v=fbv_LJxHEKQ @4:54

2) The tscale is only used to generate a vector and get the direction. In the end, the vector is normalized, so the magnitude initial translation from tscale is ignored.

3) You are right. The disparity is after calibration, so if you try to generate the second image by warping using the disparities, you'll get a calibrated second image.

4) This is definitely a bug. For future reference, the conversion.cu is only added for visualization. In the paper, I didn't evaluate (accuracy etc.) using depth, but only with disparity.

This looks correct: (u,v) -> (u + disp.x + calibField(u,v).x, v + disp.y + calibField(u,v).y) However, I am not sure what you are trying to confirm. What kind of result do you mean is "significantly closer" (e.g. depth, disparity, image, warped image)?

dprov commented 2 years ago

Thanks for your reply and the video link!

I understand that K0, D0, K1 and D1 are needed overall, but I'm still unsure about some details, even having looked at the paper, code and video. Here's my take + what confuses me:

Details

- Create (u,v)_0 surface - Unproject from camera0 onto arbitrary 3D surface (i.e. at arbitrary Z): (X,Y,Z)_0 = unproject((u,v)_0, **K0, D0**) - Compute calibration field: - Rotate points to bring them into camera1's referential (except for translation): (X,Y,Z)_1 = R * (X,Y,Z)_0 - Project onto camera1: (u,v)_1 = project( (X,Y,Z)_1 , **K1, D1**) - calibField = (u,v)_1 - (u,v)_0 - Compute translation field - Repeat for forward and backward directions - Apply translation from stereo calibration to 3d points to bring them into camera1's perspective (assume rotation = identity): (X,Y,Z)_1 = (X,Y,Z)_0 + T*scale - Project onto calibration-corrected camera1 (image1 will already have been warped by the calibration field when the trajectory field is used) : (u,v)_1 = project( (X,Y,Z)_1 , **K0, D0**) - **NOTE** In generateVectorFieldNewton.m (lines 75-79, 93-97), it's rather (u,v)_1 = project( (X,Y,Z)_1 , **K0, D1**). This specific mix of K0 and D1 is what confuses me. - transField = (u,v)_1 - (u,v)_0 - translation_field = normalize(transField_forward - transField_backward)

Thanks for the clarification.
(+ comment at the end of your answer) : I want to confirm whether the computed disparity is correct (i.e. validate that the vector field generation for my camera model works, the combination of solver parameters I selected makes sense, results good enough for my use case, etc.). The easiest way for me to do this was to manually find some ground truth correspondences between images and compare those to the computed disparity. Obviously, computed disparities were way off from the raw image to raw image ground truth (which in hindsight is to be expected). That's what got me investigating and realize that correspondences should be of the form (u,v) -> (u + disp.x + calibField(u,v).x, v + disp.y + calibField(u,v).y)

However, when averaging the absolute disparity error across multiple correspondences in an image pair, adding the calibField improves results along x, but degrades them along y. This is what confuses me and makes me wonder if there's a bug or if I misunderstand something. For reference, I found at least 5 correspondences across 5 of my own image pairs, and also in the robot1/2 samples provided. The same conclusion applies to each of those image pairs individually. For example:

Details

``` # Disparity check for image pair (robot1.png, robot2.png) # NOTE: GT is found manually between pairs of raw images # (dx,dy) = (disp.x, disp.y) vs GT mean(|dx_GT - dx|) = 6.87 mean(|dy_GT - dy|) = 1.09 mean(sqrt((dx_GT-dx)² + (dy_GT-dy)²)) = 6.98 # (dx,dy) = (disp.x + calibField.x, disp.y + calibField.y) vs GT mean(|dx_GT - dx|) = 3.64 mean(|dy_GT - dy|) = 3.81 mean(sqrt((dx_GT-dx)² + (dy_GT-dy)²)) = 6.19 # (dx,dy) = (disp.x + calibField.x, disp.y) vs GT mean(|dx_GT - dx|) = 3.64 mean(|dy_GT - dy|) = 1.09 mean(sqrt((dx_GT-dx)² + (dy_GT-dy)²)) = 4.29 ```

Thanks for the confirmation.

menandro / vfs

Adapting to a different camera model #13