visinf / multi-mono-sf

Self-Supervised Multi-Frame Monocular Scene Flow (CVPR 2021)
Apache License 2.0

Scene flow of static background #4

Closed rohaldb closed 3 years ago

rohaldb commented 3 years ago

I have noticed, when evaluating the pre-trained model on a dataset with camera motion, that the model predicts large scene flow for background objects that are actually static. As far as I can tell this is an unintended consequence - is that correct? In theory it yields invalid results, since non-zero scene flow vectors in world space would mean that the static objects are moving.

I discovered this on my custom dataset, but it can also be seen on the DAVIS camel scene, in which the camel is moving but the background is static (in world coordinates). Example input frame:

[input frame: 00010]

An example scene flow (visualisation) output is as follows (the red and blue squares have been manually annotated by me for the following reason):

[scene flow visualisation: download]

Inspecting the values at the red (corresponding to static objects) and blue (corresponding to dynamic objects) squares, we see that the scene flow is larger at the red than at the blue:

(max and min are across all three sf dimensions)

My use of this model is contingent on its ability to predict the scene flow of static objects as 0. Is this a known issue, or am I missing something?

Thanks so much!

hurjunhwa commented 3 years ago

Hi,

The model estimates scene flow relative to the camera. Thus, if there is camera ego-motion, the scene flow in static regions has a non-zero value (it is the 3D motion relative to the camera). If the camera stays still, then the model should output zero scene flow for the static regions (hopefully!).

If you want to estimate 3D scene flow defined in world coordinates, you can first estimate the camera ego-motion from the estimated scene flow using a robust estimation method (such as RANSAC + least squares, or any other) and then subtract the camera-induced motion (it can be noisy, though).
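Just to sketch the idea (this is not code from this repo; it assumes you already have the back-projected 3D points P and the predicted scene flow S as (N, 3) numpy arrays, and all names here are made up), a RANSAC + least-squares (Kabsch) fit could look roughly like this:

```python
# Hypothetical sketch: estimate camera ego-motion (R, t) from predicted scene
# flow by robustly fitting a rigid transform P -> P + S, then subtract it.
import numpy as np

def fit_rigid(P, Q):
    """Least-squares rigid transform (Kabsch): R @ P_i + t ~= Q_i, with P, Q of shape (N, 3)."""
    cP, cQ = P.mean(0), Q.mean(0)
    H = (P - cP).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t

def ransac_ego_motion(P, S, iters=200, thresh=0.05, sample=3):
    """RANSAC over 3D point pairs (P, P + S); static pixels end up as the inliers."""
    Q = P + S
    best_inliers, rng = None, np.random.default_rng(0)
    for _ in range(iters):
        idx = rng.choice(len(P), size=sample, replace=False)
        R, t = fit_rigid(P[idx], Q[idx])
        resid = np.linalg.norm((P @ R.T + t) - Q, axis=1)
        inliers = resid < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all inliers for the final estimate.
    return fit_rigid(P[best_inliers], Q[best_inliers])

# Usage: R, t = ransac_ego_motion(P, S); S_cam = (P @ R.T + t) - P; S_world = S - S_cam.
```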

By the way, the model is trained only on the KITTI dataset, so it may not work well on a custom dataset. For better accuracy, training or fine-tuning on the target dataset is recommended.

rohaldb commented 3 years ago

Thanks for a speedy reply!

Since the outputs are motion relative to the camera, does that mean the outputs are in view/camera coordinates? Or is there a difference between "world coordinates relative to the camera" and "camera coordinates" that I am not understanding? Because of https://github.com/visinf/self-mono-sf/issues/3#issuecomment-648103995 I had assumed the outputs were in world coordinates.

As part of my setup, I am already given the camera pose for each frame, so I already know the camera ego-motion. If I know the camera translation between frames A and B is X, can I just subtract X from the model's scene flow prediction at every pixel to obtain the desired result? Do I also need to account for rotation? Any resource you can provide on how to subtract the ego-motion would be appreciated.

Unfortunately, I cannot do fine-tuning or training on my custom dataset as my setup is only a monocular video - I don't have stereo image pairs.

Thank you very much for the help @hurjunhwa - my vision/graphics is not very strong so this is a huge help!

hurjunhwa commented 3 years ago

You are welcome!

I am sorry that my explanation might have been a bit confusing.

https://github.com/visinf/self-mono-sf/issues/3#issuecomment-648103995 meant that the output scene flow is in metric (meter) scale in the world. (I updated the comment there.. thanks!) That's why we were able to evaluate our method on the KITTI Scene Flow benchmark without using any of the normalization tricks that other papers do. The output itself is relative to the camera, so yes, it is defined in camera coordinates.

In order to get the scene flow in world coordinates, you need both the camera translation and rotation. Given this camera ego-motion (i.e., translation + rotation), you can first calculate the scene flow induced by the camera motion (S_cam) for each pixel, using the estimated depth. Then you can subtract the camera motion (S_cam) from the output scene flow (S_out) to obtain the scene flow in world coordinates (S_world):

S_world = S_out - S_cam

However, one thing I worry about is that it may not work that well on a custom dataset due to differences in camera intrinsics. The model is trained only on KITTI, so when testing on images with a different resolution or different camera intrinsics, it will likely not output the correct scale of depth and scene flow.

I guess this is the future direction to improve the work.

rohaldb commented 3 years ago

Ah, that makes much more sense, thanks!

And since S_out is in camera coordinates, I assume that S_cam should also be in that same camera's coordinates?

If that is the case (and as a final question), it's not clear to me how to compute S_cam for each point. Would you be able to point me to a resource that explains this? Would it be something along the lines of:

  1. Estimating the 3D position of all pixels in the image in world coordinates using the depth
  2. Applying the camera motion (translation and rotation) to each point
  3. Moving the resulting points back into the camera's coordinate system

My use case is that it is fine for me to only know the scene flow/depth up to scale and shift. In any case, I will try and see the results 😊

Thanks again!

hurjunhwa commented 3 years ago

Yes, S_cam should be also in the same camera's coordinates.

And yes, what you described is basically correct. The simple equation would be like this:

P = D(p) K^(-1) p
S_cam(p) =  M P - P

Here P is the 3D point of the pixel p, obtained by back-projecting the pixel p into 3D space using the given camera intrinsics K and determining its 3D position from its depth value D(p).

Then, you can get where the 3D point moves under the camera motion by using the camera pose M, which is typically represented as a 4-by-4 matrix. (ref)

Finally, you can get the 3D scene flow induced by the camera motion by subtracting P from MP.
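As a rough sketch in code (illustrative only, not code from this repository; it assumes depth is the per-pixel depth map, K the 3x3 intrinsics, and M the 4x4 relative camera motion taking points from the reference camera frame to the target camera frame):

```python
# Illustrative sketch of the equations above, computed densely with numpy:
#   P = D(p) * K^-1 * p    and    S_cam(p) = M P - P
import numpy as np

def scene_flow_from_camera_motion(depth, K, M):
    """depth: (H, W) depth map, K: (3, 3) intrinsics, M: (4, 4) relative camera motion."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)   # homogeneous pixels p
    # Back-project: P = D(p) * K^-1 * p  ->  (N, 3) points in the reference camera frame
    P = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)
    # Transform by the camera motion: M P (using homogeneous coordinates)
    P_h = np.concatenate([P, np.ones((P.shape[0], 1))], axis=1)
    MP = (M @ P_h.T).T[:, :3]
    # Scene flow induced purely by the camera motion
    return (MP - P).reshape(H, W, 3)

# Then: S_world = S_out - S_cam, with S_out the model's predicted scene flow.
```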

You can also refer to other papers that jointly estimate camera pose, motion, and depth (e.g., GeoNet). Although they don't estimate scene flow, they are based on the same equations, and you may find them helpful for understanding this better. :) Reading the textbook Multiple View Geometry in Computer Vision can also be helpful.

rohaldb commented 3 years ago

I attempted the above and was unable to get it to work, but I'm going to post what I did here in case it helps anyone in the future, or in case someone knows what is going wrong.

1. Estimating properties

I ran my images through COLMAP to obtain camera intrinsics and pose estimates. The sparse binary files can be written to the poses_bounds format of LLFF using this file, and then read using this file. After this process, the pose matrices and camera intrinsics are given in familiar forms.

2. Running the model

Next, I ran the multi-mono-sf model on my custom dataset. For example, running it on these two input frames

[input frames: 00009, 00010]

yields the following scene flow estimate, which, you'll notice, has non-zero scene flow for the static background (visualised using compute_color_sceneflow):

[scene flow visualisation: 00009_sf]

3. Removing the scene flow due to camera ego-motion

Modifying @hurjunhwa's code from above slightly, we can compute the scene flow of a pixel p induced by the camera's motion from pose 1 to pose 2 as follows:

  1. Take a pixel p and back-project it into camera 1's view space using P_C1 = D(p) K^(-1) p, as per above
  2. Estimate the 3D position of that same point in camera 2's view space (call it P_C2) by first transforming it to world space and then back into camera 2's view space. Operationally, this amounts to multiplying P_C1 by camera 1's pose (camera-to-world) matrix and then by camera 2's extrinsic (world-to-camera) matrix.
  3. The estimated scene flow due to camera motion is then S_cam(p) = P_C2 - P_C1
  4. Finally, subtract the scene flow due to camera motion from the model's output

Accompanying code can be found in this gist.
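As a condensed sketch of steps 2-3 (hypothetical names, not the gist itself; it assumes camera-to-world pose matrices from COLMAP/LLFF, already converted to the same axis convention as the model):

```python
# Sketch: compose the two camera-to-world poses into the camera-1-to-camera-2
# transform and apply it to the back-projected points.
import numpy as np

def camera_motion_flow(P_C1, pose1_c2w, pose2_c2w):
    """P_C1: (N, 3) points in camera 1's view space (from P_C1 = D(p) K^-1 p),
    pose*_c2w: (4, 4) camera-to-world matrices estimated by COLMAP."""
    T_1to2 = np.linalg.inv(pose2_c2w) @ pose1_c2w      # cam1 -> world -> cam2
    P_h = np.concatenate([P_C1, np.ones((len(P_C1), 1))], axis=1)
    P_C2 = (T_1to2 @ P_h.T).T[:, :3]
    return P_C2 - P_C1                                  # S_cam(p) = P_C2 - P_C1
```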

Visualising S_cam reveals that points farther away in the scene seem to have larger scene flow due to camera motion, which intuitively seems correct:

[S_cam visualisation: download (1)]

However, it doesn't seem to account for the foreground (the child) particularly well, so when I subtract S_cam from the multi-mono-sf estimates, the results don't resemble what I would've hoped:

[residual scene flow visualisation: download (2)]

Rather, if this had worked, we would've expected something more like this:

[expected result mock-up: something]

Perhaps the issue comes from poor multi-mono-sf model outputs on a new dataset, or poor estimates from COLMAP, or a combination of the two? Alternatively, there is an error in my methodology.

hurjunhwa commented 3 years ago

Wow, thanks for sharing your experiment results!

I guess the problem is a combination of multiple factors.

  1. Poor multi-mono-sf results: this is inevitable, since the model is trained only on the KITTI dataset.

  2. Scale mismatch: the scale of the multi-mono-sf scene flow and of S_cam is different. More precisely, the scale of the depth and scene flow from multi-mono-sf differs from the scale of the depth and poses from COLMAP. You probably need to rescale one of them using an affine transformation (e.g., ax + b).

  3. Is the sign of S_cam reversed? I should double-check the equations or code, but just by looking at the visualization, I think the sign of S_cam needs to be reversed (either one axis or all three). The magenta color means that a 3D point moves to the right along the x-axis, but the visualization of S_cam is all greenish, which means that all points move to the left in camera coordinates.

I guess that by correcting 2. and 3. we may get a visualization similar to the last image you shared, but there may still be a lot of non-zero residual scene flow in the static regions, unless the model is trained on 'in-the-wild' video.

rohaldb commented 3 years ago

Regarding 3, you are very right! I realised that the data loader I used (originating from NeRF) follows [right, up, backwards], i.e. [x, y, z], while KITTI, and subsequently this codebase, follows [right, down, forward], i.e. [x, -y, -z]. This was a simple adjustment, although it does not resolve the issue:

[result after fixing the axis convention: download-3]
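For reference, one common way to express this conversion (a sketch, assuming camera-to-world matrices in the LLFF/NeRF [right, up, backwards] convention) is to negate the camera's y and z axes:

```python
# Sketch: convert a camera-to-world pose from the NeRF/LLFF convention
# ([right, up, backwards]) to the OpenCV/KITTI convention ([right, down, forward])
# by flipping the sign of the camera's y and z axes (2nd and 3rd rotation columns).
import numpy as np

def llff_to_opencv(c2w):
    """c2w: (4, 4) camera-to-world matrix in the [right, up, backwards] convention."""
    out = c2w.copy()
    out[:3, 1:3] *= -1   # negate the up and backwards axes -> down and forward
    return out
```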

Regarding 2, I am experimenting with different affine transformations to see what I can get to work. A simple way to find a suitable transformation is to take a large set of points that we know are background (and should have zero scene flow) and solve for the affine transformation that makes S_cam match the model's scene flow in those regions. Doing so yields the following, which is a huge improvement!

[residual scene flow after affine rescaling: download (3)]
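The fit itself can be a plain least-squares problem over the selected background pixels; as a sketch (s_cam_bg and sf_bg are hypothetical (N, 3) arrays of the camera-induced flow and the model's flow at pixels assumed to be static):

```python
# Sketch: solve for a, b in  a * S_cam + b ~= sf  over known-background pixels,
# so that the rescaled S_cam cancels the model's flow in static regions.
import numpy as np

def fit_affine_scale(s_cam_bg, sf_bg):
    """s_cam_bg, sf_bg: (N, 3) flows at pixels assumed static. Returns (a, b)."""
    x = s_cam_bg.reshape(-1)
    y = sf_bg.reshape(-1)
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)   # per-channel fits also possible
    return a, b

# Apply to the whole image:  S_world = sf - (a * S_cam + b)
```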

Would you say that is a reasonable way to solve for the transform, or is there a better way you had in mind? In either case, I am quite satisfied with the results, and I think the plot above nicely highlights the hidden noise in the model estimates. Thanks again for all the help @hurjunhwa!

hurjunhwa commented 3 years ago

Cool! Yes, I think the transformation you used is a reasonable way to do it! When sampling points and estimating the affine parameters, using RANSAC to discard outliers (i.e., moving objects) would improve things further, I guess.
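As a sketch of that idea (using scikit-learn's RANSACRegressor; the array names are hypothetical, and the points are sampled over the whole image rather than a hand-picked background mask):

```python
# Sketch: robustly fit  a * S_cam + b ~= sf  with RANSAC so that pixels on
# moving objects are treated as outliers instead of being masked by hand.
import numpy as np
from sklearn.linear_model import RANSACRegressor

def fit_affine_scale_ransac(s_cam, sf, thresh=0.05):
    """s_cam, sf: (N, 3) flows sampled over the whole image. Returns (a, b)."""
    X = s_cam.reshape(-1, 1)
    y = sf.reshape(-1)
    model = RANSACRegressor(residual_threshold=thresh).fit(X, y)
    a = model.estimator_.coef_[0]
    b = model.estimator_.intercept_
    return a, b
```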

You are very welcome, and thanks for sharing your results. I can see lots of non-zero residual scene flow in the background pixels 😃 .