naver / dust3r

DUSt3R: Geometric 3D Vision Made Easy
https://dust3r.europe.naverlabs.com/

Options to do SLAM for video? Get poses and camera intrinsics? #8

Open alik-git opened 4 months ago

alik-git commented 4 months ago

Hi there, congrats on the fantastic work! These are amazing results.

I'm working on 3D mapping systems for robotics, and was wondering: given a video, can this method help with obtaining the camera parameters and poses for each frame?

Do you guys have any scripts already for this? I see that in the example usage you have:

    # retrieve useful values from scene:
    imgs = scene.imgs
    focals = scene.get_focals()
    poses = scene.get_im_poses()

And you can do scene.get_intrinsics(), which is great, but when I run this on 12 images from the Replica dataset, scene.get_intrinsics() outputs 12 different intrinsic matrices, none of which really match the original camera intrinsics of the Replica dataset.

Am I doing something wrong? Should I specify the scale or resolution or something else about the images at some point? The Replica images are 1200x680 (w, h), but I'm assuming they get resized to 512.

Just wondering how I should go about getting the camera parameters for a monocular RGB video, or if that's not really possible to do super accurately yet with this method.

For extra detail, I'm using the following frames from the replica dataset

    image_filenames = [
        'frame000000.jpg', 'frame000023.jpg', 'frame000190.jpg', 'frame000502.jpg',
        'frame000606.jpg', 'frame000988.jpg', 'frame001181.jpg', 'frame001374.jpg',
        'frame001587.jpg', 'frame001786.jpg', 'frame001845.jpg', 'frame001928.jpg'
    ]
...
    images = load_images(images_path_list, size=512)
    pairs = make_pairs(images, scene_graph='complete', prefilter=None, symmetrize=True)
...
    scene = global_aligner(output, device=device, mode=GlobalAlignerMode.PointCloudOptimizer)
    loss = scene.compute_global_alignment(init="mst", niter=niter, schedule=schedule, lr=lr)
...

The output of scene.get_intrinsics() is as follows (I'm only showing two of the 12 matrices here):

print(scene.get_intrinsics())
tensor([[[250.8425,   0.0000, 256.0000],
         [  0.0000, 250.8425, 144.0000],
         [  0.0000,   0.0000,   1.0000]],
...
        [[250.7383,   0.0000, 256.0000],
         [  0.0000, 250.7383, 144.0000],
         [  0.0000,   0.0000,   1.0000]]], device='cuda:0',
       grad_fn=<CopySlices>)

compared to the ground truth camera params of the replica dataset from the camera_params.json

K_given = [
    [600.0, 0, 599.5],
    [0, 600.0, 339.5],
    [0, 0, 1]
]

here is the actual camera_params.json file in case it helps

{
    "camera": {
        "w": 1200,
        "h": 680,
        "fx": 600.0,
        "fy": 600.0,
        "cx": 599.5,
        "cy": 339.5,
        "scale": 6553.5
    }
}

Also, just curious, how would I go about running this on long videos? Or is that not possible yet?

My apologies if these are too many questions! This method is really awesome, and I'm having a lot of fun using it. Thanks again for the wonderful work!

hwanhuh commented 4 months ago

Although the task is a little different, I converted the dust3r result into a NeRF transform matrix and used it to train NeRF. Below is my code.

  import json
  from collections import OrderedDict

  import numpy as np
  from dust3r.utils.device import to_numpy

  # values from the dust3r global alignment
  focals = scene.get_focals().cpu()
  extrinsics = scene.get_im_poses().cpu()
  intrinsics = scene.get_intrinsics().cpu()

  # OpenCV-to-OpenGL axis flip used by the NeRF convention
  OPENGL = np.array([[1, 0, 0, 0], [0, -1, 0, 0], [0, 0, -1, 0], [0, 0, 0, 1]])

  def make_nerf_transform_matrix_json(focals, intrinsics, extrinsics, scale=1):
      js_file = OrderedDict()

      focals = (to_numpy(focals) * scale).tolist()
      # the principal point (cx, cy) sits at half the image size, so (w, h) = 2 * (cx, cy)
      scaled_width = int(intrinsics[0, :, 2][0].item() * scale * 2)
      scaled_height = int(intrinsics[0, :, 2][1].item() * scale * 2)
      aspect_ratio = scaled_width / scaled_height

      # rotate the extrinsic matrix to the NeRF (OpenGL) space
      frames = []
      for i, ext in enumerate(to_numpy(extrinsics)):
          ext[:3, 3:] *= scale
          temp_dict = {
              'file_path': 'images/image_%03d.png' % i,
              'transform_matrix': (ext @ OPENGL).tolist(),
              'fl_x': focals[i][0],
              'fl_y': focals[i][0],
          }
          frames.append(temp_dict)

      js_file['w'] = scaled_width
      js_file['h'] = scaled_height
      js_file['cx'] = scaled_width / 2.0
      js_file['cy'] = scaled_height / 2.0
      js_file['frames'] = frames

      # dust3r does not know the distortion params
      js_file['k1'] = 0.0
      js_file['k2'] = 0.0
      js_file['p1'] = 0.0
      js_file['p2'] = 0.0

      with open('./transforms.json', 'w', encoding='utf-8') as file:
          json.dump(js_file, file, ensure_ascii=False, indent='\t')

Here, the 'scale' variable corresponds to the resize factor computed in the load_images function in dust3r.utils.image.
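
For reference, a minimal sketch of how that factor could be recovered, assuming the original image size is known (the helper below is illustrative, not dust3r API):

    # Hypothetical helper (not part of the dust3r API): recover the uniform resize
    # factor applied when images are loaded with size=512, using the image width.
    def compute_resize_scale(original_width, resized_width):
        return original_width / resized_width

    # The resized width can be read off the intrinsics (2 * cx = 512 here), so for
    # 1200-pixel-wide Replica frames: 1200 / 512 = 2.34375
    scale = compute_resize_scale(1200, 512)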

Unfortunately, the NeRF results with those converted camera poses are terrible. I wonder if my code has some mistake, or if DUSt3R's recovered camera params are not suitable for image-based rendering like NeRF.

MisterEkole commented 4 months ago

Hi @alik-git, DUSt3R isn't suited to recovering camera params. It's true they have functions that help recover camera params, but these params do not correspond to any real-world camera model. Hence it would be highly inefficient to use DUSt3R to recover camera intrinsics; as clearly stated in their paper, their implementation doesn't enforce any geometric constraint.

DUSt3R is effective at reconstructing some scenes, but it wouldn't be ideal for SLAM-based tasks unless it is coupled with some classical technique, or perhaps after performing camera calibration and fine-tuning on your own data and camera model.

Leonard-Yao commented 3 months ago

Hi, DUSt3R is a wonderful work and I want to get poses between two images from the real world. I wonder whether the data from "poses = scene.get_im_poses()" are relative pose estimates or absolute pose estimates? Thank you. The data are as follows:

poses = tensor([[[ 9.9999e-01, -3.6112e-03,  3.9793e-03, -3.2458e-05],
                 [ 3.5933e-03,  9.9998e-01,  4.4953e-03, -1.0670e-03],
                 [-3.9955e-03, -4.4809e-03,  9.9998e-01, -1.4901e-04],
                 [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00]],
...

    [[ 9.9867e-01, -4.9443e-02, -1.4822e-02, -5.3161e-04],
     [ 4.9266e-02,  9.9871e-01, -1.2068e-02,  6.6914e-04],
     [ 1.5399e-02,  1.1321e-02,  9.9982e-01,  1.0587e-04],
     [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00]]],
   device='cuda:0', grad_fn=<CopySlices>)
vincent-leroy commented 3 months ago

Thanks a lot! Hello everyone,

I'll start by answering @alik-git's questions: the scene.get_intrinsics() function returns the intrinsics evaluated for the downscaled images, i.e. maxdim=512.

If you started with images of shape 680x1200, they get reshaped to 290x512, i.e. a factor of 2.34375. If you want the intrinsics of the image at the original resolution, you thus need to multiply the optical center and focal length by this factor, which gives:

tensor([[587.9121,   0.0000, 600.0000],
        [  0.0000, 587.9121, 337.5000],
        [  0.0000,   0.0000,   1.0000]])

which looks pretty close to what you gave as GT =)
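
To make this concrete, here is a minimal sketch of that rescaling (it just multiplies the focal lengths and principal point by the factor; not an official dust3r utility):

    import torch

    def rescale_intrinsics(K, factor):
        # K: (N, 3, 3) intrinsics estimated by dust3r at maxdim=512
        K = K.clone()
        K[..., 0, 0] *= factor  # fx
        K[..., 1, 1] *= factor  # fy
        K[..., 0, 2] *= factor  # cx
        K[..., 1, 2] *= factor  # cy
        return K

    # Replica example: original width 1200, dust3r width 512 -> factor = 2.34375
    K_full = rescale_intrinsics(scene.get_intrinsics().detach().cpu(), 1200 / 512)

With the values above, 250.8425 * 2.34375 ≈ 587.91, matching the rescaled matrix shown.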

vincent-leroy commented 3 months ago

@hwanhuh I gave a quick look at your code and it seems you are mistakenly assuming that scene.get_im_poses() returns camera matrices (world2cam), whereas it returns cam2world matrices: https://github.com/naver/dust3r/blob/517a4301745c650730f65482fb6de9889c0b2db9/dust3r/cloud_opt/optimizer.py#L134

You should thus invert them to get camera projection matrices.
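
For illustration, a minimal sketch of that inversion, assuming scene comes from the global_aligner as in the snippets above:

    import torch

    cam2world = scene.get_im_poses().detach()  # (N, 4, 4), camera-to-world
    world2cam = torch.linalg.inv(cam2world)    # (N, 4, 4), world-to-camera

    # Equivalent closed form for rigid transforms: R^T and -R^T @ t
    R, t = cam2world[:, :3, :3], cam2world[:, :3, 3:]
    world2cam_alt = torch.eye(4, device=cam2world.device).repeat(len(cam2world), 1, 1)
    world2cam_alt[:, :3, :3] = R.transpose(1, 2)
    world2cam_alt[:, :3, 3:] = -R.transpose(1, 2) @ t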

vincent-leroy commented 3 months ago

@MisterEkole it is true that the implicit camera model of DUSt3R is an over-parametrization of standard camera models, which enforces fewer constraints. However, we train only on data captured with "pinhole" camera models, and we show in additional experiments that the predictions closely follow the pinhole camera model even though we do not enforce it explicitly.

The experiments are as follows: we compute the average absolute error of field-of-view estimates (in degrees) and the average 2D reprojection accuracy (in %) at a threshold of 1% of the image diagonal, both measured on raw predictions for 1000 randomly sampled (unseen) test images from the Habitat, BlendedMVS and CO3D datasets. Note that Habitat and BlendedMVS are synthetically generated, so the intrinsics are perfectly known. For CO3D, we consider approximate focals estimated via COLMAP.

[results table shown as an image in the original comment]

In conclusion, DUSt3R predicts a 3D geometry that closely respects the pinhole camera model and allows for reliable focal length estimation, even in the monocular case. Of course, there are degenerate cases and inaccuracies, as the problem of focal estimation from one or few views is ill-posed and one cannot always rely upon data priors.
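
For context, the field-of-view error above relates a predicted focal length to an angle through the standard pinhole relation; a small sketch (plain geometry, not dust3r code):

    import math

    def horizontal_fov_deg(focal_px, image_width_px):
        # pinhole model: FoV = 2 * atan(W / (2 * f))
        return math.degrees(2.0 * math.atan(image_width_px / (2.0 * focal_px)))

    # e.g. the estimate above: f ~= 250.84 px at a 512 px width -> ~91 degrees
    print(horizontal_fov_deg(250.84, 512))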

vincent-leroy commented 3 months ago

@Leonard-Yao The world2cam matrices are expressed in a common reference frame although the term "absolute" might be misleading: the poses are not expressed in a metric space, because the reconstruction is performed up to an unknown scale factor.

Leonard-Yao commented 3 months ago

@Leonard-Yao The world2cam matrices are expressed in a common reference frame although the term "absolute" might be misleading: the poses are not expressed in a metric space, because the reconstruction is performed up to an unknown scale factor.

@vincent-leroy Thank you for your patient reply. After getting the depth of every pixel in the paired pictures, can I figure out the unknown scale factor from the returned 3D points, e.g. factor = real depth / returned depth? Something like the sketch below (assuming I have a metric ground-truth depth map for one view; names are illustrative):
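
    import numpy as np

    def estimate_scale(pred_depth, gt_depth, min_depth=1e-3):
        # median ratio between metric depth and dust3r's up-to-scale depth
        pred = np.asarray(pred_depth).reshape(-1)
        gt = np.asarray(gt_depth).reshape(-1)
        valid = (pred > min_depth) & (gt > min_depth)
        return float(np.median(gt[valid] / pred[valid]))

    # multiplying dust3r's depths / translations by this factor puts them in metric units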

alik-git commented 3 months ago

Thanks for answering my question @vincent-leroy!

And just a follow-up question: what do you think about getting poses? Basically, do you think this can be extended into a full-fledged SLAM system, or would that be a lot of work?

vincent-leroy commented 3 months ago

I think it should not be too hard to build a functional SLAM system with it, but some points need to be figured out. We might look into that in the future, but it is not the top priority.

Mr-ZhuJun commented 3 months ago

(quoting @hwanhuh's NeRF transform conversion code and results above)

Hello, I would like to ask: if the poses estimated by DUSt3R are put into NeRF, will there be ghosting or blur in the NeRF rendering?

andybak commented 3 months ago

Not sure why this is closed. @alik-git can you reopen? It's a fruitful discussion.

alik-git commented 3 months ago

Sure, no problem. Reopening.

Bin-ze commented 3 months ago

(quoting @hwanhuh's NeRF conversion code and @Mr-ZhuJun's question about ghosting in NeRF renders, above)

I'm a NeRF researcher and I'm very interested in combining DUSt3R with NeRF modeling. I would like to know what the results of your experiments are so far. Can you share the render results you got with NeRF in this issue? Maybe we can figure out a way to make it work together. Thank you!

PHOENIXFURY007 commented 3 months ago

(quoting @vincent-leroy's explanation of the rescaled intrinsics above)

This looks really interesting. @vincent-leroy, can you share the code for how you got the scaled-down intrinsics?

PHOENIXFURY007 commented 3 months ago

Is it possible to get the focals and intrinsics for in-the-wild samples, assuming the values are close to (but not exactly) the real ones?

ljjTYJR commented 2 months ago

For a SLAM system, I wonder how to keep the scale consistent, especially when the scene changes (e.g. from outdoor to indoor).