graphdeco-inria / gaussian-splatting

Original reference implementation of "3D Gaussian Splatting for Real-Time Radiance Field Rendering"
https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

Is there a way to use ground truth poses without COLMAP ? #702

Open leblond14u opened 6 months ago

leblond14u commented 6 months ago

Hi,

I'm trying to use ground-truth poses from a dataset (ICL-NUIM) directly with a custom dataloader, but with no success so far (see screenshot below). I'm loading the ground-truth poses and converting them between camera-to-world and world-to-camera as done in the NeRF dataloading sequence. I can't seem to understand what the fundamental problem with my approach is and why it's not working. I would like to be able to use GT poses from any dataset. Any help would be highly appreciated!

[screenshot: rendered image misaligned with the ground-truth image]

Thanks a lot

altaykacan commented 6 months ago

Hey,

I did a similar thing using SLAM output poses and what helped me was to convert my poses into the specific format used by COLMAP.

Here you can find information on the notation COLMAP uses. What gave me trouble was that the quaternions are stored as (qw, qx, qy, qz) (I expected qw last) and that COLMAP uses the world-to-camera transform (most SLAM systems, as far as I know, output camera-to-world).

If you make an artificial images.txt in this format from your GT poses (don't forget to add an extra dummy line after each pose line, since COLMAP saves two lines per image), you can use the default readColmapSceneInfo() function from the repo in scene/dataset_readers.py. You would need to provide an initial point cloud though, since 3DGS expects one. A rough sketch of such a writer is below.
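
Something along these lines should produce a valid images.txt (just a sketch, not tested on your data; it assumes scipy is available and that your poses are already world-to-camera (R, t) pairs, names are illustrative):

import numpy as np
from scipy.spatial.transform import Rotation

def write_images_txt(path, poses_w2c, image_names, camera_id=1):
    """poses_w2c: list of (R, t) world-to-camera pairs, one per frame."""
    with open(path, "w") as f:
        f.write("# IMAGE_ID QW QX QY QZ TX TY TZ CAMERA_ID NAME\n")
        for i, ((R, t), name) in enumerate(zip(poses_w2c, image_names), start=1):
            # scipy's as_quat() returns (qx, qy, qz, qw); COLMAP expects qw first
            qx, qy, qz, qw = Rotation.from_matrix(R).as_quat()
            tx, ty, tz = t
            f.write(f"{i} {qw} {qx} {qy} {qz} {tx} {ty} {tz} {camera_id} {name}\n")
            # COLMAP stores the 2D observations on a second line per image;
            # leave it empty since we have no feature matches
            f.write("\n")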

I also saw your newer issue #706 and maybe this link from the COLMAP documentation helps.

If you share your dataloader code I can try to help more. I hope this helps you out! :)

leblond14u commented 6 months ago

Thanks for your answer @altaykacan :)

Indeed, the world-to-camera (W2C) convention is not the most common one in robotics / computer vision applications.

So if I understood correctly, there is no need to recompute the COLMAP information after creating the images.txt file? Basically I can just convert my C2W data to W2C using this transform:


import numpy as np

# my_initial_R_T_matrix is a 4x4 camera-to-world (C2W) transform
c2w = my_initial_R_T_matrix
# change from OpenGL/Blender camera axes (Y up, Z back) to COLMAP (Y down, Z forward)
c2w[:3, 1:3] *= -1

# get the world-to-camera transform and set R, T
w2c = np.linalg.inv(c2w)
R = w2c[:3, :3].T  # transposed, matching the repo's NeRF-synthetic loader
T = w2c[:3, 3]

(Are you using the same transform for your C2W to W2C conversion?)

Then I convert my rotations R into quaternions (qw, qx, qy, qz) and fill my images.txt file (a sketch of this conversion is below). With that I should be able to get something correct using the COLMAP notation and dataloading sequence.
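
For the quaternion step, a minimal sketch of what I mean (assuming scipy; qvec2rotmat() is, if I remember correctly, the helper from the repo's scene/colmap_loader.py and is handy as a round-trip sanity check):

import numpy as np
from scipy.spatial.transform import Rotation

def rotmat_to_colmap_qvec(R):
    # scipy's as_quat() returns (qx, qy, qz, qw); COLMAP expects (qw, qx, qy, qz)
    qx, qy, qz, qw = Rotation.from_matrix(R).as_quat()
    return np.array([qw, qx, qy, qz])

# optional round-trip check against the repo's own helper:
# from scene.colmap_loader import qvec2rotmat
# assert np.allclose(qvec2rotmat(rotmat_to_colmap_qvec(R)), R, atol=1e-6)
# (quaternions are defined up to sign, so compare rotation matrices, not qvecs)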

What I have done so far is:

  1. Extract my dataset's ground-truth C2W transforms.
  2. Transform my C2W transforms to W2C with the above transformation.
  3. Clone readColmapCameras() and readColmapSceneInfo() so that my dataset's W2C transforms and intrinsics end up in the Camera object instead of the COLMAP ones.
  4. Load my ground-truth point cloud.

However, I realized that with this approach the rendered images of my point cloud were not aligned with my ground-truth images, indicating a wrong transformation somewhere. I can't put my finger on where exactly the issue is, which is bugging me. In the end my approach should give the same result as using the COLMAP files.

Have a nice day, Best regards

altaykacan commented 6 months ago

Happy to help @leblond14u (at least try to help :D)

Two questions:

  1. Why do you flip the axes? I didn't have to flip any axes in my experiments. Maybe you can try visualizing your point cloud in a viewer to see which axis is which.
  2. Why do you take the transpose of the W2C rotation matrix? Taking the initial COLMAP-based implementation as reference, I don't think that is necessary at that point. If you are using readColmapCameras() and readColmapSceneInfo() from the original repo, I think the best way forward is to implement a custom read_extrinsics_text() function (see here) and call it instead of the default one. read_extrinsics_text() is where the default poses from COLMAP are read in; these correspond to the C2W transform matrices as far as I can tell. So if you just take your poses and convert them to C2W instead of W2C, with the correct axes, it should work. The readColmapCameras() function already takes the transpose of the rotation matrix; I saw in another issue that this was done to keep the CUDA code simpler. A rough sketch of such a replacement function is below.
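
Something roughly like this (the Image namedtuple fields are assumed from COLMAP's read_write_model.py, on which the repo's scene/colmap_loader.py is based, so double-check them against your checkout; the pose convention you feed in is exactly the question above):

import collections
import numpy as np
from scipy.spatial.transform import Rotation

# Field names assumed from COLMAP's read_write_model.py -- verify against the repo.
Image = collections.namedtuple(
    "Image", ["id", "qvec", "tvec", "camera_id", "name", "xys", "point3D_ids"])

def read_extrinsics_from_gt(gt_poses, image_names, camera_id=1):
    """Build the same dict that read_extrinsics_text() returns, but directly from
    ground-truth poses given as (R, t) pairs (same convention you would otherwise
    write into images.txt)."""
    images = {}
    for i, ((R, t), name) in enumerate(zip(gt_poses, image_names), start=1):
        qx, qy, qz, qw = Rotation.from_matrix(R).as_quat()  # scipy order: xyzw
        images[i] = Image(
            id=i,
            qvec=np.array([qw, qx, qy, qz]),          # COLMAP order: wxyz
            tvec=np.asarray(t, dtype=np.float64),
            camera_id=camera_id,
            name=name,
            xys=np.empty((0, 2)),                      # no 2D observations
            point3D_ids=np.empty(0, dtype=np.int64))
    return images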

leblond14u commented 6 months ago

Thanks a lot for your help I really need it @altaykacan :P

So for the first one: the cameras in the implementation assume the COLMAP camera-axis convention and my ground-truth data use the OpenGL one, so flipping the axes addresses that.

For the second one: I just copied the transform used for the NeRF datasets. I also tried without the transpose, with no better results. I guess I will check the use of the images.txt file with my original data and flipped axes to see if it works correctly then. As far as I understand, COLMAP stores the W2C camera transforms, which is why I'm converting my C2W dataset poses to W2C.

PS: when using a computer-vision renderer, i.e. with C2W transforms, my renders are correct.

leblond14u commented 6 months ago

Maybe you are right, it might just come down to my conversion of my C2W matrix (with the OpenGL camera format) to the COLMAP W2C matrix (with the COLMAP camera format). Could you share the transform you used for this C2W to W2C conversion? Otherwise, would you be able to test the dataset I'm using (Augmented ICL-NUIM) with your method, to check whether the issue could come from the data itself?

Liu-Jinxin commented 6 months ago

Hi Hugo,

I've also been working on something similar recently, but I haven't used the ICL-NUIM dataset; I plan to use the Replica dataset instead. I'm curious how you managed to obtain the ground-truth point cloud to generate the points3D.bin file. Did you compute the point cloud from the depth images and the corresponding pose of each image, or did you use another method? Thank you for your answer.

leblond14u commented 6 months ago

Hi,

For the GT points I indeed used the re-projection of the depth and RGB images. I also plan to use the Replica dataset later on, but I wanted to really understand the root of my problem with a more hands-on dataset like the Augmented ICL-NUIM one first. I checked the geometric similarity between the point cloud obtained this way and the COLMAP-generated one, and they align perfectly (up to a scale factor and a rigid transformation). Otherwise, a non-colored point cloud is provided with the dataset that you can use directly.
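
A minimal sketch of that back-projection (assuming pinhole intrinsics fx, fy, cx, cy, a metric depth map, an RGB image of the same resolution, and a 4x4 C2W pose; names are illustrative):

import numpy as np

def backproject(depth, rgb, fx, fy, cx, cy, c2w):
    """Lift a depth map to a colored point cloud in world coordinates."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth.reshape(-1)
    valid = z > 0
    # pixel -> camera coordinates (pinhole model)
    x = (u.reshape(-1) - cx) * z / fx
    y = (v.reshape(-1) - cy) * z / fy
    pts_cam = np.stack([x, y, z], axis=1)[valid]
    # camera -> world with the 4x4 C2W pose
    pts_world = pts_cam @ c2w[:3, :3].T + c2w[:3, 3]
    colors = rgb.reshape(-1, 3)[valid]
    return pts_world, colors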

Have you managed to use the Replica GT poses yet? If so, I'd love to hear how you did it, it might solve my issue :)

Update n°1: I still haven't managed to obtain correct poses for 3DGS to render. My renders still look off using the images.txt file and the COLMAP loader. I tried feeding in both my C2W transforms and my W2C transforms, but neither worked so far (see renders below). Some of the poses are "rather close" to the GT images, but some are still completely off (the rendered images are labelled as "prediction" here).

[screenshots: ground-truth images vs. predicted renders]

leblond14u commented 6 months ago

Update n°2: 🎉 I solved my problem, thanks for your help and directions. It turns out the conversion used in 3DGS is NOT applicable to my data / SLAM-like data. Since you were working with SLAM data, I found this issue on adapting ORB-SLAM to COLMAP, which helped me understand the root of my problem. I reviewed my dataloading sequence, simplified it, and rewrote my C2W to W2C conversion without inverting the axes, and it worked perfectly.
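
In short, the conversion that worked is essentially just the pose inverse, with the quaternion taken from the W2C rotation in (qw, qx, qy, qz) order. A minimal sketch (assuming scipy; not my exact code):

import numpy as np
from scipy.spatial.transform import Rotation

def c2w_to_colmap(c2w):
    """c2w: 4x4 camera-to-world matrix from the dataset / SLAM trajectory."""
    w2c = np.linalg.inv(c2w)                 # just invert the pose, no axis flip
    R, t = w2c[:3, :3], w2c[:3, 3]
    qx, qy, qz, qw = Rotation.from_matrix(R).as_quat()
    return np.array([qw, qx, qy, qz]), t     # qvec in COLMAP (wxyz) order, tvec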

Liu-Jinxin commented 6 months ago

Hi @leblond14u, congratulations on resolving the issue! I've recently started tackling this problem myself. If possible, could you share the key points of your solution, or consider making your approach available in a repository? I believe many others, myself included, would benefit from it. Thank you for considering my request.

leblond14u commented 6 months ago

Yes sure, I'll see what I can do about it. In the meantime just tag me in an issue here if you have a specific question. I'll be happy to help as well :)

Have a nice day, Best,

Hugo

Liu-Jinxin commented 6 months ago

Hi Hugo @leblond14u ,

Thank you for your willingness to help and for considering making your approach available. Your expertise is greatly appreciated in this community. I've been working through the implementation details and have come across a few points where your insight could significantly clarify my understanding and approach. Here are the specifics:

  1. Selection in images.txt: For the POINTS2D[] within images.txt, I'm curious about the selection criteria for points to include. Are all points from an image considered, or is there a feature extraction step (like ORB) involved where only key feature points are included?

  2. Relationship in points3D.txt: Regarding the TRACK[] in points3D.txt, is the relationship between point3D and image feature points strictly one-to-one, or is there a method employed that enables a one-to-many relationship between a point3D and feature points across different images? I'm also wondering if this point-to-3D correspondence plays a role in gaussian splatting rendering.

  3. Error Handling: For the error metric, is it set to zero by default, or is there a specific computational method you rely on to define this error?

  4. 3D Feature Points Quantity: Regarding the selection of 3D feature points, is around 10,000 points considered optimal? Does this imply that with an increase in the number of images, the feature points selected per image should decrease?

Thank you again for your time and support.

Jinxin

leblond14u commented 6 months ago

Thank you very much @Liu-Jinxin, it's my pleasure. I'll try to answer as best I can :)

  1. As far as my understanding of COLMAP goes, the POINTS2D[] are the keypoints that COLMAP generates during the feature extraction and matching process and that are visible in image i. I think COLMAP uses SIFT features.
  2. I'm not sure I understood your question correctly, but I'll try to answer anyway. The points featured in images.txt are the same as those in points3D.txt, so there is the same number of points in total. Following my first answer, you can also find the same point featured in several image lines of images.txt. Concerning the impact of these point-to-image relationships on the splatting: as far as my experiments went, they do not seem to play a role. Only the spatial accuracy of the generated points impacts the Gaussian splatting process, since the latter densifies from the initial point cloud.
  3. As I used my ground-truth data, I haven't investigated the COLMAP-induced error yet (a sketch of a minimal points3D.txt writer, with the error simply set to zero, is below this list).
  4. I'm not sure COLMAP has a limit on the number of feature points, in the sense that it would lower the number of features detected per image on large datasets to stay around 10k points. Since the extracted features are SIFT features, I think the whole process just depends on how many of them are detected, and outliers are simply filtered out during matching.
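
For completeness, a sketch of a minimal points3D.txt writer with the error set to zero and the TRACK[] left empty (as far as I can tell the 3DGS loader only uses the xyz and rgb columns, but double-check; xyz and rgb are assumed to be (N, 3) arrays):

import numpy as np

def write_points3D_txt(path, xyz, rgb):
    """xyz: (N, 3) float array, rgb: (N, 3) uint8 array."""
    with open(path, "w") as f:
        f.write("# POINT3D_ID X Y Z R G B ERROR TRACK[]\n")
        for i, (p, c) in enumerate(zip(xyz, rgb), start=1):
            # error is set to 0 and the track is left empty
            f.write(f"{i} {p[0]} {p[1]} {p[2]} {int(c[0])} {int(c[1])} {int(c[2])} 0\n")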

Liu-Jinxin commented 6 months ago

Thanks Hugo, I will have a try.

Tariq-Abuhashim commented 4 months ago

Hi @leblond14u, do you mind sharing your code that converts ORB-SLAM poses and 3D points to the COLMAP coordinate frame? Thank you.