markus-suchi / 3D-DAT

3D Scene Annotation and Dataset Toolkit
MIT License

How to use camera poses from colmap without a robotic arm? #7

Open xczzz opened 1 year ago

xczzz commented 1 year ago

Hi, thanks for your work first of all. I want to use it to annotate my dataset. I do not have a robotic arm, so I use COLMAP to get camera poses, but I don't know how to use these camera poses in your code. Could you give me some advice?

jibweb commented 12 months ago

Thanks for trying out our code! First of all, COLMAP will only recover the camera poses up to a scale factor. For our approach to be usable, you will need to recover that scaling factor so that the scale of your camera poses and object models is consistent. You will need to know some 3D measurement in the real scene to recover it.

Otherwise, according to the COLMAP documentation, the camera poses are in the following format:

# Image list with two lines of data per image:
#   IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME
#   POINTS2D[] as (X, Y, POINT3D_ID)
# Number of images: 2, mean observations per image: 2
1 0.851773 0.0165051 0.503764 -0.142941 -0.737434 1.02973 3.74354 1 P1180141.JPG
2362.39 248.498 58396 1784.7 268.254 59027 1784.7 268.254 -1
2 0.851773 0.0165051 0.503764 -0.142941 -0.737434 1.02973 3.74354 1 P1180142.JPG
1190.83 663.957 23056 1258.77 640.354 59070

whereas we use the TUM format, that is, in groundtruth_handeye.txt, each line represents a camera pose:

IMAGE_ID, TX, TY, TZ, QX, QY, QZ, QW

So it is just a matter of extracting the values from the COLMAP format and reordering them.
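
As a rough illustration, here is a minimal conversion sketch (the file names are placeholders; also note that COLMAP's images.txt stores the world-to-camera transform, so depending on the convention the toolkit expects you may additionally need to invert each pose, which is an assumption you would have to verify):

```python
# Sketch: convert COLMAP's images.txt into a TUM-style trajectory file.
# Assumptions: input/output paths are placeholders, and poses are kept in
# COLMAP's world-to-camera convention -- invert them if your pipeline
# expects camera-to-world poses instead.
with open("images.txt") as src, open("groundtruth_handeye.txt", "w") as dst:
    lines = [l for l in src if not l.startswith("#")]
    for pose_line in lines[::2]:          # every other line holds the pose
        fields = pose_line.split()
        image_id = fields[0]
        qw, qx, qy, qz = fields[1:5]      # COLMAP order: QW QX QY QZ
        tx, ty, tz = fields[5:8]
        # TUM order: ID TX TY TZ QX QY QZ QW
        dst.write(f"{image_id} {tx} {ty} {tz} {qx} {qy} {qz} {qw}\n")
```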

xczzz commented 12 months ago

@jibweb Thanks for your reply! I will try again following your suggestion. Besides, I have two other questions. (1) transforms.json is obtained by colmap2nerf.py, and its matrices are calculated from the COLMAP camera poses. I wonder whether these matrices should be the same as the poses in groundtruth_handeye.txt? (2) Are the COLMAP camera poses the same as the poses in groundtruth_handeye.txt, apart from the format?

jibweb commented 12 months ago

(1) I don't know the specifics of the script you are referring to, but yes, they should represent the same information: the transformation from the camera frame to a consistent world frame (or the inverse transformation). If I remember correctly, NeRF methods commonly represent these transformations as 4x4 pose matrices (or only the first 3 rows, dropping the final 0,0,0,1 row of the homogeneous form). These can be converted to a quaternion-and-translation representation quite easily (using transforms3d or scipy).
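
For example, a hedged sketch using scipy (the key names "frames", "transform_matrix" and "file_path" follow the usual instant-ngp transforms.json layout, which is an assumption here):

```python
import json
import numpy as np
from scipy.spatial.transform import Rotation

# Sketch: turn the 4x4 "transform_matrix" entries from a colmap2nerf-style
# transforms.json into a translation + quaternion per image.
with open("transforms.json") as f:
    meta = json.load(f)

for frame in meta["frames"]:
    T = np.array(frame["transform_matrix"])  # 4x4 homogeneous pose matrix
    R = T[:3, :3]                            # rotation block
    t = T[:3, 3]                             # translation block
    qx, qy, qz, qw = Rotation.from_matrix(R).as_quat()  # scipy returns x, y, z, w
    print(frame["file_path"], *t, qx, qy, qz, qw)
```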

(2) In principle, yes! The only caveat is the scaling issue mentioned in the previous post, as COLMAP with RGB images alone does not have enough information to recover the camera poses at real-world scale, only up to a scaling factor.

xczzz commented 12 months ago

Thank you very much!

xczzz commented 11 months ago

@jibweb Hi, I have tested your approach following the previous discussion. It works, but something is still wrong. I found that two parameters of instant-DexNerf may have an impact: scale and offset. I have checked the instant-ngp repository, but I do not know how to adjust these two parameters for my own dataset. Can you give me some advice?

jibweb commented 11 months ago

I believe you will need to obtain the real world scale of the scene. If none of the sensors used can provide that information, you will need to know the distance between two or more points of the scene to recover the scale.

This issue is also discussed in more detail in one of the COLMAP issues: https://github.com/colmap/colmap/issues/1471
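
As a rough illustration of recovering that scale, here is a minimal sketch (the two reference points and the measured distance below are made-up placeholders; you would pick the points from the COLMAP sparse point cloud or from the camera positions themselves):

```python
import numpy as np

# Sketch: recover the metric scale factor from one known real-world distance,
# then rescale a camera translation. All numbers here are placeholders.
p1_colmap = np.array([0.12, 0.45, 1.30])   # point 1 in COLMAP units
p2_colmap = np.array([0.80, 0.41, 1.28])   # point 2 in COLMAP units
real_distance_m = 0.25                     # measured distance in metres

scale = real_distance_m / np.linalg.norm(p2_colmap - p1_colmap)

# Apply the same factor to each camera translation (TX, TY, TZ) before
# writing groundtruth_handeye.txt; rotations are unaffected by the scaling.
t_colmap = np.array([-0.737434, 1.02973, 3.74354])
t_metric = scale * t_colmap
print(scale, t_metric)
```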