gkiavash / Master-Thesis-Structure-from-Motion


Boost COLMAP with initial positions #11

Open gkiavash opened 1 year ago

gkiavash commented 1 year ago

Introduction

The basic idea is to speed up the SfM pipeline in COLMAP.

Currently, high-quality structure-from-motion pipelines are far from real-time usage. For example, in our previous experiment (#8), the dataset contains 175 distorted frames extracted at 3 fps, i.e., about 58 seconds of video. It took 2 hours to compute the final point cloud and camera poses. It takes even longer when the images are distorted and the camera parameters are given.

The main problem

After investigating the execution time and the logs of the COLMAP software, I realized that the most time-consuming parts are finding the initial pair and registering the next image. Since the images are sequential video frames, the order in which the next images should be registered is already known. Also, with an initial guess for the position of the next frame, bundle adjustment starts with less error and converges faster.

gkiavash commented 1 year ago

In order to test whether giving an initial location can reduce the time while preserving the quality of the point cloud, first, a new dataset with the same images as #8 was chosen. Then, following the COLMAP documentation, an initial position was given to each image. At first, the positions were taken from the reconstruction in #8, with some manual perturbations, to see if it works. Later, other visual odometry applications were used to calculate the initial positions.

It was observed that the total runtime decreased from 2 hours to 6 minutes, and a point cloud of the same quality was obtained. The perturbed camera poses were refined, and the number of bundle adjustment steps decreased significantly.
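Per the COLMAP documentation, initial poses are supplied as a text-format model whose `images.txt` lists one world-to-camera pose per image (the model also needs a `cameras.txt` and an empty `points3D.txt` alongside it). A minimal sketch of writing such a file — the `write_images_txt` helper and its dict layout are my own, only the field order comes from the COLMAP format:

```python
# Minimal sketch: write a COLMAP-style images.txt with initial poses.
# Field order follows the COLMAP text format:
#   IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME
# Poses are world-to-camera. Each image takes two lines; the second
# (POINTS2D) line is left empty because there are no observations yet.

def write_images_txt(path, poses):
    """poses: list of dicts with keys image_id, qvec (qw, qx, qy, qz),
    tvec (tx, ty, tz), camera_id, name."""
    lines = ["# IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME"]
    for p in poses:
        qw, qx, qy, qz = p["qvec"]
        tx, ty, tz = p["tvec"]
        lines.append(
            f'{p["image_id"]} {qw} {qx} {qy} {qz} {tx} {ty} {tz} '
            f'{p["camera_id"]} {p["name"]}'
        )
        lines.append("")  # empty POINTS2D line
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

The perturbation experiment above then amounts to jittering `qvec`/`tvec` before writing the file and letting bundle adjustment refine them.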

gkiavash commented 1 year ago

Calculating initial camera poses

There are several approaches to visual odometry. I could obtain good results with two of them:

  1. Sparse COLMAP with low-resolution images
    • COLMAP's sparse reconstruction itself can be used to find initial camera poses.
    • If we lower the resolution of the images and the limit on the number of detected features (not the matches), we obtain the camera poses in a very short time. In my experiment, a batch of 30 images with a maximum of 2000 features each was solved in less than 50 seconds. Note: if we decrease the limit further, there won't be enough points to converge, so more bundle adjustment is attempted and it takes more time.
    • After obtaining the camera poses, we run the main reconstruction with known camera poses. The total time for my dataset (175 images), at almost the same quality as #8 (~100k points), was 15 minutes (including all steps such as loading images, building the database, etc.).
  2. cvg/Hierarchical-Localization:
    • It has an easy and fast installation and straightforward usage examples.
    • It takes a batch of images and reconstructs the scene, including a PLY file and poses (called offline), and finds a pose for a query image (called online). Finding the pose of a query image is fast. However, since our images are sequential frames, this is not exactly what we need: for every new frame, we would have to recompute the offline reconstruction, at least every k frames. I only used its offline results for a batch of images, not the queries.
    • Good accuracy for the offline images
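Approach 1 can be sketched as two COLMAP CLI passes: a cheap pass to estimate poses, then a full-quality pass that triangulates against those known poses via `point_triangulator`. The paths, the `max_image_size` cap, and the `build_colmap_commands` helper are illustrative assumptions; the `--input_path` model must already contain the pass-1 poses in text format (e.g., the `images.txt` described in the COLMAP docs):

```python
# Sketch of the two-pass idea: (1) low-resolution, feature-capped
# reconstruction to recover approximate poses quickly; (2) full-quality
# features triangulated against those known poses. Paths are placeholders.

def build_colmap_commands(image_path, work_dir, max_features=2000):
    db_fast = f"{work_dir}/fast.db"
    db_full = f"{work_dir}/full.db"
    return [
        # Pass 1: capped features and downscaled images -> fast sparse model.
        ["colmap", "feature_extractor",
         "--database_path", db_fast, "--image_path", image_path,
         "--SiftExtraction.max_num_features", str(max_features),
         "--SiftExtraction.max_image_size", "1000"],
        ["colmap", "sequential_matcher", "--database_path", db_fast],
        ["colmap", "mapper",
         "--database_path", db_fast, "--image_path", image_path,
         "--output_path", f"{work_dir}/sparse_fast"],
        # Pass 2: full-quality features; triangulate with the known poses
        # placed (as a text model) in {work_dir}/known_poses.
        ["colmap", "feature_extractor",
         "--database_path", db_full, "--image_path", image_path],
        ["colmap", "sequential_matcher", "--database_path", db_full],
        ["colmap", "point_triangulator",
         "--database_path", db_full, "--image_path", image_path,
         "--input_path", f"{work_dir}/known_poses",
         "--output_path", f"{work_dir}/sparse_full"],
    ]

if __name__ == "__main__":
    import subprocess
    for cmd in build_colmap_commands("frames", "out"):
        print(" ".join(cmd))
        # subprocess.run(cmd, check=True)  # run when COLMAP is installed
```

`sequential_matcher` fits here because the frames are ordered video; for unordered images, `exhaustive_matcher` would be the usual choice.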

I tried to set up ORB-SLAM2, ORB-SLAM3, and VSO, but their installation was too complex and full of versioning conflicts (my tries). Nonetheless, I found some Docker images with ORB-SLAM3 already installed; however, I still couldn't get results from them. I am still looking for better visual SLAM code.

Conclusion

Having initial guesses for the camera poses can reduce the execution time significantly. Also, I have been working on adding frames sequentially (here). I succeeded in forcing the pipeline to register only the next frame at each step; however, the result doesn't have enough quality yet. I am working on putting the initial poses and sequential frames together.

gkiavash commented 1 year ago

Results

For the demo, the same dataset described in #8 is used. First, I extracted camera poses with COLMAP itself by reducing the number of features, which took 15 minutes. Then, I used these camera poses for the main high-quality reconstruction; this time, it took only 4 minutes.

Here are some screenshots of the final point cloud, which can be compared to https://github.com/gkiavash/Master-Thesis-Structure-from-Motion/issues/8#issuecomment-1374847826:

(Screenshots: snapshot01, snapshot03, Screenshot from 2023-03-06 00-05-59, Screenshot from 2023-03-06 00-07-21)

As you can see, the poses are wrong: the camera heading is tilted to the left, while it should point forward.

(Screenshot from 2023-03-06 00-07-47)