COLMAP

Steps of image-based 3D reconstruction

Structure-from-Motion recovers a sparse representation of the scene and the camera poses of the input images.
This output then serves as the input to Multi-View Stereo, which recovers a dense representation of the scene.

Automatic Reconstruction tool

INPUT: takes a folder of input images
OUTPUT: a sparse and dense reconstruction
path/to/project/sparse: the sparse models for all reconstructed components
path/to/project/dense: their corresponding dense models.

Structure-from-Motion (SfM)

SfM is the process of reconstructing 3D structure from its projections into a series of images.
The input is a set of overlapping images of the same object, taken from different viewpoints.
The output is a 3D reconstruction of the object, plus the reconstructed intrinsic and extrinsic camera parameters of all images.
Typically, Structure-from-Motion systems divide this process into three stages:
- Feature detection and extraction
- Feature matching and geometric verification
- Structure and motion reconstruction

Multi-View Stereo (MVS)

MVS takes the output of SfM to compute depth and/or normal information for every pixel in an image.
Fusion of the depth and normal maps of multiple images in 3D then produces a dense point cloud of the scene.
Using the depth and normal information of the fused point cloud, algorithms such as the (screened) Poisson surface reconstruction can then recover the 3D surface geometry of the scene.

Terminology

- camera: associated with the physical object of a camera using the same zoom factor and lens. A camera defines the intrinsic projection model in COLMAP. A single camera can take multiple images with the same resolution, intrinsic parameters, and distortion characteristics.
- image: associated with a bitmap file, e.g., a JPEG or PNG file on disk.
- keypoints and descriptors: COLMAP detects keypoints in each image whose appearance is described by numerical descriptors.
- matches: pure appearance-based correspondences between keypoints/descriptors.
- inlier matches: matches that are geometrically verified and used for the reconstruction procedure.
Step 1. Feature extraction (using SIFT).

Step 2. Feature matching (exhaustive matching, sequential ... ).

Step 3. Sparse Reconstruction.

Produces a sparse representation of the scene and the camera poses.
The sparse folder contains cameras.txt, images.txt, and points3D.txt.
cameras.txt: camera intrinsics, with one line of data per camera:
- CAMERA_ID, MODEL, WIDTH, HEIGHT, PARAMS[]
- e.g. 1 PINHOLE 1024 575 1467.12 1467.12 512 287.5
- The PINHOLE model has 4 parameters: two focal lengths (fx, fy) followed by the principal point (cx, cy)
images.txt: two lines of data per image:
- This file contains the pose and keypoints of all reconstructed images in the dataset
- The local camera coordinate system of an image is defined in a way that the X axis points to the right, the Y axis to the bottom, and the Z axis to the front as seen from the image.
- IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME
- POINTS2D[] as (X, Y, POINT3D_ID)
points3D.txt: the 3D point list, with one line of data per point:
- POINT3D_ID, X, Y, Z, R, G, B, ERROR, TRACK[] as (IMAGE_ID, POINT2D_IDX)
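
To make the text formats above concrete, here is a minimal sketch in plain Python that parses one cameras.txt line and the first (pose) line of an images.txt entry. It assumes, following COLMAP's format documentation, that the quaternion and translation describe the world-to-camera transform, so the camera center in world coordinates is -R^T t. The function names and the sample image line are made up for illustration and are not part of COLMAP.

```python
def parse_camera_line(line):
    """Parse one data line of cameras.txt: CAMERA_ID, MODEL, WIDTH, HEIGHT, PARAMS[]."""
    parts = line.split()
    return {
        "camera_id": int(parts[0]),
        "model": parts[1],
        "width": int(parts[2]),
        "height": int(parts[3]),
        "params": [float(p) for p in parts[4:]],
    }


def parse_image_pose_line(line):
    """Parse the first line of an images.txt entry:
    IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME."""
    parts = line.split()
    return {
        "image_id": int(parts[0]),
        "qvec": tuple(float(v) for v in parts[1:5]),
        "tvec": tuple(float(v) for v in parts[5:8]),
        "camera_id": int(parts[8]),
        "name": " ".join(parts[9:]),
    }


def qvec_to_rotmat(qvec):
    """Convert a unit quaternion (QW, QX, QY, QZ) to a 3x3 rotation matrix."""
    w, x, y, z = qvec
    return [
        [1 - 2 * y * y - 2 * z * z, 2 * x * y - 2 * w * z, 2 * x * z + 2 * w * y],
        [2 * x * y + 2 * w * z, 1 - 2 * x * x - 2 * z * z, 2 * y * z - 2 * w * x],
        [2 * x * z - 2 * w * y, 2 * y * z + 2 * w * x, 1 - 2 * x * x - 2 * y * y],
    ]


def camera_center(qvec, tvec):
    """Camera center in world coordinates, C = -R^T t (assumes world-to-camera pose)."""
    R = qvec_to_rotmat(qvec)
    return tuple(-sum(R[r][c] * tvec[r] for r in range(3)) for c in range(3))


# The cameras.txt example line from the notes above.
cam = parse_camera_line("1 PINHOLE 1024 575 1467.12 1467.12 512 287.5")
fx, fy, cx, cy = cam["params"]

# Hypothetical images.txt pose line (identity rotation), for illustration only.
img = parse_image_pose_line("1 1 0 0 0 1 2 3 1 frame_0001.png")
center = camera_center(img["qvec"], img["tvec"])
```

The second line of each images.txt entry (the POINTS2D[] triples) is left out here; it could be parsed the same way by splitting the line and grouping the values in threes.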

Step 4. Dense Reconstruction.

With the sparse model available, denser scene geometry can now be recovered using Multi-View Stereo (MVS):
1. undistort the images
2. compute the depth and normal maps using stereo
3. fuse the depth and normal maps to a point cloud
4. [optional] point cloud meshing step (estimate a dense surface from the fused point cloud using Poisson or Delaunay reconstruction)

The output formats of the sparse and dense reconstructions are explained at https://colmap.github.io/format.html.

Database

COLMAP stores all extracted information in a single SQLite database file.
The database can be accessed with the database management toolkit in the COLMAP GUI or with the provided C++ database API.
The database contains the following tables:
- cameras
- images
- keypoints
  - If the keypoints have 4 columns, then the feature geometry is a similarity and the third column is the scale and the fourth column the orientation of the feature
  - If the keypoints have 6 columns, then the feature geometry is an affinity and the last 4 columns encode its affine shape
- descriptors (128-D)
- matches
  - Feature matching stores its output in the matches table and geometric verification in the two_view_geometries table.
  - COLMAP only uses the data in two_view_geometries for reconstruction.
  - Every entry in the two tables stores the feature matches between two unique images, where the pair_id is the row-major, linear index in the upper-triangular match matrix
- two_view_geometries
  - The F, E, H blobs in the two_view_geometries table are stored as 3x3 matrices in row-major float64 format.
  - The meaning of the config values is documented in the src/estimators/two_view_geometry.h source file.
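
The pair_id indexing and the matrix blobs can be sketched in a few lines of plain Python. The conversion below follows the formula given in COLMAP's database documentation, with 2147483647 (2^31 - 1) as the maximum number of images. The blob decoder assumes little-endian byte order, which holds on typical x86 and ARM machines but is an assumption here, and `blob_to_matrix3x3` is a made-up helper name.

```python
import struct

MAX_IMAGE_ID = 2**31 - 1  # 2147483647, the constant used in COLMAP's pair_id formula


def image_ids_to_pair_id(image_id1, image_id2):
    """Map an unordered image pair to its linear index in the upper-triangular match matrix."""
    if image_id1 > image_id2:
        image_id1, image_id2 = image_id2, image_id1
    return image_id1 * MAX_IMAGE_ID + image_id2


def pair_id_to_image_ids(pair_id):
    """Invert image_ids_to_pair_id; returns (smaller_id, larger_id)."""
    image_id2 = pair_id % MAX_IMAGE_ID
    image_id1 = (pair_id - image_id2) // MAX_IMAGE_ID
    return image_id1, image_id2


def blob_to_matrix3x3(blob):
    """Decode an F, E, or H blob: nine float64 values in row-major order.

    Assumption: the blob is little-endian.
    """
    values = struct.unpack("<9d", blob)
    return [list(values[row * 3:(row + 1) * 3]) for row in range(3)]


pair_id = image_ids_to_pair_id(7, 3)         # the order of the two ids does not matter
ids = pair_id_to_image_ids(pair_id)          # recovers the sorted pair
demo_blob = struct.pack("<9d", *range(9))    # stand-in for a blob read from the database
matrix = blob_to_matrix3x3(demo_blob)
```

In practice the blob would come from a row of two_view_geometries read through Python's sqlite3 module rather than from `struct.pack`.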

Readings

sghong977 commented 1 year ago

In the preprocessing step of NeuMan, there are 6 command lines that use COLMAP. My goal is just to understand these commands:

feature_extractor
exhaustive_matcher
mapper
image_undistorter
patch_match_stereo
model_converter

colmap feature_extractor --database_path ./recon/db.db --image_path ./raw_720p --ImageReader.mask_path ./raw_masks --SiftExtraction.estimate_affine_shape=true --SiftExtraction.domain_size_pooling=true --ImageReader.camera_model SIMPLE_RADIAL --ImageReader.single_camera 1

colmap exhaustive_matcher --database_path ./recon/db.db --SiftMatching.guided_matching=true

mkdir -p ./recon/sparse

colmap mapper --database_path ./recon/db.db --image_path ./raw_720p --output_path ./recon/sparse

if [ -d "./recon/sparse/1" ]; then echo "Bad reconstruction"; exit 1; else echo "Ok"; fi

mkdir -p ./recon/dense

colmap image_undistorter --image_path raw_720p --input_path ./recon/sparse/0/ --output_path ./recon/dense

colmap patch_match_stereo --workspace_path ./recon/dense

colmap model_converter --input_path ./recon/dense/sparse/ --output_path ./recon/dense/sparse --output_type=TXT
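
The `if [ -d "./recon/sparse/1" ]` line in the middle of this sequence is worth a note. The mapper writes each reconstructed component to its own numbered subfolder (0, 1, ...), so the existence of sparse/1 means the images were split into more than one model, which this script treats as a failed reconstruction. A minimal Python sketch of the same check follows; the function names are illustrative and not part of COLMAP or NeuMan.

```python
from pathlib import Path


def count_models(sparse_dir):
    """Count numbered model subfolders (0, 1, ...) inside the mapper output folder."""
    return sum(
        1 for entry in Path(sparse_dir).iterdir()
        if entry.is_dir() and entry.name.isdigit()
    )


def is_single_model(sparse_dir):
    """True when the mapper produced exactly one model, i.e. only sparse/0 exists."""
    return count_models(sparse_dir) == 1
```

For example, `is_single_model("./recon/sparse")` could be called after the mapper step and before image_undistorter, which reads from ./recon/sparse/0/.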

sghong977 commented 1 year ago

I'm going to replace COLMAP's feature extraction and matching with a modern feature-matching algorithm, LoFTR.

LoFTR: Detector-Free Local Feature Matching with Transformers

project page: https://zju3dv.github.io/loftr/

What does LoFTR do? It first establishes dense matches at a coarse level and then refines them at a fine level, instead of following the traditional steps of feature detection -> extraction -> matching.

How does the model work?

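LoFTR is a transformer-based model that produces feature descriptors.
It uses self- and cross-attention layers, so the descriptors are conditioned on both images.
Two feature maps (coarse and fine) are extracted for each image.
At the coarse level, it calculates a confidence matrix of feature pairs (how strongly each coordinate in one image's feature map corresponds to each coordinate in the other's; computed at 1/8 of the image size).
The fine level works similarly. It is described as correlation-based matching: for each query point i there is a heatmap showing where the correspondence is strongest, and the point seems to be matched to the location with the highest softmax value.
Is it supervised learning? How can the model generate fine descriptors? -> Yes, it is supervised. There are losses at both the coarse and fine levels; the coarse loss seems to use information such as depth maps, and the fine loss seems to use data with explicitly given keypoints. Datasets like MegaDepth also provide COLMAP sparse reconstructions.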
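
To make the coarse-level confidence matrix concrete, here is a toy sketch in plain Python. It assumes the dual-softmax variant described in the LoFTR paper (softmax over rows times softmax over columns of a score matrix), followed by mutual-nearest-neighbor selection with a confidence threshold. The score matrix, the threshold of 0.2, and all function names are illustrative only; the real model works on learned features with tensors, not Python lists.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dual_softmax(scores):
    """Confidence matrix: softmax over each row times softmax over each column."""
    by_row = [softmax(row) for row in scores]
    by_col = [softmax(list(col)) for col in zip(*scores)]  # by_col[j][i]
    return [
        [by_row[i][j] * by_col[j][i] for j in range(len(scores[0]))]
        for i in range(len(scores))
    ]


def mutual_matches(conf, threshold=0.2):
    """Keep (i, j) pairs that are each other's best match and exceed the threshold."""
    matches = []
    for i, row in enumerate(conf):
        j = max(range(len(row)), key=lambda col: row[col])
        best_i = max(range(len(conf)), key=lambda r: conf[r][j])
        if best_i == i and row[j] > threshold:
            matches.append((i, j))
    return matches


# Toy similarity scores between 3 coarse cells of image A (rows) and image B (columns).
scores = [
    [5.0, 0.0, 0.0],
    [0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0],  # an ambiguous cell: no clear partner
]
confidence = dual_softmax(scores)
matches = mutual_matches(confidence)
```

Here cells 0 and 1 each have one clear partner and survive, while the ambiguous third cell ends up with a confidence of about 0.11 and is dropped by the threshold.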

Then how can I use this model? Is a pretrained model provided?

Yes: the kornia library includes LoFTR since version 0.5.11 (pip install kornia). Tutorial: https://kornia-tutorials.readthedocs.io/en/latest/image_matching.html

Let's make a neuman docker container and execute the following commands:

pip install kornia

pip install kornia_moons

pip install opencv-python --upgrade

wget https://github.com/kornia/data/raw/main/matching/kn_church-2.jpg

wget https://github.com/kornia/data/raw/main/matching/kn_church-8.jpg

See the feature_custom.py script I wrote.
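
That script is not reproduced here. As a rough idea of what matching the two downloaded images with kornia could look like, here is a minimal sketch, not the actual feature_custom.py. It assumes `kornia.feature.LoFTR` with the "outdoor" pretrained weights and the output keys `keypoints0`, `keypoints1`, and `confidence`, as used in the kornia tutorial linked above. The helper names, the 640x480 resize, and the 0.5 threshold are arbitrary choices for illustration.

```python
def keep_confident(kpts0, kpts1, conf, threshold=0.5):
    """Keep matched point pairs whose confidence exceeds the threshold."""
    return [(p0, p1) for p0, p1, c in zip(kpts0, kpts1, conf) if c > threshold]


def load_gray(path, size=(640, 480)):
    """Load an image as a 1x1xHxW float tensor in [0, 1]; size is (width, height).

    Both dimensions are kept divisible by 8 because of the 1/8-resolution coarse level.
    """
    import cv2    # imported lazily so the pure helper above works without these packages
    import torch

    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(path)
    img = cv2.resize(img, size)
    return torch.from_numpy(img)[None, None].float() / 255.0


def match_pair(path0, path1):
    """Run pretrained LoFTR on two images; returns keypoints and confidences as lists."""
    import torch
    import kornia.feature as KF

    matcher = KF.LoFTR(pretrained="outdoor").eval()  # weights are downloaded on first use
    batch = {"image0": load_gray(path0), "image1": load_gray(path1)}
    with torch.no_grad():
        out = matcher(batch)
    return (
        out["keypoints0"].tolist(),
        out["keypoints1"].tolist(),
        out["confidence"].tolist(),
    )
```

Calling `match_pair("kn_church-2.jpg", "kn_church-8.jpg")` and passing the result to `keep_confident` would give a filtered list of point pairs. Note that the keypoints are in the coordinates of the resized images, so they would need to be scaled back before being used with the original frames.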

