gkiavash / Master-Thesis-Structure-from-Motion


Study papers 1: SfM, BA, and feature matching #2

Closed gkiavash closed 1 year ago

gkiavash commented 1 year ago

* : additional resources

gkiavash commented 1 year ago

Multi-View Optimization of Local Feature Geometry

3) Method

  1. Overview
    • Build a tentative-matches graph G = (V, E) with keypoints as nodes and matches as edges, optionally weighted (e.g., by the cosine similarity of descriptors).
    • For each edge (match), perform a two-view refinement using a patch alignment network. This network annotates the edges of the tentative-matches graph with geometric transformations Tu→v, Tv→u.
    • Partition the graph into components (i.e., feature tracks) and find a global consensus by optimizing a non-linear least-squares problem over the keypoint locations, given the estimated two-view transformations.
  2. Two-view refinement
    • Given local patches Pu, Pv around the initial keypoint locations u, v ∈ R², the network predicts the flow du→v of the central pixel from one patch to the other, and vice versa as dv→u.
    • A Siamese architecture is used for feature extraction, followed by a correlation layer computing dot-product similarity.
    • Each patch has dimensions h×w; a CNN produces a d-dimensional descriptor per pixel (h×w×d).
    • Dot-product similarity between the two patches yields a correlation volume in R^(hw×hw).
    • The new displacement is regressed from this correlation volume.
  3. Multi-view refinement

    • Displacement chains without loops can accumulate large errors.
    • Create a track for each 3D point across all images.
    • Feature matching is imperfect: a single incorrect match can merge two tracks, so the connected components are partitioned into smaller, more reliable subsets based on descriptor cosine similarity.
    • Approximate the full flow field between two patches by a 3 × 3 displacement grid and use bi-square interpolation.
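The two-view refinement above (per-pixel descriptors, dot-product correlation, displacement readout) can be sketched in a few lines of NumPy. This is an illustrative simplification: a soft-argmax stands in for the paper's learned regressor, and the function name and shapes are assumptions, not the authors' code.

```python
import numpy as np

def softargmax_displacement(feat_u, feat_v):
    """Estimate the flow d_{u->v} of the central pixel of patch u.

    feat_u, feat_v: (h, w, d) dense per-pixel descriptors of the two patches.
    Returns a 2D displacement (dy, dx) in pixels relative to the patch centre.
    """
    h, w, d = feat_u.shape
    centre = feat_u[h // 2, w // 2]                  # descriptor of the central pixel
    # dot-product similarity of the centre against every pixel of the other patch
    scores = (feat_v.reshape(-1, d) @ centre).reshape(h, w)
    # soft-argmax: softmax-weighted average of pixel coordinates
    p = np.exp(scores - scores.max())
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    dy = (p * ys).sum() - h // 2
    dx = (p * xs).sum() - w // 2
    return np.array([dy, dx])
```

With a distinctive central descriptor the soft-argmax recovers the shift between patches; the real network regresses sub-pixel flow from the full hw×hw correlation volume rather than a single row of it.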

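The multi-view consensus step can be sketched as a linear least-squares problem once the two-view transformations are approximated by pure displacements (a simplification; the paper optimizes a non-linear objective over keypoint locations). The function name, the residual convention, and the gauge-fixing anchor are all illustrative assumptions:

```python
import numpy as np

def refine_track(n_nodes, edges):
    """Least-squares consensus for one feature track.

    Solves for per-keypoint corrections delta (one 2D vector per node) such
    that delta_v - delta_u ~= r_uv for every match (u, v), where r_uv is the
    residual displacement reported by the two-view refinement network.
    Node 0 is pinned (delta_0 = 0) to remove the global-translation freedom.

    edges: list of (u, v, r_uv, weight).
    Returns: (n_nodes, 2) array of refined corrections.
    """
    rows, rhs = [], []
    anchor = np.zeros(n_nodes)
    anchor[0] = 1.0                      # delta_0 = 0 fixes the gauge
    rows.append(anchor)
    rhs.append(np.zeros(2))
    for u, v, r_uv, w in edges:
        row = np.zeros(n_nodes)
        row[v], row[u] = w, -w           # w * (delta_v - delta_u) = w * r_uv
        rows.append(row)
        rhs.append(w * np.asarray(r_uv, dtype=float))
    A, b = np.stack(rows), np.stack(rhs)
    delta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return delta
```

Redundant edges (loops in the track) are what make this consensus useful: inconsistent pairwise displacements get averaged in the least-squares sense instead of accumulating along a chain.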
    Evaluation:

gkiavash commented 1 year ago

Deep Two-View Structure-from-Motion Revisited

  1. An optical flow estimation network that predicts dense correspondences between two frames;
  2. A normalized pose estimation module that computes relative camera poses from the 2D optical flow correspondences;
  3. A scale-invariant depth estimation network that leverages epipolar geometry to reduce the search space, refine the dense correspondences, and estimate relative depth maps.

Related Works

  1. Type 1: monocular depth estimation network and a pose regression network, self-supervisory
    • SfMLearner estimates a mask to exclude dynamic objects; GeoNet utilizes an optical flow module to mask out these outliers by comparing against the rigid flow.
  2. Type 2: require two image frames to estimate depth maps and camera poses at test time
    • DeMoN concatenates a pair of frames and uses multiple stacked encoder-decoder networks to regress camera poses and depth maps, implicitly utilizing multi-view geometry.


3) Method

Our method is able to find better matching points and therefore more accurate poses and depth maps, especially for textureless and occluded areas. At the same time, it follows the wisdom of classic methods to avoid ill-posed problems.

  1. Optical Flow Estimation
    • DICL-Flow is used to generate dense matching points between two consecutive frames. It uses a displacement-invariant matching cost learning strategy and a soft-argmin projection layer, ensuring the network learns dense matching points rather than regressing image flow.
  2. Essential Matrix Estimation
    • Use matching points to compute camera poses (previous deep learning-based methods regress the camera poses from input images)
    • The noisy dense matches from optical flow must be robustly filtered.
    • Using SIFT keypoint locations to generate a mask was found to work well across all datasets; only the optical flow matches at locations within the mask are kept, which avoids distraction from dynamic objects. Idea: optical flow is more accurate in texture-rich areas.
  3. Scale-Invariant Depth Estimation
    • Matching is performed again, with the search space reduced to the epipolar lines computed from the relative camera poses. This is similar to multi-view stereo (MVS) matching, with one important difference: the absolute scale is unknown at inference.
    • Plane-sweep-based networks require a consistent scale between training and testing.
  4. Loss Function:
    • Refer to the image:
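The classical step the paper revives, computing relative pose from matches instead of regressing it, can be sketched with the textbook normalized 8-point algorithm. This is a hedged stand-in for the paper's robust estimation pipeline; `essential_eight_point` is an illustrative name, and no outlier rejection is included:

```python
import numpy as np

def essential_eight_point(x1, x2):
    """Linear 8-point estimate of the essential matrix.

    x1, x2: (n, 2) matched points in *normalized* camera coordinates
            (K^{-1} already applied), n >= 8.
    Returns E with the rank-2, equal-singular-value constraint enforced.
    """
    n = x1.shape[0]
    h1 = np.hstack([x1, np.ones((n, 1))])
    h2 = np.hstack([x2, np.ones((n, 1))])
    # each row encodes x2^T E x1 = 0 as a linear equation in vec(E)
    A = np.einsum('ni,nj->nij', h2, h1).reshape(n, 9)
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)
    # project onto the essential-matrix manifold: singular values (1, 1, 0)
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt
```

In practice the dense-but-noisy flow matches would first be filtered (e.g., by the SIFT-location mask above) and the estimate wrapped in RANSAC before decomposing E into R, t.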

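The epipolar search-space reduction in the depth module can be illustrated by sweeping candidate depths along a pixel's viewing ray: every reprojection lands on the epipolar line, so matching becomes a 1D search. A minimal sketch under assumed pinhole geometry (function names are illustrative, not the paper's API):

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def candidates_along_epipolar(x1, K, R, t, depths):
    """Reproject pixel x1 of view 1 into view 2 at a sweep of depths.

    Because the absolute scale is unknown at test time, `depths` can be
    relative: scaling both the depths and the baseline t by the same factor
    yields the same candidate pixels along the epipolar line.
    """
    ray = np.linalg.inv(K) @ np.array([x1[0], x1[1], 1.0])
    out = []
    for z in depths:
        X2 = R @ (z * ray) + t          # point at depth z, in camera-2 frame
        p = K @ X2
        out.append(p[:2] / p[2])        # perspective projection into view 2
    return np.array(out)
```

Discretizing `depths` (relative, not metric) and scoring feature similarity at each candidate pixel is the scale-invariant analogue of plane-sweep MVS matching described above.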
4) Experiments

gkiavash commented 1 year ago

BA-NET: DENSE BUNDLE ADJUSTMENT NETWORKS

Solve SfM problem via feature-metric bundle adjustment (BA), which explicitly enforces multi-view geometry constraints in the form of feature-metric error
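The feature-metric error can be sketched as comparing learned feature maps at warped locations instead of raw pixel intensities. A minimal NumPy illustration: the bilinear sampler and the `warp` callback are simplifications (in BA-Net the warp is the differentiable projection parameterized by pose and depth, and the whole objective is minimized inside the network):

```python
import numpy as np

def bilinear(feat, x, y):
    """Sample a (H, W, C) feature map at a sub-pixel location (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    ax, ay = x - x0, y - y0
    return ((1 - ax) * (1 - ay) * feat[y0, x0]
            + ax * (1 - ay) * feat[y0, x0 + 1]
            + (1 - ax) * ay * feat[y0 + 1, x0]
            + ax * ay * feat[y0 + 1, x0 + 1])

def featuremetric_error(feat1, feat2, pts1, warp):
    """Sum of feature-space residuals ||F2(warp(p)) - F1(p)||^2.

    `warp` maps a pixel of image 1 into image 2; in bundle adjustment it
    would be the reprojection pi(T, d, p) driven by pose T and depth d.
    """
    err = 0.0
    for (x, y) in pts1:
        u, v = warp(x, y)
        r = bilinear(feat2, u, v) - bilinear(feat1, x, y)
        err += float(r @ r)
    return err
```

Minimizing this residual over poses and depths, rather than a photometric or reprojection residual, is what "feature-metric BA" refers to: the features are learned so the error surface is smoother than raw intensities.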

Introduction

3 BUNDLE ADJUSTMENT REVISITED

4 THE BA-NET ARCHITECTURE


5 EVALUATION

gkiavash commented 1 year ago

Pixel-Perfect Structure-from-Motion with Featuremetric Refinement

Related work

4. Approach

4.1. Featuremetric optimization

Direct alignment:

Learned representation:

4.2. Keypoint adjustment


4.3. Bundle adjustment

5. Experiments

gkiavash commented 1 year ago

Metrics: Known Ground Truth:

gkiavash commented 1 year ago

Deep Patch Visual Odometry

1. Introduction

3. Approach

Our approach has two main modules:

3.1. Feature and Patch Extraction

3.2. Update Operator