prstrive / UniMVSNet

[CVPR 2022] Rethinking Depth Estimation for Multi-View Stereo: A Unified Representation

Training and Inference Time #3

Closed: II-Matto closed this issue 2 years ago

II-Matto commented 2 years ago

Thanks for sharing this excellent work! Since I could not find a description of the training and inference time in the paper, I would like to ask how long it takes to train or fine-tune the model. Also, how long does it take to perform depth inference on 5 images (1 reference + 4 source)? It would be great if you could present the training and inference time together with a detailed description of the settings, e.g. the type and number of GPUs, the resolution of the input images, the size of the training set, the corresponding accuracy/completeness, etc. Many thanks.

prstrive commented 2 years ago

As stated in our paper, our Unification representation and Unified Focal Loss are memory-free and computation-free modules. Therefore, the computational cost and runtime of Unification+UFL are the same as those of our baseline CasMVSNet, and you can refer to Figure 1 in their paper. If you adopt the adaptive aggregation strategy, the training time per epoch increases by about 20 minutes in our experiments with an input size of 512x640 on two Tesla V100 GPUs. The corresponding performance will also improve to a certain extent; you can refer to our Table 3 for more detail.
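
For reference, runtimes like these are straightforward to measure locally. Below is a minimal PyTorch timing sketch, not the actual UniMVSNet forward pass: the model is a stand-in module and the 1-reference + 4-source input at 512x640 is an illustrative assumption, so only the timing pattern (warm-up followed by CUDA synchronization around the timed loop) carries over to the real network.

```python
# Minimal GPU timing sketch. The model and input shapes are placeholders,
# not the UniMVSNet network; only the measurement pattern is the point.
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the MVS network: any nn.Module works for demonstration.
model = torch.nn.Conv2d(3, 32, 3, padding=1).to(device).eval()

# Illustrative 5-view sample (1 reference + 4 source) at 512x640 resolution.
imgs = torch.randn(1, 5, 3, 512, 640, device=device)

with torch.no_grad():
    # Warm-up iterations so CUDA kernels are initialized before timing.
    for _ in range(5):
        _ = model(imgs.view(-1, 3, 512, 640))

    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(20):
        _ = model(imgs.view(-1, 3, 512, 640))
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for all kernels before reading the clock
    elapsed = (time.perf_counter() - start) / 20

print(f"average forward time per sample: {elapsed * 1000:.2f} ms")
```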

II-Matto commented 2 years ago

Thanks for your reply. One more question: do the new task formulation and the loss affect the convergence speed of training? Currently, the number of training epochs seems to remain the same as for CasMVSNet.

prstrive commented 2 years ago

In fact, training for 12 epochs already yields similar performance, so we think the convergence speed is at least not slower than that of our baseline.