Open puyiwen opened 1 year ago
Thank you for your great work!! I want to know the params and GFLOPs of the model. Can you help me? Thank you very much!!
Thanks for appreciating our work. Currently, we have only released the largest models (BEiT-L backbone), with around 340 million params in total. We will be releasing other, lighter models, including mobile versions, soon, but the timeline is not fixed as of now.
Regarding GFLOPS, I am not aware of a reliable way to calculate FLOPs on your hardware.
Thank you for your reply. I mean the GFLOPs of the model; perhaps torchstat can be used to calculate the GFLOPs of a PyTorch model.
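For reference, a minimal sketch of counting parameters and estimating GFLOPs for a PyTorch model with fvcore (torchstat would work similarly); the `ZoeD_NK` hub entry point and the 384x512 input size are assumptions here and may need to be adjusted.

```python
import torch
from fvcore.nn import FlopCountAnalysis  # torchstat's stat(model, (3, H, W)) is an alternative

# Load a ZoeDepth model via torch.hub (entry-point name is an assumption; adjust if needed)
model = torch.hub.load("isl-org/ZoeDepth", "ZoeD_NK", pretrained=True).eval()

# Parameter count does not depend on the input size
n_params = sum(p.numel() for p in model.parameters())
print(f"Params: {n_params / 1e6:.1f} M")

# FLOPs depend on the input resolution; 384x512 is just an example size
dummy = torch.randn(1, 3, 384, 512)
print(f"GFLOPs: {FlopCountAnalysis(model, dummy).total() / 1e9:.1f}")
```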
And I have another question: I have tested this model on my own real scenes, and it works very well. Why does this model generalize so well? Is it because of mixing with relative depth?
Hi @puyiwen, there are two main reasons:
- relative depth pre-training on a large dataset
- automatic routing to different heads for indoor and outdoor scenes
Thank you for your reply. I have always wondered why generalization in monocular depth estimation is worse than in object recognition. Is it because the datasets are not large enough, or because the tasks are fundamentally different?
Datasets for dense prediction tasks are much smaller since ground truth is harder to obtain (per-pixel annotation vs. just a bounding box). Object recognition (at least closed-set) also has a fixed, small set of classes, compared to a full image of continuous per-pixel values for depth estimation.
Thank you very much! I also want to know which data augmentation methods are used for this model. I am quite confused about the RandomCrop augmentation: each picture is randomly cropped, and the cropped picture looks partially enlarged to the human eye, but the real distances do not change. If a close object is cropped, the cropped RGB image can look similar to an RGB image cropped from a distant scene, even though the depths are very different. Wouldn't training the model this way conflict with how the human eye reasons about depth?
We don't use RandomCrop (the flag is set to False). However, the problem of varying camera parameters that you mention still exists. This makes monocular depth estimation highly ill-posed: multiple combinations of real scenes and camera settings can potentially lead to the same image. This is also one of the reasons why training for metric depth on multiple datasets (with different camera settings) is a hard problem. One may try to include camera metadata as an input or try to estimate the parameters, but that's not our focus in this paper. Also, one can still use RandomCrop by using absolute position encodings, evaluating at the training resolution, or using a sliding-window prediction (sketched below), etc.
There are, however, some visual cues that can hint at the "level of zoom" and the camera settings: for example, the "amount" of context present in the image, along with distortions, aberrations, texture gradients, the amount of blur and its gradient, etc. You can pick up some of these cues yourself by watching "dolly zoom effect" videos. A neural network can, in principle, learn to exploit these cues from the data and build a strong prior to get around this problem.
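To illustrate the sliding-window option mentioned above, here is a rough sketch (not code from the repository): it tiles the input at an assumed training resolution of 384x512, runs the model on each tile, and averages overlapping predictions. The `model.infer(tensor) -> depth` interface is an assumption; adapt it to the actual model.

```python
import torch

def sliding_window_depth(model, image, win_h=384, win_w=512, stride_h=192, stride_w=256):
    """Tile `image` (1, 3, H, W) at the training resolution, predict depth per tile,
    and average overlapping predictions. Assumes `model.infer(tile)` returns a
    (1, 1, h, w) depth map (hypothetical interface)."""
    _, _, H, W = image.shape
    depth_sum = torch.zeros(1, 1, H, W)
    weight = torch.zeros(1, 1, H, W)

    # Top-left corners of the tiles; make sure the bottom/right borders are covered.
    ys = sorted(set(list(range(0, max(H - win_h, 0) + 1, stride_h)) + [max(H - win_h, 0)]))
    xs = sorted(set(list(range(0, max(W - win_w, 0) + 1, stride_w)) + [max(W - win_w, 0)]))

    with torch.no_grad():
        for y in ys:
            for x in xs:
                tile = image[:, :, y:y + win_h, x:x + win_w]
                pred = model.infer(tile)
                depth_sum[:, :, y:y + win_h, x:x + win_w] += pred
                weight[:, :, y:y + win_h, x:x + win_w] += 1.0

    return depth_sum / weight.clamp(min=1.0)
```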
@shariqfarooq123 Any update on a timeline for when the smaller models will be released? I'm working on a project that would benefit from the smaller models, even at a lower resolution.
Hi @shariqfarooq123,
As you know, we can run ZoeDepth on KITTI and get nice metric results. The question is: how do we apply the KITTI fine-tuned model to an arbitrary image with a different resolution and focal length than KITTI? For example, if I crop the KITTI images (field-of-view modification) or resize them (focal-length modification), I may get wrong metric results. Is there any systematic approach to compensate for these scenarios?
For example, is something like this valid here? If yes, we could probably compensate for the focal-length difference: focal_length_GT / depth_GT = focal_length_test / depth_test
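A minimal sketch of the focal-length compensation suggested above, assuming a pinhole camera and that the prediction is only off by this scale factor; the reference focal length of ~721 px for KITTI and the function name are illustrative, not from the repository.

```python
def compensate_focal_length(depth_pred, f_test, f_train=721.5):
    """Rescale metric depth from a model trained with focal length `f_train`
    (in pixels) to a test camera with focal length `f_test`, following the
    proportionality suggested above:

        depth_test = depth_pred * (f_test / f_train)

    This is only a first-order correction: it ignores lens distortion,
    principal-point shifts, and any scene priors the network has learned.
    """
    return depth_pred * (f_test / f_train)
```

Note that resizing an image changes its effective focal length in pixels by the same factor, so a half-resolution test image would use `f_test = 0.5 * f_original`.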
Curious about any update on the mobile version?
I've been patiently waiting too :)