isl-org / ZoeDepth

Metric depth estimation from a single image
MIT License

About model size #5

Open puyiwen opened 1 year ago

puyiwen commented 1 year ago

Thank you for your great work!! I would like to know the number of parameters and the GFLOPs of the model. Can you help me? Thank you very much!!

shariqfarooq123 commented 1 year ago

Thanks for appreciating our work. Currently, we have only released the largest models (BEiT-L backbone), which have around 340 million parameters in total. We will be releasing lighter models, including mobile versions, soon, but the timeline is not fixed as of now.

Regarding GFLOPs, I am not aware of a reliable way to calculate FLOPs on your hardware.

puyiwen commented 1 year ago

Thank you for your reply. By the GFLOPs of the model, I mean using something like torchstat to calculate the GFLOPs of the PyTorch model.
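For reference, a minimal sketch of what that measurement could look like, assuming the `torch.hub` entrypoint from the repo README and the third-party `torchstat` package; torchstat may not support every op in the BEiT backbone, so treat the FLOPs figure as a rough estimate:

```python
# Minimal sketch: count parameters exactly and estimate FLOPs with torchstat.
# The hub entrypoint name ("ZoeD_N") follows the ZoeDepth README; the input
# resolution below is only an example and should match your use case.
import torch
from torchstat import stat  # pip install torchstat

model = torch.hub.load("isl-org/ZoeDepth", "ZoeD_N", pretrained=True).eval()

# Exact parameter count:
n_params = sum(p.numel() for p in model.parameters())
print(f"params: {n_params / 1e6:.1f} M")

# Rough FLOPs estimate for a single 3x384x512 input (no batch dimension):
stat(model, (3, 384, 512))
```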

puyiwen commented 1 year ago

And I have another question. I have tested this model on my own real scenes, and it works very well. Why does this model generalize so well? Is it because of mixing with relative depth?

thias15 commented 1 year ago

Hi @puyiwen, there are two main reasons: 1) relative depth pre-training on a large dataset 2) automatic routing to different heads for indoor and outdoor scenes
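For intuition, here is a minimal sketch of what such routing can look like; this is a schematic under assumed module names (the backbone, the two heads, and the router are all placeholders), not ZoeDepth's actual code, which routes on latent features of the relative-depth backbone:

```python
# Schematic of "automatic routing": a lightweight classifier over pooled
# backbone features picks an indoor or outdoor metric head per image.
import torch
import torch.nn as nn

class RoutedDepthModel(nn.Module):
    def __init__(self, backbone, indoor_head, outdoor_head, feat_dim):
        super().__init__()
        self.backbone = backbone                  # shared relative-depth encoder
        self.heads = nn.ModuleList([indoor_head, outdoor_head])
        self.router = nn.Linear(feat_dim, 2)      # scores for indoor vs. outdoor

    def forward(self, x):
        feats = self.backbone(x)                  # (B, feat_dim, H, W)
        pooled = feats.mean(dim=(2, 3))           # global context for routing
        head_idx = self.router(pooled).argmax(dim=-1)
        # Route each image in the batch to its selected metric head.
        depths = [self.heads[i](f.unsqueeze(0)) for i, f in zip(head_idx.tolist(), feats)]
        return torch.cat(depths, dim=0)
```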

puyiwen commented 1 year ago

> Hi @puyiwen, there are two main reasons:
>
> 1. relative depth pre-training on a large dataset
> 2. automatic routing to different heads for indoor and outdoor scenes

Thank you for your reply. I have always wondered why monocular depth estimation generalizes worse than object recognition. Is it because the datasets are not large enough, or because the tasks are fundamentally different?

thias15 commented 1 year ago

Datasets for dense prediction tasks are much smaller, since ground truth is harder to obtain (per-pixel annotation vs. just a bounding box). Object recognition (at least closed-set) also has a fixed, small set of classes, compared to a full image of continuous per-pixel values for depth estimation.

puyiwen commented 1 year ago

> Datasets for dense prediction tasks are much smaller, since ground truth is harder to obtain (per-pixel annotation vs. just a bounding box). Object recognition (at least closed-set) also has a fixed, small set of classes, compared to a full image of continuous per-pixel values for depth estimation.

Thank you very much! I also want to know which data augmentation methods are used for this model. I am particularly confused about RandomCrop. Each picture is randomly cropped; to the human eye the crop looks like a zoomed-in view of part of the scene, but the real distances do not change. If a close object is cropped, the cropped RGB image can look similar to an RGB image of the same object taken from farther away, yet the depths are very different. Would training the model this way not conflict with how human vision reasons about distance?

shariqfarooq123 commented 1 year ago

We don't use RandomCrop (the flag is set to False). However, the problem of varying camera parameters that you mention still exists. This makes monocular metric depth estimation highly ill-posed, as there can be multiple real scenes + camera settings that could lead to the same image. This is also one of the reasons that training for metric depth on multiple datasets (with different camera settings) is a hard problem. One could try to include camera metadata as input, or try to estimate the parameters, but that is not our focus in this paper. Also, one can still use RandomCrop by using absolute position encodings, evaluating at the training resolution, or using sliding-window prediction, etc.
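As an illustration of the sliding-window option, here is a minimal inference sketch; it assumes a `model` that maps a `(B, 3, h, w)` image tensor to a `(B, 1, h, w)` depth map, and it is not ZoeDepth's actual inference code:

```python
# Minimal sliding-window depth inference sketch: run the model on overlapping
# crops at the training resolution and average the overlapping predictions.
import torch

def sliding_window_depth(model, image, win=(384, 512), stride=(192, 256)):
    _, _, H, W = image.shape
    wh, ww = win
    depth_sum = torch.zeros(1, 1, H, W, device=image.device)
    counts = torch.zeros_like(depth_sum)
    for top in range(0, max(H - wh, 0) + 1, stride[0]):
        for left in range(0, max(W - ww, 0) + 1, stride[1]):
            crop = image[:, :, top:top + wh, left:left + ww]
            with torch.no_grad():
                pred = model(crop)                           # (1, 1, wh, ww)
            depth_sum[:, :, top:top + wh, left:left + ww] += pred
            counts[:, :, top:top + wh, left:left + ww] += 1
    # Pixels never covered by a window (image edges when sizes do not divide
    # evenly) are left at zero in this simplified version.
    return depth_sum / counts.clamp(min=1)
```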

There are, however, some visual cues that can hint at the "level of zoom" and camera settings: for example, the "amount" of context present in the image, along with distortions, aberrations, texture gradient, the amount of blur and its gradient, etc. You can spot some of these cues yourself by watching "Dolly zoom effect" videos. A neural network can, in principle, learn to exploit these cues from the data and build a strong prior to get around this problem.

Syazvinski commented 1 year ago

@shariqfarooq123 Any update on a timeline for when the smaller models will be released? I'm working on a project that would benefit from the smaller models, even at a lower resolution.

hgolestaniii commented 8 months ago

> We don't use RandomCrop (the flag is set to False). However, the problem of varying camera parameters that you mention still exists. This makes monocular metric depth estimation highly ill-posed, as there can be multiple real scenes + camera settings that could lead to the same image. This is also one of the reasons that training for metric depth on multiple datasets (with different camera settings) is a hard problem. One could try to include camera metadata as input, or try to estimate the parameters, but that is not our focus in this paper. Also, one can still use RandomCrop by using absolute position encodings, evaluating at the training resolution, or using sliding-window prediction, etc.
>
> There are, however, some visual cues that can hint at the "level of zoom" and camera settings: for example, the "amount" of context present in the image, along with distortions, aberrations, texture gradient, the amount of blur and its gradient, etc. You can spot some of these cues yourself by watching "Dolly zoom effect" videos. A neural network can, in principle, learn to exploit these cues from the data and build a strong prior to get around this problem.

Hi @shariqfarooq123,

As you know, we can run ZoeDepth on KITTI and get nice metric results. The question is: how do we apply the KITTI fine-tuned model to an arbitrary image with a different resolution and focal length than KITTI? For example, if I crop the KITTI images (field-of-view modification) or resize them (focal-length modification), I may get wrong metric results. Is there a systematic approach to compensate for these scenarios?

For example, is something like this valid here? If yes, we could probably compensate for the focal-length difference: focal_length_GT / depth_GT = focal_length_test / depth_test, i.e. depth_test = depth_GT * focal_length_test / focal_length_GT.
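For concreteness, a minimal sketch of the compensation that relation implies; this is the heuristic proposed in the question, not something validated in the paper, and the default KITTI focal length below is an assumed approximate value:

```python
# Sketch of the proposed focal-length compensation: rescale predicted metric
# depth by the ratio of the test camera's focal length to the focal length the
# model was trained with. Purely a heuristic from this discussion.
KITTI_FOCAL_PX = 721.5  # approximate KITTI focal length in pixels (assumed here)

def compensate_focal_length(depth_pred, f_test_px, f_train_px=KITTI_FOCAL_PX):
    # depth_test = depth_GT * focal_length_test / focal_length_GT
    return depth_pred * (f_test_px / f_train_px)
```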

Haiyan-Chris-Wang commented 6 months ago

Curious about any update on the mobile version?

Syazvinski commented 6 months ago

I've been patiently waiting too :)