Open IneedhelpRr opened 5 months ago
The structure of the MiDaS model is described in their preprint paper, including a diagram (figure 1) on page 3. I also have a description of the structure (DPT) on github along with the code.
Yes, I tried to read the article. Is it based on a ResNet encoder, plus a series of loss functions for the prediction? What I want to know more about is its model structure: what are its convolutional layers? I also read about your DPT structure, and as you know, I don't know much about computer vision research, so it's still somewhat difficult for me to understand. My goal is to understand the model structure of MiDaS and to try to train a model related to my own domain for research based on it. But it looks difficult.
I've also looked at depthanything before, but that one is too difficult for me to understand, so I tried to understand the previous versions of MiDaS instead. I just noticed that you uploaded muggled_dpt, which also piqued my interest. Maybe you can clear up my confusion, if you have time.
is it based on the ResNet encoder
All of the newer MiDaS models (version 3 and 3.1) switched to using a 'vision transformer' instead of a ResNet to encode input images, though the rest of the DPT structure still uses convolutions. Vision transformers work quite a bit differently from convolutional models. If you aren't familiar with them, I'd recommend the original paper that introduced them: "An Image is Worth 16x16 Words". There's also a very big guide on transformers more generally (more for text than images) called "The Illustrated Transformer".
The DPT model structure consists of 4 parts: the first part is the patch embedding and vision transformer, which generates lists of vectors (also called tokens) from the input image. The second part (called reassembly) takes the lists of vectors and reshapes them back into image-like data (like a grid of pixels). The third part (called fusion) combines the reassembled image data and also performs convolutions on the results. The fourth part (called the head) just does more convolution to generate the final depth output. Each of the parts also includes scaling/resizing steps, but these are hard-coded into the model (they're not something that needs to be learned by the model).
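To make that layout a little more concrete, here is a rough PyTorch-style sketch of how those 4 parts connect in a forward pass. The names used here (patch_embed, image_encoder, reassemble_blocks, fusion, head) are placeholders for illustration only, not the actual class names in MiDaS or my repo:

```python
import torch.nn as nn

class DPTSketch(nn.Module):
    """Rough outline of the 4-part DPT layout described above.
    All sub-module names are placeholders, not a real MiDaS/muggled_dpt API."""

    def __init__(self, patch_embed, image_encoder, reassemble_blocks, fusion, head):
        super().__init__()
        self.patch_embed = patch_embed      # image -> tokens
        self.image_encoder = image_encoder  # vision transformer (multi-stage outputs)
        self.reassemble_blocks = nn.ModuleList(reassemble_blocks)  # tokens -> image-like grids
        self.fusion = fusion                # combine grids, with convolutions
        self.head = head                    # final convolutions -> depth map

    def forward(self, image_bchw):
        # Part 1: patch embedding + vision transformer, which produces
        # lists of vectors (tokens) taken from several transformer stages
        tokens = self.patch_embed(image_bchw)
        multi_stage_tokens = self.image_encoder(tokens)

        # Part 2: reassembly, reshape each stage's tokens back into 2D image-like data
        feature_maps = [block(t) for block, t in zip(self.reassemble_blocks, multi_stage_tokens)]

        # Part 3: fusion, combine the reassembled maps and convolve the result
        fused = self.fusion(feature_maps)

        # Part 4: head, a few more convolutions to produce the final depth output
        return self.head(fused)
```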
That original MiDaS preprint actually has figures in the appendix (page 12) showing the structure of the convolutional steps performed inside the 'fusion' part of the model (figure (a), which they call a 'Residual Convolutional Unit') as well as the convolutions performed inside the 'head' part of the model (figure (b)).
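That residual convolutional unit is simple enough to sketch directly. Here's a minimal version, assuming two 3x3 convolutions with ReLU activations around a skip connection (exact details, like whether batch norm is included, vary between MiDaS versions):

```python
import torch.nn as nn

class ResidualConvUnitSketch(nn.Module):
    """Minimal sketch of the 'Residual Convolutional Unit' from the MiDaS appendix figure:
    two 3x3 convolutions with ReLU activations, plus a residual (skip) connection."""

    def __init__(self, num_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(num_channels, num_channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(num_channels, num_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv1(self.relu(x))
        out = self.conv2(self.relu(out))
        return x + out  # residual connection
```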
There are a lot of pieces to the DPT model, but I think if you try to understand each one individually (i.e. understanding just the reassembly model on its own, or the fusion model on its own), it's much easier to make sense of the whole thing.
I've also looked at depthanything before, but that one is too difficult for me to understand
I actually think the depthanything implementation is simpler than any of the MiDaS models (though they all share the same DPT structure). On the muggleDPT repo, there is separate code for each of the model components: patch embedding, vision transformer, reassembly model, fusion model and head model. If you're comfortable looking through code, the forward function of each of these components describes exactly what that part of the model actually does; I think that's the best way to understand how any model works.
How do I train my own model, based on that
In theory, any 'typical' training loop should work with these DPT models. However, doing a good job with training is generally a difficult thing to get right, and there are entire research papers devoted to it. It's basically a PhD thesis topic at the moment. For example, the depth-anything paper is like this: it's almost entirely focused on how to do a better job of training these models rather than on the model structure. So it can be very difficult to understand!
There's surprisingly little example code available for training these types of models (at least, I haven't found much). The only one I know of is for the original ZoeDepth models and the related depth-anything v1 metric depth and v2 metric depth models. So I'd recommend starting with that code to get an idea of how to handle the training of the models, as well as reading the original MiDaS paper that describes the training procedure (starts on page 5), and the first depth-anything paper which describes a similar procedure (starts on page 3).
Alternatively, the Marigold repo (very accurate, but slower than DPT models) released training code, which you might want to check out as well (if you don't specifically need a DPT model).
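Just to give a sense of what I mean by a 'typical' training loop, here's a bare-bones sketch. It assumes you already have a DPT-style model mapping an image batch to a depth prediction and a dataloader yielding (image, ground-truth depth) pairs, and it uses a plain L1 loss purely for illustration; the actual MiDaS/depth-anything training uses the scale- and shift-invariant losses described in their papers, which you'd substitute in:

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, dataloader, optimizer, device="cuda"):
    """One pass over the dataset with a generic supervised training loop.
    The L1 loss here is a stand-in for the paper's actual loss functions."""
    model.train()
    for images, gt_depth in dataloader:
        images, gt_depth = images.to(device), gt_depth.to(device)

        pred_depth = model(images)
        loss = F.l1_loss(pred_depth, gt_depth)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```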
We can add contact information to facilitate better communication.
I just noticed the metric-depth code in depthanything. Is it possible to use that to train a model of my own on top of their work?