Closed ChambinLee closed 1 month ago
Hi, thank you for noticing our work :). Indeed, when there is a lack of shared information across clips, the model can potentially struggle with inconsistent predictions. To address this, we've employed a sliding window strategy to facilitate the exchange of temporal information between clips. I would recommend referring to Section 3.3 in our paper https://arxiv.org/pdf/2406.01493 for a more detailed explanation. Additionally, we've provided ablation results concerning the overlap between clips for further insights.
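For readers following along, here is a minimal sketch of how overlapping clip ranges for a sliding window could be computed. It only illustrates the indexing, not the implementation from the paper: the function name `sliding_window_clips` and the parameters `clip_len` and `overlap` are hypothetical, and how the shared frames are actually used to pass information between clips is described in Section 3.3.

```python
def sliding_window_clips(num_frames, clip_len, overlap):
    """Return (start, end) frame ranges; consecutive clips share `overlap` frames.

    Hypothetical helper for illustration only. `end` is exclusive.
    """
    if not 0 <= overlap < clip_len:
        raise ValueError("overlap must satisfy 0 <= overlap < clip_len")
    if num_frames <= 0:
        return []
    stride = clip_len - overlap  # how far the window advances each step
    clips = []
    start = 0
    while True:
        end = min(start + clip_len, num_frames)  # last clip may be shorter
        clips.append((start, end))
        if end == num_frames:
            break
        start += stride
    return clips


if __name__ == "__main__":
    # 10 frames, clips of 4 frames, 1 frame shared between neighbours
    print(sliding_window_clips(10, 4, 1))
```

A larger `overlap` gives each clip more shared frames with its predecessor at the cost of more clips to process, which is the trade-off the overlap ablation looks at.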
Thank you for your reply.🥰🥰🥰
I read the “Inference Strategy” part of your paper, and as I understand it you use a sliding-window, inpainting-based strategy to keep depth consistent between video clips.
But my question is about something other than continuity. It concerns the first video clip, which is generated from pure noise. Because diffusion sampling is stochastic, the generated result should vary from run to run. Does this randomness affect the results? Is a strategy such as averaging needed to combine the different results?
In fact, this question comes from Section 3.4 of https://arxiv.org/pdf/2312.02145. I saw that they raised this issue, so I wanted to ask whether it also affects video-based depth estimation.
We have indeed attempted to incorporate a test-time ensemble scheme, similar to Marigold. However, contrary to expectations, our empirical findings suggest that it was ineffective when combined with EDM or video-based depth estimation. It might be a worthwhile avenue to explore further :)
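To make the idea concrete, here is a simplified sketch of what a test-time ensemble over several stochastic depth predictions can look like. It is not Marigold's exact procedure or the code used in the experiments mentioned above: it assumes each prediction is affine-invariant (correct only up to scale and shift), aligns every prediction to the first one by least squares, and takes the pixel-wise median. The function names are hypothetical.

```python
import numpy as np


def align_scale_shift(pred, ref):
    """Find scale s and shift t minimising ||s * pred + t - ref||^2, return s * pred + t."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, ref.ravel(), rcond=None)
    return s * pred + t


def ensemble_depth(preds):
    """Median-combine depth maps after aligning each one to the first.

    Simplifying assumption: the first prediction serves as the reference.
    """
    ref = preds[0]
    aligned = [ref] + [align_scale_shift(p, ref) for p in preds[1:]]
    return np.median(np.stack(aligned, axis=0), axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.random((4, 6))
    # Three "runs" that differ by scale, shift, and a little noise
    runs = [a * base + b + 0.01 * rng.standard_normal(base.shape)
            for a, b in [(1.0, 0.0), (2.0, 0.5), (0.5, -1.0)]]
    print(ensemble_depth(runs).shape)
```

For video, the same alignment would presumably have to be applied per clip rather than per frame so that it does not break temporal consistency, which may be one reason a naive ensemble is less helpful here.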
Thank you very much for your reply; I understand now. 🤗 I will close this issue.
Thank you for your work, it is very enlightening.
As far as I know, diffusion models are stochastic, which means that for the same video the depth result differs each time inference is run. My question is: did you design anything specifically for this problem? Or are the results for any given video usually good enough without eliminating the randomness?