jhaoshao / ChronoDepth

ChronoDepth: Learning Temporally Consistent Video Depth from Video Diffusion Priors
MIT License

Inference Strategy #1

Closed ChambinLee closed 1 month ago

ChambinLee commented 1 month ago

Thank you for your work, it is very enlightening.

As far as I know, diffusion models are stochastic, which means that for the same video the depth result differs each time inference is run. My question is: did you design anything specifically to address this problem, or are the results for a given video usually good enough without eliminating the randomness?

jhaoshao commented 1 month ago

Hi, thank you for noticing our work :). Indeed, when there is a lack of shared information across clips, the model can potentially struggle with inconsistent predictions. To address this, we've employed a sliding window strategy to facilitate the exchange of temporal information between clips. I would recommend referring to Section 3.3 in our paper https://arxiv.org/pdf/2406.01493 for a more detailed explanation. Additionally, we've provided ablation results concerning the overlap between clips for further insights.
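For readers who land on this issue, the overlap idea can be sketched roughly as follows. This is a minimal illustration of sliding-window inference with overlapping clips, not the authors' actual code; `predict_clip` is a hypothetical stand-in for the diffusion depth model, and overlapping frames are simply blended by averaging here (the paper's inpainting-style conditioning is more involved):

```python
import numpy as np

def sliding_windows(num_frames, clip_len, overlap):
    """Yield (start, end) indices of overlapping clips covering the video.
    Hypothetical helper showing the overlap scheme only."""
    stride = clip_len - overlap
    starts = list(range(0, max(num_frames - clip_len, 0) + 1, stride))
    # Make sure the last window reaches the end of the video.
    if starts[-1] + clip_len < num_frames:
        starts.append(num_frames - clip_len)
    return [(s, s + clip_len) for s in starts]

def predict_video_depth(video, clip_len=14, overlap=4, predict_clip=None):
    """Run per-clip inference; frames shared by neighbouring windows are
    averaged, so temporal information is exchanged across clip boundaries."""
    num_frames = len(video)
    depth = np.zeros_like(video, dtype=np.float64)
    weight = np.zeros(num_frames)
    for start, end in sliding_windows(num_frames, clip_len, overlap):
        clip_depth = predict_clip(video[start:end])  # stochastic diffusion call
        depth[start:end] += clip_depth
        weight[start:end] += 1.0
    return depth / weight[:, None, None]
```

With `clip_len=14` and `overlap=4`, a 20-frame video is covered by windows `(0, 14)` and `(6, 20)`, so frames 6-13 are predicted twice and blended.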

ChambinLee commented 1 month ago

> Hi, thank you for noticing our work :). Indeed, when there is a lack of shared information across clips, the model can potentially struggle with inconsistent predictions. To address this, we've employed a sliding window strategy to facilitate the exchange of temporal information between clips. I would recommend referring to Section 3.3 in our paper https://arxiv.org/pdf/2406.01493 for a more detailed explanation. Additionally, we've provided ablation results concerning the overlap between clips for further insights.

Thank you for your reply.🥰🥰🥰

I noticed that you mentioned an "Inference Strategy" in your article, and I believe you used a sliding-window-based inpainting strategy to ensure depth continuity between video clips.

But my question may be different from continuity. I would like to ask about the first video clip, which is generated from pure noise. Due to the randomness of the diffusion model, the generated result should itself be random. Does this randomness have an impact on the results? Is a strategy such as averaging needed to combine the different results?

In fact, this question comes from Section 3.4 of https://arxiv.org/pdf/2312.02145. I saw that they raised this issue, so I wanted to ask whether it also has an impact on video-based depth estimation.

jhaoshao commented 1 month ago

We have indeed attempted to incorporate a test-time ensemble scheme, similar to Marigold. However, contrary to expectations, our empirical findings suggest that it was ineffective when combined with EDM or video-based depth estimation. It might be a worthwhile avenue to explore further :)
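For context, the Marigold-style test-time ensemble mentioned here amounts to running the stochastic model several times with different noise seeds and aggregating per pixel. A rough sketch, assuming a hypothetical stochastic `predict(video, rng)` callable (Marigold additionally aligns the scale and shift of each prediction before aggregating, which is omitted for brevity):

```python
import numpy as np

def ensemble_depth(predict, video, n_runs=5, seed=0):
    """Run inference n_runs times with independent noise generators and
    take the per-pixel median, averaging out diffusion randomness."""
    rng = np.random.default_rng(seed)
    preds = [predict(video, np.random.default_rng(int(rng.integers(1 << 31))))
             for _ in range(n_runs)]
    return np.median(np.stack(preds, axis=0), axis=0)
```

As the reply above notes, this kind of averaging helped Marigold on single images but, empirically, did not help when combined with EDM or video-based depth estimation.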

ChambinLee commented 1 month ago

> We have indeed attempted to incorporate a test-time ensemble scheme, similar to Marigold. However, contrary to expectations, our empirical findings suggest that it was ineffective when combined with EDM or video-based depth estimation. It might be a worthwhile avenue to explore further :)

Thank you very much for your reply, I understand now. 🤗 I will close this issue.