noahzn / Lite-Mono

[CVPR2023] Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation
MIT License
527 stars 58 forks

what's the shape of the input batch when training? #126

Closed renyuanzhe closed 5 months ago

renyuanzhe commented 6 months ago

Is it something like [batch_size, sequence_length, H, W, 3]?

noahzn commented 6 months ago

The shape is [batch_size, C, H, W]
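For illustration only, a minimal sketch of such a batch (NumPy stands in for PyTorch tensors; the batch size of 12 and the 192×640 KITTI training resolution are assumptions, not values confirmed in this thread):

```python
import numpy as np

# Hypothetical training batch of 12 RGB frames, channels-first:
# [batch_size, C, H, W] with C=3 for RGB.
batch = np.zeros((12, 3, 192, 640), dtype=np.float32)
print(batch.shape)  # (12, 3, 192, 640)
```

Each frame in the batch is a single image; there is no sequence dimension in the network input itself.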

renyuanzhe commented 6 months ago

> The shape is [batch_size, C, H, W]

Doesn't it need image pairs from a sequence to train the model?

noahzn commented 6 months ago

It uses adjacent frames to compute the loss, but the prediction is on a single frame.

renyuanzhe commented 6 months ago

So the ground truth for the input image is the adjacent frame of the input image?

noahzn commented 6 months ago

No. You can read the paper or the source code to get a quick idea of how self-supervised depth estimation works.