ttengwang / PDVC

End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)
MIT License

Paper understanding #17

Closed by saharshleo 2 years ago

saharshleo commented 2 years ago

Hello, what are the exact dimensions of the input to the deformable transformer encoder? From what I understood:

So the input has a temporal dimension of $T \times L$, right?

ttengwang commented 2 years ago

Hi, the $L$ convolutional layers all have a stride of $K=2$, so the resulting temporal dimension is $T + T/K + \dots + T/K^{L}$.
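To make the arithmetic concrete, here is a minimal sketch that evaluates the sum above for a given $T$, $L$, and $K$. The function name `multiscale_lengths` and the example values are hypothetical and not taken from the PDVC code. It also assumes each stride-$K$ layer divides the length exactly (integer division), whereas the real lengths depend on the kernel size and padding used in the repository.

```python
def multiscale_lengths(T, L, K=2):
    """Temporal length at each scale: T, T/K, ..., T/K^L.

    Assumption: every stride-K layer reduces the length by exactly K
    (floor division); actual conv padding/kernel size may differ.
    """
    return [T // (K ** i) for i in range(L + 1)]


# Hypothetical example values, chosen so that T is divisible by K^L.
lengths = multiscale_lengths(T=128, L=3, K=2)
total = sum(lengths)

print(lengths)  # per-scale temporal lengths
print(total)    # total temporal dimension of the encoder input
```

With these example values the per-scale lengths are 128, 64, 32, and 16, for a total of 240, which is well below $T \times L = 384$.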