ttengwang / PDVC

End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)
MIT License

Some questions about your paper #1

Closed john2019-warwick closed 2 years ago

john2019-warwick commented 3 years ago

Hello, Teng

I have read your PDVC paper and run the code; it is very good work! However, there are a few points in the paper I don't understand. Could you explain them?

  1. I can't see how the N queries in the flow chart are obtained. The paper seems to use no anchors, so are the queries still derived from some pre-set anchors and ordered by confidence score?
  2. In Table 3, line 9 (MT [31] with TSN features), why is the result re-evaluated? Is it due to a different evaluation tool or different features? Also, the METEOR score of [31] in Table 1 is 9.25, which does not match the re-evaluated value of 4.98. Could you help me with this?
  3. Could you explain the difference between the PDVC-light and PDVC methods? You can answer in Chinese if you prefer. Thank you very much!
ttengwang commented 3 years ago
  1. The reference points serve as an initial guess of the locations of possible events. They are updated at each decoder layer and iteratively approach the ground-truth event set. To some extent, the reference points can be seen as "learnable anchors"; see the sketch after this list. Refer to DETR and Deformable DETR for more information.
  2. Because MT uses an older version of the evaluation tool; see https://github.com/salesforce/densecap/issues/16#issuecomment-519323542
  3. They have different captioning heads; see the last paragraph on Page 5 in the paper. A rough illustration of the contrast follows below.
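
For anyone curious how iterative reference-point refinement looks in code, here is a minimal sketch in the spirit of Deformable DETR, adapted to PDVC's 1-D temporal setting. This is not PDVC's actual module; the `Decoder`, `ref_head`, and `ref_init` names are illustrative, and a plain `nn.TransformerDecoderLayer` stands in for the deformable decoder layer.

```python
import torch
import torch.nn as nn

def inverse_sigmoid(x, eps=1e-5):
    # maps a point in (0, 1) back to logit space so offsets can be added
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

class Decoder(nn.Module):
    def __init__(self, d_model=256, num_layers=2, num_queries=10):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )
        # one scalar per query: the event center on the normalized timeline
        self.ref_head = nn.Linear(d_model, 1)
        # initial reference points are learned, not hand-designed anchors
        self.ref_init = nn.Embedding(num_queries, 1)

    def forward(self, queries, memory):
        # queries: (B, N, d_model) event queries; memory: (B, T, d_model) frames
        B = queries.size(0)
        ref = self.ref_init.weight.sigmoid().unsqueeze(0).expand(B, -1, -1)
        for layer in self.layers:
            queries = layer(queries, memory)
            # each layer predicts an offset that refines the reference point,
            # so the points iteratively approach the ground-truth events
            ref = (inverse_sigmoid(ref) + self.ref_head(queries)).sigmoid()
        return queries, ref  # ref: refined "learnable anchors" in [0, 1]
```

For example, `Decoder()(torch.randn(2, 10, 256), torch.randn(2, 100, 256))` returns the updated queries and 10 refined reference points per video, each in [0, 1].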
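
And a rough, hypothetical illustration of question 3: a lightweight head that conditions an LSTM only on the event-level query feature, versus a heavier head that also attends over frame features at every decoding step. The real heads are described in the paper; the class names here are made up, and standard multi-head attention stands in for the paper's attention mechanism.

```python
import torch
import torch.nn as nn

class LightHead(nn.Module):
    """Lightweight captioner: LSTM conditioned only on the event query."""
    def __init__(self, d_model=256, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.lstm = nn.LSTMCell(d_model, d_model)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, query, words):
        # query: (B, d_model) event feature; words: (B, L) token ids
        h, c = query, torch.zeros_like(query)
        logits = []
        for t in range(words.size(1)):
            h, c = self.lstm(self.embed(words[:, t]), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, 1)  # (B, L, vocab)

class AttnHead(LightHead):
    """Heavier captioner: additionally attends over frames at each step."""
    def __init__(self, d_model=256, vocab=1000):
        super().__init__(d_model, vocab)
        self.attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, query, words, frames):
        # frames: (B, T, d_model); the hidden state queries them each step
        h, c = query, torch.zeros_like(query)
        logits = []
        for t in range(words.size(1)):
            ctx, _ = self.attn(h.unsqueeze(1), frames, frames)
            x = self.fuse(torch.cat([self.embed(words[:, t]),
                                     ctx.squeeze(1)], dim=-1))
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, 1)  # (B, L, vocab)
```

The trade-off is the usual one: the attention-based head sees the frame features at every word, at the cost of extra computation per decoding step.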