simon-ging / coot-videotext

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Apache License 2.0

Does COOT use ground-truth timestamps to generate coherent captions (paragraphs)? #28

Closed by DesaleF 3 years ago

DesaleF commented 3 years ago

When generating captions for testing or validation, did you use the ground-truth timestamps to generate each sentence, or can COOT generate the paragraph caption without them?

simon-ging commented 3 years ago

We use ground-truth timestamps.

DesaleF commented 3 years ago

Okay!! Thank you. One more question: is there any way to predict the timestamps rather than using the ground truth? Do you have any recommendation for generating a coherent paragraph from timestamps proposed by the network?

simon-ging commented 3 years ago

You would need to use someone else's pretrained model for something like "dense event prediction" to get timestamps and then build the meta_all.json for your dataset with these new timestamps. Let us know if you find a good way to predict these timestamps.
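As a rough illustration of that second step, here is a minimal sketch that turns predicted event proposals into a per-video metadata dictionary and writes it out as JSON. Everything about the layout is an assumption: the input format of `proposals`, the helper names `build_meta` and `write_meta`, and the keys `duration_sec`, `segments`, `start_sec`, `stop_sec`, and `text` are all hypothetical. Compare them against the `meta_all.json` shipped with the dataset you are using and rename accordingly. Whether the loader accepts an empty `text` field is also not confirmed here.

```python
import json
from pathlib import Path


def build_meta(proposals, min_len_sec=0.5):
    """Convert predicted event proposals into a per-video metadata dict.

    proposals: {video_id: {"duration": float,
                           "events": [(start, stop, score), ...]}}
    This input format and all output keys are hypothetical; adapt them
    to your proposal model and to the real meta_all.json schema.
    """
    meta = {}
    for video_id, item in proposals.items():
        duration = float(item["duration"])
        segments = []
        # Sort by start time so the sentences follow temporal order.
        for start, stop, _score in sorted(item["events"]):
            # Clamp predictions to the valid range of the video.
            start = max(0.0, min(float(start), duration))
            stop = max(0.0, min(float(stop), duration))
            if stop - start < min_len_sec:
                continue  # drop degenerate or near-empty segments
            # No reference caption exists for a predicted segment.
            segments.append({"start_sec": start, "stop_sec": stop, "text": ""})
        meta[video_id] = {"duration_sec": duration, "segments": segments}
    return meta


def write_meta(meta, path):
    """Write the metadata dict to a JSON file."""
    Path(path).write_text(json.dumps(meta, indent=2), encoding="utf-8")


# Toy example: three proposals for one 10-second video.
proposals = {
    "v_demo": {
        "duration": 10.0,
        "events": [(4.0, 12.0, 0.7), (0.0, 3.5, 0.9), (5.0, 5.1, 0.2)],
    }
}
meta = build_meta(proposals)
```

In the toy example, the proposal that runs past the end of the video is clamped to the video duration and the 0.1-second proposal is dropped, leaving two segments. Calling `write_meta(meta, "meta_all.json")` would then write the file; the file name follows the comment above, but where the file has to live depends on your dataset setup.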

DesaleF commented 3 years ago

Thank you very much for your suggestion. I will look into that and post here if I succeed in predicting the timestamps.