showlab / UniVTG

[ICCV2023] UniVTG: Towards Unified Video-Language Temporal Grounding
https://arxiv.org/abs/2307.16715
MIT License

Questions about fine-tuning #28

Closed: yeliudev closed this issue 8 months ago

yeliudev commented 8 months ago

Hi @QinghongLin, many thanks for sharing this great work! I was wondering, when fine-tuning UniVTG on downstream datasets without curve (highlight) labels (e.g., NLQ, Charades-STA, TACoS), did you still use the "CLIP teacher" method to obtain pseudo labels? In other words, were the results of UniVTG and UniVTG w/ PT in Table 3 obtained using pseudo highlight labels?

QinghongLin commented 8 months ago

Hi @yeliudev, in downstream fine-tuning we do not derive any additional labels; we use the original annotations. Label derivation is only used during pretraining corpus creation.

yeliudev commented 8 months ago

@QinghongLin Thanks for your reply! Since additional labels are not used, are loss_s_inter and loss_s_intra also discarded? It seems we would not know which clips have lower saliency scores than the clips at saliency_pos_labels.

QinghongLin commented 8 months ago

Oh, sorry for the confusion. Let me clarify.

In Tab. 3, all three losses are used. We can derive saliency supervision from the manually annotated interval windows, e.g., clips inside the window should score higher than clips outside it, even though we don't know the exact saliency values. In this case, we use the original intervals as saliency supervision for the three losses; we do not use the CLIP teacher to obtain exact saliency scores.

In other words, the three losses can be flexibly used with or without exact saliency scores.
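For anyone reading this later, here is a minimal sketch of what "deriving saliency supervision from the interval" can look like: clips whose centers fall inside the annotated window get label 1, the rest get 0, and an intra-video ranking loss pushes inside clips above outside clips. The helper names (`interval_to_saliency_labels`, `intra_saliency_loss`) and the hinge formulation are illustrative assumptions, not UniVTG's exact implementation.

```python
import torch
import torch.nn.functional as F

def interval_to_saliency_labels(num_clips, clip_len, window):
    """Binary pseudo saliency from an annotated interval (illustrative, not
    UniVTG's exact code): clips whose centers fall inside `window` get 1,
    all other clips get 0."""
    starts = torch.arange(num_clips) * clip_len        # clip start times (sec)
    centers = starts + clip_len / 2.0                  # clip center times (sec)
    inside = (centers >= window[0]) & (centers < window[1])
    return inside.float()                              # shape (num_clips,)

def intra_saliency_loss(scores, labels, margin=0.2):
    """Hinge-style intra-video ranking loss: every clip inside the window
    should score at least `margin` higher than every clip outside it."""
    pos = scores[labels > 0.5]                         # clips inside the window
    neg = scores[labels < 0.5]                         # background clips
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.new_zeros(())
    # all (pos, neg) pairwise margin violations
    diff = margin - pos.unsqueeze(1) + neg.unsqueeze(0)  # shape (P, N)
    return F.relu(diff).mean()

# Usage sketch: 75 two-second clips, ground-truth moment at 10s-24s.
scores = torch.randn(75)                               # saliency head output
labels = interval_to_saliency_labels(75, 2.0, (10.0, 24.0))
loss = intra_saliency_loss(scores, labels)
```

Note that this only needs the interval annotation already present in NLQ/Charades-STA/TACoS, which is why no CLIP-teacher pseudo scores are required at fine-tuning time.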

yeliudev commented 8 months ago

I see. It seems the downstream tasks can still benefit from this weak supervision, which is interesting. Thank you again for your kind response!

QinghongLin commented 8 months ago

You are welcome. We have provided such ablation studies in our supplementary material; you can take a look at it.

[screenshot: ablation table from the supplementary material]