showlab / UniVTG

[ICCV2023] UniVTG: Towards Unified Video-Language Temporal Grounding
https://arxiv.org/abs/2307.16715
MIT License

Questions about fine-tuning #28

Closed yeliudev closed 1 year ago

yeliudev commented 1 year ago

Hi @QinghongLin, many thanks for sharing this great work! I was wondering, when fine-tuning UniVTG on downstream datasets without curve (highlight) labels (e.g., NLQ, Charades-STA, TACoS), did you still use the "CLIP teacher" method to obtain pseudo labels? In other words, are the results of UniVTG and UniVTG w/ PT in Table 3 obtained using pseudo highlight labels?

QinghongLin commented 1 year ago

Hi @yeliudev, in downstream fine-tuning we do not derive any additional labels; we use the original annotations. Label derivation is only used during pretraining corpus creation.

yeliudev commented 1 year ago

@QinghongLin Thanks for your reply! Since additional labels are not used, are loss_s_inter and loss_s_intra also discarded? It seems that we do not know which clips have lower saliency scores than saliency_pos_labels.

QinghongLin commented 1 year ago

Oh, sorry for the confusion. Let me clarify.

In Tab. 3, all three losses are used. We can derive saliency supervision from the manually annotated interval windows, e.g., clips inside the window are more salient than clips outside it, even though we don't know the exact saliency score values. In this case, we use the original interval to provide supervision for all three losses. We do not use the CLIP teacher to obtain exact saliency scores for these datasets.

That is, the three losses can be used flexibly, with or without exact saliency scores.
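Here is a minimal sketch (not the repository's actual code) of the idea described above: treating clips inside the annotated interval window as more salient than clips outside it, so a ranking-style saliency loss only needs relative ordering rather than exact scores. The function names, the half-clip-center heuristic, and the pair-sampling scheme are illustrative assumptions.

```python
import random


def derive_saliency_labels(window, clip_len, num_clips):
    """Mark each clip as foreground (inside the annotated window) or background.

    window:    (start_sec, end_sec) ground-truth interval annotation
    clip_len:  duration of each clip in seconds
    num_clips: number of clips in the video
    """
    start, end = window
    # A clip counts as "inside" if its center falls within the window (assumption).
    inside = [start <= (i + 0.5) * clip_len <= end for i in range(num_clips)]
    pos_ids = [i for i, flag in enumerate(inside) if flag]      # inside the window
    neg_ids = [i for i, flag in enumerate(inside) if not flag]  # outside the window
    return pos_ids, neg_ids


def sample_saliency_pairs(pos_ids, neg_ids, num_pairs=2):
    """Sample (positive, negative) clip index pairs for a ranking-style loss.

    Only the relative ordering (inside > outside) is supervised; no exact
    saliency score numbers are required.
    """
    pairs = []
    for _ in range(min(num_pairs, len(pos_ids), len(neg_ids))):
        pairs.append((random.choice(pos_ids), random.choice(neg_ids)))
    return pairs


if __name__ == "__main__":
    pos, neg = derive_saliency_labels(window=(10.0, 25.0), clip_len=2.0, num_clips=30)
    print(sample_saliency_pairs(pos, neg))
```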

yeliudev commented 1 year ago

I see. Seems like the downstream tasks can still benefit from this weak supervision. That is interesting. Thank you again for your kind response!

QinghongLin commented 1 year ago

You are welcome. We have provided such ablation studies in our supplementary material; you can take a look at them.
