hche11 / Localizing-Visual-Sounds-the-Hard-Way

Localizing Visual Sounds the Hard Way
Apache License 2.0

The reason for using a threshold (0.5) in cal_CIoU #14

Open Sunjuhyeong opened 1 year ago

Sunjuhyeong commented 1 year ago

Hi!

Why do you use the threshold (0.5) in cal_CIoU, even though training gives no information about the value 0.5? In other words, does it come from hyperparameter tuning, or does it follow from the mathematical properties of the contrastive loss?

* The reason I'm asking is that recent papers (Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning; A Closer Look at Weakly-Supervised Audio-Visual Source Localization) use relative prediction, which always chooses a 50% region as the prediction result, without any threshold, so I just became curious :)
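To make the distinction concrete, here is a minimal sketch of the two evaluation styles. It is not the repository's cal_CIoU: the function names (`ciou`, `absolute_mask`, `relative_mask`), the toy 4x4 heatmap, and the ground-truth map are all made up for illustration, and the cIoU formula follows my reading of the definition in "Learning to Localize Sound Source in Visual Scenes". It only shows that a fixed cutoff depends on the scale of the heatmap, while a relative cutoff does not.

```python
import numpy as np


def ciou(pred_mask, gt_map):
    """Consensus IoU (as I understand the definition; treat as an assumption).

    pred_mask: boolean array, True where the model predicts "sounding".
    gt_map:    non-negative consensus weights g_i (0 outside the annotation).
    """
    gt_region = gt_map > 0
    inter = (gt_map * pred_mask).sum()           # sum of g_i over predicted pixels
    false_pos = (pred_mask & ~gt_region).sum()   # predicted pixels outside the annotation
    return inter / (gt_map.sum() + false_pos)


def absolute_mask(heatmap, tau=0.5):
    """Fixed cutoff: keep pixels whose score exceeds tau."""
    return heatmap > tau


def relative_mask(heatmap, keep_fraction=0.5):
    """Relative cutoff: keep roughly the top `keep_fraction` of pixels."""
    cutoff = np.quantile(heatmap, 1.0 - keep_fraction)
    return heatmap > cutoff


# Toy heatmap with scores spread evenly over [0, 1]; higher rows score higher.
heatmap = np.arange(16, dtype=float).reshape(4, 4) / 15.0
# The same ranking of pixels, but every score squashed below 0.5.
squashed = 0.4 * heatmap

# Hypothetical ground truth: the bottom two rows are the sounding object.
gt_map = np.zeros((4, 4))
gt_map[2:, :] = 1.0

for name, h in [("original", heatmap), ("squashed", squashed)]:
    print(name,
          "absolute:", ciou(absolute_mask(h), gt_map),
          "relative:", ciou(relative_mask(h), gt_map))
```

In this toy case the fixed 0.5 cutoff selects nothing once the scores are squashed, whereas the relative cutoff selects the same pixels for both heatmaps. That scale sensitivity is what the question about the origin of 0.5 is getting at.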

CleyLyChen commented 3 months ago

Hi, I think I can answer your question. You can find the definition of cIoU in the paper "Learning to Localize Sound Source in Visual Scenes"; the authors give it in the "Results and analysis" section: [image]