jpthu17 / DiCoSA

[IJCAI 2023] Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment
Apache License 2.0

Discrepancy between paper and code regarding attention pooling temperature #4

Closed knightyxp closed 1 year ago

knightyxp commented 1 year ago

Hello,

While going through your paper and code, I noticed a discrepancy regarding the temperature parameter used in attention pooling. In the paper, it's mentioned that the softmax temperature is set to 0.01. However, in the code, the default temperature value appears to be 5, and in practice it seems to be set to 3.

Could you please clarify the correct value of the temperature? An explanation of the differences between these values would be greatly appreciated.

Thanks in advance for your time and assistance.

Best regards

jpthu17 commented 1 year ago

There are two temperature parameters in our work.

The first is the temperature $\tau$ used to aggregate the frame representations, corresponding to Eq. 1 in the paper and the linked code. A smaller $\tau$ makes the aggregation more sensitive to the textual condition, so the pooled visual feature takes the text into account more strongly. We set $\tau=3$ in practice.
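
For concreteness, here is a minimal sketch of this kind of text-conditioned pooling (the function and variable names are illustrative, not our actual code, and it assumes the common $\mathrm{softmax}(\mathrm{sim}/\tau)$ form):

```python
import torch
import torch.nn.functional as F

def aggregate_frames(text_emb, frame_embs, tau=3.0):
    """Text-conditioned attention pooling over frames.

    text_emb:   (B, D)    one embedding per caption
    frame_embs: (B, T, D) per-frame embeddings for each video
    tau:        softmax temperature; smaller values sharpen the
                attention, larger values spread it over more frames
    """
    # Text-to-frame similarities: (B, T)
    sims = torch.einsum('bd,btd->bt', text_emb, frame_embs)
    # Temperature-scaled attention weights over the frame axis
    weights = F.softmax(sims / tau, dim=-1)
    # Weighted sum of frame features: (B, D)
    return torch.einsum('bt,btd->bd', weights, frame_embs)
```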

The second is the temperature $\tau'$ in the InfoNCE loss, corresponding to Eq. 10 in the paper and the linked code. Following CLIP, we set $\tau'=0.01$ in practice.
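
As a reference, a minimal sketch of a symmetric InfoNCE objective with this temperature (again illustrative names, not our exact code):

```python
import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, tau_prime=0.01):
    """Symmetric InfoNCE over a batch of matched video-text pairs.

    video_emb, text_emb: (B, D); row i of each forms a positive pair.
    tau_prime: temperature; 0.01 (a logit scale of 100) follows CLIP.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / tau_prime   # (B, B) cosine similarities / tau'
    labels = torch.arange(logits.size(0), device=logits.device)
    # Contrast in both directions: video-to-text and text-to-video
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2
```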

Best regards

knightyxp commented 1 year ago

Thank you for your response. I have noticed that the setting of the temperature $\tau$ in the attention pooling of text-conditioned frame features varies across works: some use 0.01, while others use 3 or 5.

It's understood that a smaller temperature generally yields sharper, more distinctive weights, so the pooled feature aligns more closely with a specific text-sensitive frame, whereas a higher temperature smooths the distribution over frames. So I am wondering: is the choice of 3 as the temperature here based on experimental results?
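
A toy example of the effect I mean (made-up similarities, just to show the scaling):

```python
import torch
import torch.nn.functional as F

sims = torch.tensor([0.9, 0.7, 0.5])   # toy text-frame similarities
for tau in (0.01, 3.0, 5.0):
    print(tau, F.softmax(sims / tau, dim=-1))
# tau=0.01 -> ~[1.00, 0.00, 0.00]  (winner-take-all: one frame dominates)
# tau=3.0  -> ~[0.36, 0.33, 0.31]  (near-uniform: smooth pooling)
# tau=5.0  -> ~[0.35, 0.33, 0.32]  (even flatter)
```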

Additionally, the temperature $\tau'$ of 0.01 seems to correspond to the `logit_scale` parameter in the CLIP model, i.e., `logit_scale = self.clip.logit_scale.exp()`. Shouldn't this `logit_scale` be trainable in practice?
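
For context, this is how the scale is defined in OpenAI's CLIP (its pretrained value is what I assume gets inherited here):

```python
import numpy as np
import torch
import torch.nn as nn

# From OpenAI's CLIP: the logit scale is a learnable scalar kept in log
# space, initialized to log(1/0.07). In the pretrained checkpoints it has
# grown to roughly log(100), i.e. an effective temperature of about 0.01.
logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

# At use time it is exponentiated and multiplies the cosine similarities:
#   logits_per_video = logit_scale.exp() * video_emb @ text_emb.t()
```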

Thank you for your time, and I look forward to your clarification on this.

Best regards.

jpthu17 commented 1 year ago

In our experiments, we found that setting the temperature $\tau$ to 3 or 5 makes no significant difference. However, we do not recommend setting $\tau$ to 0.01, as this noticeably hurts performance.

In our code, $\tau'$ is initialized from the temperature in pretrained CLIP (so its initial value is almost 0.01). We found that $\tau'$ gradually increases during training, so, following CLIP issue 46, we clip the temperature to ensure that $\tau'$ stays at 0.01 (see the linked code).
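
In spirit, the clip looks like this (a minimal sketch, assuming the scale is stored in log space as in CLIP; the attribute names are illustrative, not our exact code):

```python
import math
import torch

LOG_SCALE_CAP = math.log(100.0)   # exp(4.6052) = 100, i.e. tau' = 0.01

def clamp_logit_scale(logit_scale: torch.nn.Parameter) -> None:
    """Pin the log-space scale at log(100) after each optimizer step,
    so the effective temperature 1/exp(logit_scale) stays at 0.01."""
    with torch.no_grad():
        logit_scale.clamp_(min=LOG_SCALE_CAP, max=LOG_SCALE_CAP)

# Illustrative usage, once per training step:
#   optimizer.step()
#   clamp_logit_scale(model.clip.logit_scale)
```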

Best regards.