microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

muP for contrastive losses #20

Closed xwjabc closed 1 year ago

xwjabc commented 2 years ago

Hi, I have a question regarding the use of muP in contrastive losses. Assume we have an anchor embedding x, a positive embedding x_pos, and a negative embedding x_neg. All of x, x_pos, and x_neg are C-dimensional vectors, where C is the width, which µP treats as an infinite dimension. The loss L is formulated as:

L = -log( exp(sim(x, x_pos)) / (exp(sim(x, x_pos)) + exp(sim(x, x_neg))) )

where sim(a, b) = cos(a, b) for each embedding pair. It seems that sim() reduces two infinite-dimensional vectors to a single scalar, similar to the Q K^T operation in self-attention. The difference, however, is that cosine similarity already bounds the output. So I wonder: is there anything we need to change in the loss function when we use muP? Thanks!
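For concreteness, a minimal PyTorch sketch of this loss (the function name and batch layout are illustrative assumptions, not code from this repo):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x, x_pos, x_neg):
    """L = -log( exp(sim(x, x_pos)) / (exp(sim(x, x_pos)) + exp(sim(x, x_neg))) ),
    with sim(a, b) = cos(a, b), computed per row for a batch of C-dim embeddings."""
    sim_pos = F.cosine_similarity(x, x_pos, dim=-1)   # (batch,)
    sim_neg = F.cosine_similarity(x, x_neg, dim=-1)   # (batch,)
    logits = torch.stack([sim_pos, sim_neg], dim=-1)  # (batch, 2)
    # -log softmax of the positive logit == cross-entropy with target class 0
    target = torch.zeros(x.shape[0], dtype=torch.long, device=x.device)
    return F.cross_entropy(logits, target)
```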

edwardjhu commented 1 year ago

Hi Weijian,

You are right that cosine similarity is okay here. The reason is that sim(x, x') = x^T x' / (||x|| ||x'||). For embeddings with correlated coordinates, the numerator x^T x' grows like C, and the denominator ||x|| ||x'|| also grows like C, so it supplies exactly the 1/C normalization that µP would otherwise require, just like the 1/d scaling of the attention logits with Q and K.
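As a quick numerical illustration of this point (a sketch, not code from this repo; the two width-C embeddings share a common component z, so their coordinates are correlated), the raw inner product x^T x' grows like C while the cosine similarity stays O(1):

```python
import torch

torch.manual_seed(0)
for C in [256, 1024, 4096, 16384]:
    z = torch.randn(C)
    x = z + torch.randn(C)        # correlated embeddings: shared component z
    x_prime = z + torch.randn(C)
    dot = x @ x_prime             # grows like C, as Q K^T does in attention
    cos = dot / (x.norm() * x_prime.norm())  # norms contribute the 1/C factor
    print(f"C={C:6d}  x^T x' = {dot:10.1f}  cos = {cos:.3f}")
```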

xwjabc commented 1 year ago

Gotcha. Thank you for your response!