Closed by xwjabc 1 year ago
Hi Weijian,
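You are right that cosine similarity is okay here. The reason is that sim(x, x') = x^T x' / (||x|| ||x'||). The denominator gives the correct scaling factor: when the coordinates of x and x' are Θ(1), each norm grows like sqrt(C), so ||x|| ||x'|| grows like C and cancels the growth of the inner product x^T x'. This plays the same role as the 1/d scaling of Q K^T in the attention case.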
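To make the scaling argument concrete, here is a minimal sketch in plain Python. It is not taken from the mup library, and the helper names (`dot`, `cosine`, `correlated_pair`, `scaling_table`) are made up for illustration. It assumes two embeddings with Θ(1) Gaussian coordinates that share a common component, so the raw inner product should grow roughly linearly with the width C while both the 1/C-scaled inner product and the cosine similarity stay roughly constant.

```python
import math
import random


def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: x^T x' / (||x|| ||x'||)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def correlated_pair(width, rng, noise=1.0):
    """Two vectors with Theta(1) coordinates that share a common component.

    Assumption for illustration only: shared signal plus independent noise,
    all drawn from a standard normal distribution.
    """
    shared = [rng.gauss(0.0, 1.0) for _ in range(width)]
    a = [s + noise * rng.gauss(0.0, 1.0) for s in shared]
    b = [s + noise * rng.gauss(0.0, 1.0) for s in shared]
    return a, b


def scaling_table(widths=(64, 256, 1024, 4096), seed=0):
    """Return (width, raw dot, dot / width, cosine) for each width."""
    rng = random.Random(seed)
    rows = []
    for c in widths:
        a, b = correlated_pair(c, rng)
        raw = dot(a, b)
        rows.append((c, raw, raw / c, cosine(a, b)))
    return rows


if __name__ == "__main__":
    print(f"{'C':>6} {'x^T x_prime':>14} {'x^T x_prime / C':>16} {'cos':>8}")
    for c, raw, scaled, cos in scaling_table():
        print(f"{c:>6} {raw:>14.2f} {scaled:>16.4f} {cos:>8.4f}")
```

Under these assumptions the raw inner product column should grow with C, while the last two columns should settle near constants (about 1 and about 0.5 for this particular construction), which is the sense in which the denominator already supplies the width scaling.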
Gotcha. Thank you for your response!
Hi, I have a question regarding the use of muP in contrastive losses. Assume we have an anchor embedding x, a positive embedding x_pos, and a negative embedding x_neg. All of x, x_pos, and x_neg are C-dim vectors, where C is the width, i.e. a dimension that muP treats as infinite. The loss L is formulated as:
L = -log( exp(sim(x, x_pos)) / (exp(sim(x, x_pos)) + exp(sim(x, x_neg))) )
where sim(a, b) = cos(a, b) for each embedding pair. It seems that sim() reduces two infinite-dim vectors to a finite scalar, which is similar to the Q K^T operation in self-attention. The difference, however, is that cosine similarity already bounds the output. Thus, I wonder whether there is anything we need to change in the loss function when we use muP? Thanks!
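For reference, the loss above can be written out as a short sketch in plain Python. The function names `cosine` and `contrastive_loss` are illustrative only, and the sketch assumes exactly what the question states: one positive, one negative, cosine similarity, and no temperature. It uses the identity -log(e^p / (e^p + e^n)) = log(1 + e^(n - p)) to avoid computing the ratio directly.

```python
import math


def cosine(a, b):
    """Cosine similarity: a^T b / (||a|| ||b||)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def contrastive_loss(x, x_pos, x_neg):
    """L = -log( exp(sim(x, x_pos)) / (exp(sim(x, x_pos)) + exp(sim(x, x_neg))) ).

    Rewritten as log(1 + exp(sim_neg - sim_pos)). Because both similarities
    lie in [-1, 1], the exponent lies in [-2, 2] regardless of the width.
    """
    s_pos = cosine(x, x_pos)
    s_neg = cosine(x, x_neg)
    return math.log1p(math.exp(s_neg - s_pos))


if __name__ == "__main__":
    # Toy 2-dim embeddings, then the same pattern tiled to a larger width.
    x, x_pos, x_neg = [1.0, 2.0], [2.0, 1.0], [-1.0, 0.5]
    for repeat in (1, 8, 512):
        loss = contrastive_loss(x * repeat, x_pos * repeat, x_neg * repeat)
        print(f"width={2 * repeat:>5}  loss={loss:.6f}")
```

Tiling the same coordinate pattern to a larger width leaves each cosine similarity unchanged, so the printed loss should be the same at every width up to floating-point rounding. This is a toy check of the boundedness mentioned in the question, not a statement about trained networks.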