I noticed that you do not apply a detach operation when computing the contrastive loss, which means the loss gradient flows through both the query and the keys. I was wondering whether this is common practice, because in other contrastive-learning works the gradients only flow through the query samples. Does it make any difference to training? Any hints would be appreciated.
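To make the question concrete, here is a minimal sketch of the two variants I have in mind; the function name `info_nce` and the `detach_keys` flag are hypothetical, just for illustration, and this is an assumed InfoNCE-style loss rather than your exact implementation:

```python
import torch
import torch.nn.functional as F

def info_nce(query, keys, temperature=0.07, detach_keys=False):
    # detach_keys=True stops gradients at the keys (MoCo-style stop-gradient);
    # detach_keys=False lets the loss update both branches (SimCLR-style).
    if detach_keys:
        keys = keys.detach()
    query = F.normalize(query, dim=1)
    keys = F.normalize(keys, dim=1)
    logits = query @ keys.t() / temperature  # (N, N) similarity matrix
    # Matching query/key pairs sit on the diagonal, so the target for
    # row i is class i.
    labels = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, labels)
```

My question is whether choosing between these two variants makes a practical difference in your setting.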