linyq2117 / CLIP-ES


why is sinkhorn? #1

Closed fsted closed 1 year ago

fsted commented 1 year ago

Hi, this is great work and the results are impressive. I have some questions about it. First, I want to know why you chose Sinkhorn to compute the attention-based affinity map, since there are many other methods. Second, there may be a small mistake in the paper: formula (8) seems to be missing a parenthesis, which confused me. Last, do you plan to release the training code? I'm looking forward to it 😄

linyq2117 commented 1 year ago

Thanks for your interest in our work!

(1) We use Sinkhorn normalization to make the attention weight matrix symmetric. Intuitively, the affinity between (a, b) and (b, a) should be the same, but the attention weight matrix is calculated between query and key, which are not equal. There are indeed other ways to achieve this, such as computing the similarity between keys or directly treating the attention weights as the affinity matrix, but Sinkhorn normalization yielded slightly higher performance in our experiments.
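
For readers unfamiliar with the operation, below is a minimal sketch of Sinkhorn-style row/column normalization followed by symmetrization; the variable names, shapes, and iteration count are illustrative assumptions and are not copied from the repository's code.

```python
import torch

def sinkhorn_normalize(attn, n_iters=2, eps=1e-8):
    # Alternately normalize rows and columns so the matrix approaches
    # a doubly-stochastic form (illustrative, not the repo's exact code).
    A = attn.clone()
    for _ in range(n_iters):
        A = A / (A.sum(dim=-1, keepdim=True) + eps)  # row normalization
        A = A / (A.sum(dim=-2, keepdim=True) + eps)  # column normalization
    return A

# Hypothetical usage: attn is an (N, N) attention weight matrix from a ViT block.
attn = torch.rand(196, 196).softmax(dim=-1)
aff = sinkhorn_normalize(attn)
aff = (aff + aff.transpose(-1, -2)) / 2  # enforce affinity(a, b) == affinity(b, a)
```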

(2) We did indeed miss a parenthesis there. Thanks for pointing it out; we will update the paper soon.

(3) Our method is training-free when generating the pseudo segmentation masks. If you mean training the final segmentation model with the pseudo masks, you can simply follow deeplab-pytorch and use the settings described in the paper.

fsted commented 1 year ago

Thanks for your explanation, I understand now. By the way, where is the code for the Confidence-guided Loss? I can't find it.

linyq2117 commented 1 year ago

We do not explicitly adopt a confidence-guided loss. Instead, for convenience, we set the segmentation mask to 255 wherever a pixel's max confidence is less than 0.95. Label 255 is then ignored by nn.CrossEntropyLoss in deeplab-pytorch. This process is implemented in the latest eval_cam_with_crf.py.
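
To make the mechanism concrete, here is a minimal sketch of the thresholding step; the tensor shapes, variable names, and the 21-class example are assumptions for illustration and are not copied from eval_cam_with_crf.py. Note that deeplab-pytorch passes ignore_index=255 to the loss (PyTorch's default ignore_index is -100).

```python
import torch
import torch.nn as nn

# Illustrative sketch, not the actual eval_cam_with_crf.py code.
# cam_probs: assumed (C, H, W) per-class confidence map (here C=21 for PASCAL VOC).
cam_probs = torch.rand(21, 64, 64).softmax(dim=0)

conf, pseudo_mask = cam_probs.max(dim=0)   # per-pixel max confidence and class label
pseudo_mask[conf < 0.95] = 255             # low-confidence pixels get the ignore label

# During segmentation training, label 255 is skipped by the loss,
# which deeplab-pytorch configures with ignore_index=255.
criterion = nn.CrossEntropyLoss(ignore_index=255)
```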