Questions about your implementation (Issue #1) · Open · hnczgytmp opened 3 months ago
Thank you for your interest in our work!
Regarding your questions, our answers are as follows:
Our main idea is to use self-correlation to enhance the semantic correlation between local patches. Both the inner product and cosine similarity can achieve this, but the semantic segmentation module that follows depends on a clustering module, and the unbounded value range of the inner product made it hard to select clustering hyperparameters. Introducing cosine similarity lets us keep uniform hyperparameters across different datasets and improves versatility.
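To make the range issue concrete, here is a minimal sketch (our own illustration with assumed shapes, not the repository's code) contrasting the two similarity choices:

```python
import torch
import torch.nn.functional as F

# x holds the patch tokens of one image, shape (num_patches, dim); the
# shapes here are illustrative assumptions.
x = torch.randn(196, 768)

# Inner-product self-correlation: unbounded values, so clustering
# hyperparameters (e.g. DBSCAN's eps) would need retuning per dataset.
inner = x @ x.t()

# Cosine self-correlation: values stay in [-1, 1] regardless of feature
# scale, so the same clustering hyperparameters can be reused everywhere.
xn = F.normalize(x, dim=-1)
cosine = xn @ xn.t()
```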
The core idea of the denoising is: if the attention response of a local patch to itself is not the largest (that is, it is smaller than its response to a global patch), then the patch is likely to be noise. This is what we wanted to express in the paper, and we are still working on and optimizing this method. We later discovered that the attention weights of the cls token in the deep layers (not the last layer, which is discussed below) are all concentrated on the global patches, so the simple (attn weights - cls weights) achieves the same screening idea and is more concise to write.
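A minimal sketch of the two screening criteria described above (shapes and names are illustrative assumptions; this is our reading of the thread, not the actual implementation):

```python
import torch

# attn: (1 + N, 1 + N) attention weights; index 0 is the cls token, the
# remaining N rows/columns are local patches (shapes are illustrative).
attn = torch.softmax(torch.randn(197, 197), dim=-1)
patch_attn = attn[1:, 1:]
cls_w = attn[0, 1:]  # cls-token weight on each patch

# Paper's criterion: a patch whose response to itself is not its largest
# response (its attention leaks onto global patches) is likely noise.
noisy = patch_attn.diagonal() < patch_attn.max(dim=-1).values

# Simplified criterion: since deep-layer cls weights concentrate on the
# global patches, (attn weights - cls weights) < 0 flags the same patches.
noisy_simple = (patch_attn.diagonal() - cls_w) < 0
```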
Regarding the use of the penultimate layer: at the end of A.1 in the supplementary material we note that the global-patch phenomenon is slightly alleviated in the last layer due to the modal-alignment objective function. So if we want the noise-filtering operation in the denoising module to be more effective (that is, the global-patch phenomenon to be easier to identify), we naturally need to use the penultimate layer.
Thank you again for your attention to this work. We hope these answers help.
------------------ Original message ------------------ From: "leaves162/CLIPtrase"; Sent: Wednesday, August 21, 2024, 11:01 PM; Subject: [leaves162/CLIPtrase] Questions about your implementation (Issue #1)
Thanks for your inspiring work.
Upon running your code, I noticed a couple of inconsistencies with what was outlined in your paper:
Semantic Correlation Computing
In your implementation, you derive the attention weights of the 12th layer in a particular manner (shown in two code screenshots; images not preserved). It appears, however, that this approach is not explicitly mentioned in the original paper. I am uncertain whether I may have misinterpreted your paper. Could you kindly explain why you generate the attention weights in this specific manner?
Denoising
In the original paper, you indicate identifying noisy clusters using the criterion shown in a screenshot (image not preserved).
From my understanding, w_{i,i} represents self-correlation. Yet, in your implementation (shown in a screenshot; image not preserved), you subtract the attention scores of the "cls" token from those of the other patches, which appears to deviate from the paper's statement. Could you kindly explain this for me? Additionally, I noted that you use the attention weights of the penultimate layer instead of the last layer for DBSCAN.
Could you assist me in clarifying the doubts I have mentioned above? Thank you.
Thank you for your timely response, which has been incredibly helpful and has provided me with a wealth of insights.
However, I still have some points of confusion:
For semantic correlation computing, I am still grappling with why you opt for the summation of three softmax outputs (with values potentially greater than 1) without normalization, as follows: https://github.com/user-attachments/assets/04907c65-a77d-49b6-a044-a9877e3e0d2b
Why not keep the same q-k similarity as in the previous layers (1-11)? https://github.com/user-attachments/assets/86c78dbc-5930-45dd-87eb-64684ec7c016
Or use the averaged semantic correlation matrix to aggregate the values? https://github.com/user-attachments/assets/3817e872-ed3a-49c5-9624-6112723de109
For denoising, I'm wondering why you adopt the in-place operation here: https://github.com/user-attachments/assets/13b2d8cc-ce7e-473c-b042-c3c41379d22d
As per my understanding, we could interpret attn_weights < 0 as an indicator to identify which patch is the 'global patch'. However, it appears that you are simultaneously calibrating all attention weights. Could you elaborate on the motivation behind this decision?
Your insights into these questions would be greatly appreciated. Thank you.
I'm glad the answers helped. Our responses to your further questions are as follows:
First, if we kept the same q-k similarity as in the previous 11 layers, the self-correlation idea we described would be lost.
Second, the reason we do not use softmax((qq_cos + kk_cos + vv_cos)/3.0) is mainly twofold: 1) softmax is sensitive to the value scale, so doing this would require designing a reasonable temperature coefficient; 2) we found that directly summing the three similarities weakens the local connections, which may be related to their different learning focuses (q, k, and v go through different linear layers). So we chose to perform softmax first and then add. As for keeping the weight sum equal to 1, the most natural idea is to apply softmax to the result again, but that brings back the temperature-coefficient design problem. In addition, our experiments showed that the sum of softmaxes does not harm performance, so we finally adopted this approach.
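As a concrete illustration of this choice, here is a minimal sketch (our own; the scale factor and names are assumptions, not the repository's code) contrasting the adopted design with the rejected alternative:

```python
import torch
import torch.nn.functional as F

def semantic_correlation(q, k, v, scale=10.0):
    # q, k, v: (num_patches, dim). Cosine-normalize so all similarities lie
    # in [-1, 1]; `scale` stands in for whatever temperature the real code
    # applies before softmax (an assumption here).
    q, k, v = (F.normalize(t, dim=-1) for t in (q, k, v))

    # Adopted design: softmax each self-correlation separately, then sum.
    # Each term is already a proper distribution, so no extra temperature
    # has to balance the three; rows sum to 3 instead of 1, which the
    # authors found harmless in experiments.
    adopted = (torch.softmax(scale * (q @ q.T), dim=-1)
               + torch.softmax(scale * (k @ k.T), dim=-1)
               + torch.softmax(scale * (v @ v.T), dim=-1))

    # Rejected alternative: average first, softmax once. This reintroduces
    # the temperature-design problem and, per the discussion above, weakens
    # local connections because q, k and v have different learning focuses.
    rejected = torch.softmax(scale * (q @ q.T + k @ k.T + v @ v.T) / 3.0, dim=-1)
    return adopted, rejected
```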
Pure "attn weights" are difficult to use as an indicator function for finding global patches, or are not very effective. So we use cls tokens as threshold (pure cls token < 0 can also achieve a similar effect, but requires, for example, 8 or 9 layers of weights). In addition, if the global patch is directly filtered out, it is difficult to ensure that there will be no accidental deletions, and changes in the number of patches may cause changes in clustering hyperparameters.