tmtuan1307 / LANDER

The paper "Text-Enhanced Data-free Approach for Federated Class-Incremental Learning" accepted by CVPR 2024

Query about the code. #2

sterzhang opened this issue 2 months ago

sterzhang commented 2 months ago

First, thanks for your great work; it really inspires me a lot! However, when following your code, we found that your classification layer operates on [feature] instead of [att]. It seems the CLIP-bolstered [att] is not actually used for the final classification, which confuses us. Maybe it is designed this way, but it would be better if you could explain the reason behind it. Really appreciated!

tmtuan1307 commented 2 months ago

Hi, yes, it is designed that way; let me explain. Given $[feature] \in R^d$ from any classifier model, we need to map it to the same dimension as CLIP's embedding ($512$) using $M \in R^{d \times 512}$:

$[att] = M([feature])$

If $[att]$ were used for the final classifier $h$ (as you mentioned), i.e. $\hat{y} = h([att]) = h(M([feature]))$, we would have to change the model's standard architecture and increase its number of parameters, potentially making the comparison unfair.

Using $[feature]$ for the final classifier $h$ (as in our design), i.e. $\hat{y} = h([feature])$, keeps the standard architecture and the number of parameters unchanged.

Furthermore, since $[att] = M([feature])$, our Bounding Loss on $[att]$ still affects $[feature]$ and bounds it around CLIP's Label Text Embedding.
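A minimal PyTorch sketch of this two-branch design may help. The names here (`LanderStyleNet`, `bounding_loss`), the cosine-distance form of the loss, and the default radius are illustrative assumptions, not the repository's actual implementation; only the structure follows the explanation above: classification on `[feature]`, Bounding Loss on `[att] = M([feature])`.

```python
import torch.nn as nn
import torch.nn.functional as F

class LanderStyleNet(nn.Module):
    """Sketch: standard classifier plus an auxiliary CLIP-alignment branch."""
    def __init__(self, backbone: nn.Module, feat_dim: int,
                 num_classes: int, clip_dim: int = 512):
        super().__init__()
        self.backbone = backbone                      # any standard encoder
        self.head = nn.Linear(feat_dim, num_classes)  # h: classifies [feature]
        self.mapper = nn.Linear(feat_dim, clip_dim)   # M: R^d -> R^512

    def forward(self, x):
        feature = self.backbone(x)   # [feature] in R^d
        logits = self.head(feature)  # y_hat = h([feature]): architecture unchanged
        att = self.mapper(feature)   # [att] = M([feature]): lives in CLIP space
        return logits, att

def bounding_loss(att, text_emb, r=0.5):
    # Illustrative form only: penalize [att] when it drifts farther than
    # radius r (in cosine distance) from the CLIP label-text embedding.
    dist = 1.0 - F.cosine_similarity(att, text_emb, dim=-1)
    return F.relu(dist - r).mean()
```

Because gradients from the loss flow through $M$ back into the backbone, bounding $[att]$ also shapes $[feature]$, even though classification never touches $[att]$.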

sterzhang commented 2 months ago

Got it. Still wondering whether you have done any ablation studies to validate that introducing the mapping layer is better than not introducing it. I notice you have only done a sensitivity analysis of the radius $r$ in your ablation study, without addressing the effectiveness of the mapping layer. Could you elaborate on this further? Appreciate it a lot!


tmtuan1307 commented 4 days ago

We believe one of the most important roles of the mapping is to project [feature] to the same size as the text embedding. Please note that different backbones use different latent embedding sizes, so without the mapping our method would not work in most cases.
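For instance (a sketch; the feature sizes listed are the standard ones for these backbones, and `mappers` is a hypothetical name):

```python
import torch.nn as nn

# Different backbones emit differently sized [feature] vectors:
feat_dims = {"resnet18": 512, "resnet50": 2048, "vit_b16": 768}

CLIP_DIM = 512  # CLIP's text-embedding size

# One mapping layer M per backbone aligns every [feature] with the 512-d
# Label Text Embedding; without it, the Bounding Loss could not even be
# computed for resnet50 (2048-d) or vit_b16 (768-d).
mappers = {name: nn.Linear(d, CLIP_DIM) for name, d in feat_dims.items()}
```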