hbai98 / SCM

MIT License

Some questions about SCM #2

Open rjy-fighting opened 1 year ago

rjy-fighting commented 1 year ago

Hello! I read your article carefully and was very interested in it! I have some questions as follows:

(1) Does the semantic similarity matrix E calculate the semantic similarity between all patches?

(2) After I print E, I find negative values. What does a negative value in E mean? (For example, the value -0.0383 in the first row.)

    tensor([[[ 1.0000,  0.3413,  0.3903,  ...,  0.1250, -0.0383,  0.1996],
             [ 0.3413,  1.0000,  0.4638,  ...,  0.0055,  0.0692,  0.2095],
             [ 0.3903,  0.4638,  1.0000,  ...,  0.0800, -0.1332,  0.2198],
             ...,

(3) Does SCM diffuse only according to the semantic and spatial relations of the four points in each patch's first-order neighborhood?

Hope to get your reply! Thank you very much!

rjy-fighting commented 1 year ago

I have another question: is the GT-Known evaluation metric reported in the paper GT-Known top-1 or GT-Known top-5?

rjy-fighting commented 1 year ago

May I know the environment in which the experiments were conducted? (For example, the GPU.)

hbai98 commented 1 year ago

Hi! Sorry for the late reply.

(1) Yes, the E matrix in Eq. (3) is the normalized outer product of the whole vertex set $V$ with itself, so $E_{i,j}$ denotes the semantic similarity between arbitrary nodes $v_i$ and $v_j$.

(2) $E_{i,j}$ is the cosine similarity $E_{i,j} = \frac{v_i^T v_j}{\|v_i\|\,\|v_j\|}$, and the inner product $v_i^T v_j$ can be either negative or positive.

Wikipedia's definition of cosine similarity: "The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same, with 0 indicating orthogonality or decorrelation."

So negative values indicate that the corresponding patches are dissimilar.
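To make (1) and (2) concrete, here is a minimal sketch (not the repository's code; the tensor shapes and names are made up for illustration) of how a cosine-similarity matrix between all patch tokens can be computed:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: 1 image, 196 patch tokens, 384-dim embeddings.
tokens = torch.randn(1, 196, 384)

v = F.normalize(tokens, dim=-1)        # v_i / ||v_i||
E = v @ v.transpose(-2, -1)            # E[i, j] = cosine similarity, in [-1, 1]

print(E[0, 0, :6])                     # E[0, 0, 0] is 1.0 (self-similarity);
                                       # off-diagonal entries can be negative
```

The diagonal is always 1 because each token is compared with itself; an off-diagonal entry is negative whenever two token embeddings point in roughly opposite directions.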

(3) That's correct to some extent. As we illustrate in the supplementary materials, for simplicity we only consider the first-order neighbors, i.e., the four adjacent points. (You can experiment with the difference of connecting second-order neighbors or more!)

SCM leverages the semantic and spatial relations to diffuse the raw attention so that it covers the complete object. Note that the critical design is that the semantic relations are continuously updated by the successive ADB layers (shown in Fig. 6), and accordingly the updated E revises the later diffusion steps.
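For intuition only, here is a rough sketch of one such diffusion step under these simplifications (the 4-neighbor adjacency and the update rule are my own illustration, not the actual ADB/Eq. (6) implementation):

```python
import torch
import torch.nn.functional as F

def first_order_adjacency(h, w):
    """4-connected adjacency over an h x w patch grid, with self-loops."""
    n = h * w
    A = torch.eye(n)
    for i in range(h):
        for j in range(w):
            p = i * w + j
            for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    A[p, ni * w + nj] = 1.0
    return A

def diffuse_step(attn, E, A):
    """One simplified diffusion step: propagate attention only to spatial
    (first-order) neighbors, weighted by semantic similarity E."""
    W = E.clamp(min=0) * A                  # combine semantic and spatial relations
    W = W / W.sum(dim=-1, keepdim=True)     # row-normalize the transition weights
    return W @ attn                         # spread attention over the patch graph

h = w = 14                                  # hypothetical 14 x 14 patch grid
A = first_order_adjacency(h, w)
feats = F.normalize(torch.randn(h * w, 384), dim=-1)
E = feats @ feats.T                         # cosine similarities between patches
attn = torch.softmax(torch.randn(h * w, 1), dim=0)   # raw attention scores
refined = diffuse_step(attn, E, A)          # attention after one diffusion step
```

In this toy version, updating E between steps (as the ADB layers do) would change the transition weights and therefore the later diffusion states.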

I think the most intriguing thing is that the diffusion can actually be done with a single layer! I hypothesize that some reinforcement-learning tricks could use one layer and receive a signal after each iteration step in Eq. (6), since we aim to find the intermediate state at which the attention happens to capture the object.

hbai98 commented 1 year ago

> I have another question: is the GT-Known evaluation metric reported in the paper GT-Known top-1 or GT-Known top-5?

GT-Known normally uses the top-1. We give the top-k values only for convenience.

hbai98 commented 1 year ago


> May I know the environment in which the experiments were conducted? (For example, the GPU.)

We used an A100 with enough memory to support a batch size of 256, maybe 40 GB; I don't remember exactly. Let me know if there are further questions.

rjy-fighting commented 1 year ago

I benefited a lot! Thank you for your reply!
