lixinustc / GraphAdapter

The efficient tuning method for VLMs

About Visual knowledge sub-graph #12

Open thinker9527 opened 6 months ago

thinker9527 commented 6 months ago

In baseclip_graph_v1.py, lines 215-220:

    inputs_text = self.base_text_features.unsqueeze(dim=1)  # [100, 1, 1024]
    inputs_img = img_feature.unsqueeze(dim=1)
    node_cluster_tt = node_cluster_t[:, :, index, :].repeat(inputs_text.size()[0], 1, 1)  # [100, 100, 1024] t->t
    node_cluster_it = node_cluster_i[:, :, index, :].repeat(inputs_text.size()[0], 1, 1)  # i -> t
    feat_tt = torch.cat([inputs_text, node_cluster_tt], dim=1)
    feat_it = torch.cat([inputs_text, node_cluster_it], dim=1)

Is inputs_img useless?

In paper, "As shown in Fig. 3, to construct the visual knowledge sub-graph Gv = {Cv, Ev}, we pass the augmented image group from the same class into visual encoder to obtain their visual features, and then compute the mean features of them as the nodes."

Should it be feat_it = torch.cat([inputs_img, node_cluster_it], dim=1) instead?

I am confused about this problem and would appreciate your response. Thank you very much.

lixinustc commented 6 months ago

Thanks for your question. Here the visual nodes are node_cluster_i, not inputs_img. The inputs_img variable is redundant in this code. Thanks.
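The shapes in the quoted snippet can be traced with a minimal sketch. This is not the repository's code: the random tensors stand in for real CLIP features, `img_node_features` is a hypothetical name for the per-class visual nodes, and it is an assumption that `node_cluster_i` is built with the same `view` pattern as the `node_cluster_t` line quoted later in this thread. With the 4-way split, the second dimension comes out as 25 rather than the 100 in the snippet's comment, which presumably reflects a setting without the split.

```python
import torch

num_classes, dim = 100, 1024   # sizes taken from the comments in the snippet
groups = 4                     # the hard-coded 4 from the view() call in this thread

# Random stand-ins (assumption) for the real text features and visual nodes.
base_text_features = torch.randn(num_classes, dim)
img_node_features = torch.randn(num_classes, dim)   # hypothetical name

node_cluster_t = base_text_features.view(1, num_classes // groups, groups, dim)
node_cluster_i = img_node_features.view(1, num_classes // groups, groups, dim)

index = 0
inputs_text = base_text_features.unsqueeze(dim=1)                            # [100, 1, 1024]
node_cluster_tt = node_cluster_t[:, :, index, :].repeat(num_classes, 1, 1)   # [100, 25, 1024]
node_cluster_it = node_cluster_i[:, :, index, :].repeat(num_classes, 1, 1)   # [100, 25, 1024]

feat_tt = torch.cat([inputs_text, node_cluster_tt], dim=1)   # [100, 26, 1024]
feat_it = torch.cat([inputs_text, node_cluster_it], dim=1)   # [100, 26, 1024]

print(tuple(feat_tt.shape), tuple(feat_it.shape))
```

In both concatenations the first slot along dim 1 is the text feature, and the visual information enters only through `node_cluster_it`, which matches the reply that `inputs_img` is not needed here.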

thinker9527 commented 6 months ago

Thanks for your reply. I'm not sure I understand correctly: do the nodes in the visual knowledge sub-graph include both text and visual nodes?

thinker9527 commented 6 months ago

There is something else I don't quite understand and would like to ask you about.

node_cluster_t = self.base_text_features.view(1, self.base_text_features.size()[0]//4, 4, self.base_text_features.size()[1])

Why is node_cluster_t.shape[2] equal to 4? On what basis was this number chosen? Would a different value affect the experimental results?

lixinustc commented 6 months ago

Thanks for your question. Here, 4 is the graph decomposition factor. It reduces the complexity of the fully connected graph, since ImageNet has 1000 classes. For other datasets, or if your GPU has more memory than a 3090 Ti, you can simply use 1.
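To see what the 4-way split does, the same `view` pattern can be applied to class indices instead of features. This is only an illustrative sketch: the pair counts below are a rough proxy for the size of one text-to-node block and not a measured memory figure, and `pairs_full` / `pairs_split` are names introduced here.

```python
import torch

num_classes, groups = 1000, 4      # ImageNet class count and the hard-coded 4
class_ids = torch.arange(num_classes)

# Same reshape pattern as the view() call, applied to class indices
# so the resulting grouping is easy to read.
grouped = class_ids.view(1, num_classes // groups, groups)

for index in range(groups):
    members = grouped[0, :, index]
    # Group `index` holds the classes index, index + 4, index + 8, ...
    print(index, members[:5].tolist(), len(members))

# Text-to-node pairs when every class is a node, versus one selected group.
pairs_full = num_classes * num_classes
pairs_split = num_classes * (num_classes // groups)
print(pairs_full, pairs_split)
```

Because `view` keeps the row-major order, selecting `index` along the third dimension yields a strided subset of 250 classes, so each text feature is paired with a quarter of the nodes at a time. Setting `groups = 1` puts all classes into a single graph.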

lixinustc commented 6 months ago

About the graph nodes, there might be some misunderstanding here. Each node is computed as the average of the few-shot visual embeddings from the same class. We do not build a sub-graph inside each node.
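The node construction described above can be sketched as a per-class mean. The sizes, the random features, and the names `visual_features`, `labels`, and `visual_nodes` are all assumptions for illustration; in the real pipeline the features would come from the visual encoder applied to the few-shot (augmented) images.

```python
import torch

num_classes, shots, dim = 5, 16, 1024   # hypothetical sizes
torch.manual_seed(0)

# Stand-in for encoder outputs: `shots` embeddings per class, stored class by class.
visual_features = torch.randn(num_classes * shots, dim)
labels = torch.arange(num_classes).repeat_interleave(shots)

# One node per class: the mean of that class's few-shot embeddings.
visual_nodes = torch.stack(
    [visual_features[labels == c].mean(dim=0) for c in range(num_classes)]
)

print(tuple(visual_nodes.shape))
```

The result has one row per class, so the visual sub-graph has as many nodes as there are classes, and no node contains a graph of its own.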

thinker9527 commented 6 months ago

Thank you for your patience in replying.

small-code-cat commented 1 day ago

> Thanks for your question. Here, 4 is the graph decomposition factor. It reduces the complexity of the fully connected graph, since ImageNet has 1000 classes. For other datasets, or if your GPU has more memory than a 3090 Ti, you can simply use 1.

After I changed the value from 4 to 1 and tested on Caltech101 with the other command parameters unchanged, the accuracy increased from 93.1% to 93.5%. Does artificially decomposing the graph affect how information propagates through it? If all nodes are in one graph, would we get more information in theory?