Open thinker9527 opened 6 months ago
Thanks for your question. Here, the visual nodes denote `node_cluster_i`, not `inputs_img`. `inputs_img` is a redundancy in this code. Thanks.
Thanks for your reply. I'm not sure I understood correctly — are the nodes of the visual knowledge sub-graph included in both the text and the visual nodes?
There is also something else I don't quite understand and would like to ask you about.
node_cluster_t = self.base_text_features.view(1, self.base_text_features.size()[0]//4, 4, self.base_text_features.size()[1])
Why is node_cluster_t.shape[2] = 4? On what basis was this number chosen, and would a different value affect the experimental results?
Thanks for your question. Here, 4 is for the graph decomposition, which aims to reduce the complexity of the fully connected graph, since ImageNet has 1000 classes. For the other datasets you can actually use just 1, or keep 4 if you have more memory than a 3090 Ti.
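To illustrate the decomposition (a minimal sketch with dummy tensors; the group count 4 comes from the repo, the rest of the names are my own): the `view` splits the N class features into 4 interleaved slots, so selecting one slot index yields a sub-graph over N / 4 classes instead of all N.

```python
import torch

# Sketch of the graph decomposition implied by the view/reshape:
# class c is assigned group c // 4 and slot c % 4, so picking one
# slot index gives a sub-graph over N // 4 classes.
N, D, G = 1000, 1024, 4            # classes, feature dim, groups (4 in the repo)
feats = torch.randn(N, D)          # stand-in for base_text_features
node_cluster = feats.view(1, N // G, G, D)

sub = node_cluster[:, :, 0, :]     # one sub-graph's nodes: [1, N // G, D]
print(sub.shape)                   # torch.Size([1, 250, 1024])

# A fully connected graph over N nodes has N * N pairwise relations;
# G sub-graphs over N // G nodes have G * (N // G) ** 2 — a G-fold reduction.
print(N * N, G * (N // G) ** 2)    # 1000000 250000
```

This is only a shape-level illustration of why memory drops with the decomposition, not the repo's full forward pass.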
About the graph nodes, there might be some misunderstanding here: each node is computed as the average of the few-shot visual embeddings from the same class. We do not build a sub-graph inside each node.
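The node construction described above (one node per class, averaged over the few-shot visual embeddings) can be sketched like this — my own illustration with made-up sizes, not the repo's code:

```python
import torch

# Each visual node is the mean of one class's few-shot visual embeddings.
num_classes, shots, dim = 10, 16, 1024
# stand-in for the visual-encoder outputs of the augmented image groups
few_shot_feats = torch.randn(num_classes, shots, dim)

# one node per class: average over the shot dimension
visual_nodes = few_shot_feats.mean(dim=1)   # shape [num_classes, dim]
print(visual_nodes.shape)                   # torch.Size([10, 1024])
```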
Thank you for your patience in replying.
> Thanks for your question. Here, 4 is for the graph decomposition, which aims to reduce the complexity of the fully connected graph, since ImageNet has 1000 classes. For the other datasets you can actually use just 1, or keep 4 if you have more memory than a 3090 Ti.
After I changed the value 4 to 1 and tested on caltech101 with the command-line parameters unchanged, the accuracy increased from 93.1% to 93.5%. I would like to ask whether artificially decomposing the graph affects the propagation of information across the graph. If all nodes are on one graph, will more information be passed in theory?
In baseclip_graph_v1.py, lines 215-220:

```python
inputs_text = self.base_text_features.unsqueeze(dim=1)  # [100, 1, 1024]
inputs_img = img_feature.unsqueeze(dim=1)
node_cluster_tt = node_cluster_t[:, :, index, :].repeat(inputs_text.size()[0], 1, 1)  # [100, 100, 1024] t -> t
node_cluster_it = node_cluster_i[:, :, index, :].repeat(inputs_text.size()[0], 1, 1)  # i -> t
feat_tt = torch.cat([inputs_text, node_cluster_tt], dim=1)
feat_it = torch.cat([inputs_text, node_cluster_it], dim=1)
```
Is inputs_img useless?
In paper, "As shown in Fig. 3, to construct the visual knowledge sub-graph Gv = {Cv, Ev}, we pass the augmented image group from the same class into visual encoder to obtain their visual features, and then compute the mean features of them as the nodes."
Should it instead be feat_it = torch.cat([inputs_img, node_cluster_it], dim=1)?
I am confused about this problem and would appreciate your response. Thank you very much.