Closed hzau96yhz closed 10 months ago
Sorry for the late response due to the CVPR deadline. It refers to the following: for each class, if we have m prompts, we feed all m prompts into the textual encoder and then average their embeddings into a single node of the graph. We will release the code soon.
Thank you for your reply! If each class has m prompts, are all m prompts of the same form? (For example, "a photo of a cat")
Due to the CVPR deadline, I will respond after CVPR. Sorry for that.
When a class has m prompts, the prompts can take different forms; you then average their output embeddings from the textual encoder.
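The averaging described above can be sketched as follows. This is a minimal illustration, not the paper's released code: `encode_text` is a toy stand-in for the real pretrained textual encoder (a deterministic random unit vector per prompt), and the template strings and embedding size are assumed for the example.

```python
import zlib
import numpy as np

DIM = 512  # assumed embedding dimension of the textual encoder

def encode_text(prompt: str) -> np.ndarray:
    """Toy stand-in for the textual encoder: a deterministic random
    unit vector per prompt (the real model would be a pretrained
    text encoder such as CLIP's)."""
    rng = np.random.default_rng(zlib.crc32(prompt.encode()))
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def class_text_node(class_name: str, templates: list[str]) -> np.ndarray:
    """One node of the text knowledge subgraph: encode all m prompts
    of the class and average their embeddings."""
    embs = np.stack([encode_text(t.format(class_name)) for t in templates])
    return embs.mean(axis=0)

# m = 3 prompts per class, in different forms, as discussed above.
templates = ["a photo of a {}.", "a sketch of a {}.", "a cropped photo of a {}."]
classes = ["cat", "dog"]  # K = 2 classes for illustration

# C_t has shape (K, DIM): one averaged node per class.
C_t = np.stack([class_text_node(c, templates) for c in classes])
print(C_t.shape)  # (2, 512)
```

The only point the sketch is meant to make is that the m per-class prompts are encoded separately and then mean-pooled into one node, regardless of whether the templates share the same form.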
What about the nodes in the visual knowledge subgraph?
Your paper says: "we pass the augmented image group from the same class into a visual encoder to obtain their visual features, and then compute the mean features of them as the nodes $C_v = \{c_v^i\}_{i=1}^{K} \in \mathbb{R}^{K \times d}$."
What does this mean?
For example, for the class "dog", do you compute the mean over all "dog" samples together with their augmented images, or do you randomly pick one "dog" sample along with its augmented images?
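For concreteness, here is a minimal sketch of the first reading of the quoted sentence (mean over all class samples and their augmented views). Everything here is an assumption for illustration: `encode_image` is a toy flatten-and-project encoder standing in for the pretrained visual backbone, `augment` is a placeholder augmentation group, and the image and embedding sizes are made up.

```python
import numpy as np

D = 512  # assumed embedding dimension of the visual encoder

def encode_image(img: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Toy visual encoder: flatten, linearly project, L2-normalise.
    Stands in for the paper's pretrained image backbone."""
    v = proj @ img.ravel()
    return v / np.linalg.norm(v)

def augment(img: np.ndarray) -> list[np.ndarray]:
    """Placeholder augmentation group: the original image, a horizontal
    flip, and a slightly noised copy."""
    rng = np.random.default_rng(0)
    return [img, np.flip(img, axis=2), img + 0.01 * rng.standard_normal(img.shape)]

def class_visual_node(images: list[np.ndarray], proj: np.ndarray) -> np.ndarray:
    """One node c_v^i of the visual subgraph, under the first reading:
    encode every sample of the class together with its augmented views,
    then take the mean feature."""
    feats = [encode_image(a, proj) for img in images for a in augment(img)]
    return np.mean(feats, axis=0)

rng = np.random.default_rng(42)
proj = rng.standard_normal((D, 3 * 32 * 32))                  # toy projection
dog_images = [rng.standard_normal((3, 32, 32)) for _ in range(5)]  # all "dog" samples
c_v_dog = class_visual_node(dog_images, proj)
print(c_v_dog.shape)  # (512,)
```

The alternative reading from the question would simply replace `dog_images` with one randomly chosen sample, e.g. `class_visual_node([dog_images[0]], proj)`; which of the two the paper intends is exactly what is being asked here.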
Hi! Your article inspired me a lot! But I have a small question. Regarding the text knowledge subgraph, the article mentions:
"given one downstream task with K classes, the nodes set Ct are obtained with the mean feature of the prompts from the different class, ..."
In general, a downstream task with K classes has K prompts, so what does the "mean feature of the prompts from the different class" mean here? Or should it be understood as: the prompts corresponding to the same class are fed into the text encoder, and their embeddings are then averaged?