lixinustc / GraphAdapter

The efficient tuning method for VLMs
74 stars, 1 fork

Question about text knowledge subgraphs #2

Closed: hzau96yhz closed this issue 10 months ago

hzau96yhz commented 11 months ago

Hi! Your article inspired me a lot! But I have a small question. For the text knowledge subgraph, the article says: "given one downstream task with K classes, the nodes set Ct are obtained with the mean feature of the prompts from the different class, ..." In general, K categories will have K prompts for a downstream task, so what does "mean feature of the prompts from different class" mean here? Or should it be understood as feeding the prompts corresponding to training samples of the same class into the text encoder and then averaging their embeddings?

lixinustc commented 11 months ago

Sorry for the late response; we were busy with the CVPR deadline. Here, it means that if a class has m prompts, we feed all of them into the textual encoder and average their embeddings to obtain one node of the graph. We will release the code soon.
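A minimal sketch of the averaging described in the reply above, assuming each class has a list of m prompts and the textual encoder is a function from a prompt string to a d-dimensional vector. `toy_encode_text` is a hypothetical stand-in (not GraphAdapter's or CLIP's encoder) so the example runs on its own; `build_text_nodes` and `mean_vector` are likewise illustrative names, not the repository's code.

```python
from typing import Callable, Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]


def build_text_nodes(
    class_prompts: Dict[str, List[str]],
    encode_text: Callable[[str], Vector],
) -> Dict[str, Vector]:
    """One node per class: the mean text embedding of that class's m prompts."""
    return {
        name: mean_vector([encode_text(p) for p in prompts])
        for name, prompts in class_prompts.items()
    }


def toy_encode_text(prompt: str) -> Vector:
    # Hypothetical stand-in for the textual encoder: maps a prompt to a 3-d vector
    # (prompt length, number of "a" characters, constant 1.0).
    return [float(len(prompt)), float(prompt.count("a")), 1.0]


prompts = {
    "cat": ["a photo of a cat", "a sketch of a cat"],
    "dog": ["a photo of a dog", "a sketch of a dog"],
}
nodes = build_text_nodes(prompts, toy_encode_text)
# Stack the K class nodes into a K x d matrix (K = 2, d = 3 here).
node_matrix = [nodes[name] for name in prompts]
print(nodes["cat"])  # [16.5, 3.0, 1.0]
```

With a real encoder the vectors would typically be L2-normalized before or after averaging; whether GraphAdapter does so is not stated in this thread.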

hzau96yhz commented 11 months ago

Thank you for your reply! If each class has m prompts, do all m prompts take the same form (for example, "a photo of a cat")?

lixinustc commented 11 months ago

Because of the CVPR deadline, I will respond after CVPR. Sorry about that.

lixinustc commented 10 months ago

When a class has m prompts, the prompts can take different forms, and you can average their output embeddings from the textual encoder.
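To illustrate prompts of different forms for one class, here is a small sketch that fills several templates with the class name and averages the resulting embeddings. The templates in `TEMPLATES` are hypothetical examples; this thread does not list the ones GraphAdapter actually uses, and `toy_encode_text` is a stand-in for the real textual encoder.

```python
from typing import Callable, List

Vector = List[float]

# Hypothetical prompt templates of different forms (not taken from GraphAdapter).
TEMPLATES = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a drawing of a {}.",
]


def class_prompts(class_name: str, templates: List[str]) -> List[str]:
    """Fill every template with the class name, giving m = len(templates) prompts."""
    return [t.format(class_name) for t in templates]


def class_node(
    class_name: str,
    templates: List[str],
    encode_text: Callable[[str], Vector],
) -> Vector:
    """Average the text embeddings of all m prompts into a single class node."""
    embeddings = [encode_text(p) for p in class_prompts(class_name, templates)]
    m = len(embeddings)
    return [sum(column) / m for column in zip(*embeddings)]


def toy_encode_text(prompt: str) -> Vector:
    # Hypothetical stand-in encoder: 2-d vector (prompt length, number of words).
    return [float(len(prompt)), float(len(prompt.split()))]


prompts = class_prompts("cat", TEMPLATES)
node = class_node("cat", TEMPLATES, toy_encode_text)
```

Because every prompt goes through the same encoder, their embeddings share one space, so averaging them is well defined even though the wording differs.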

PixelChen24 commented 10 months ago

What about the nodes in the visual knowledge graph? Your paper says the augmented image group from the same class is passed into a visual encoder to obtain their visual features, and the mean of these features is then taken as the nodes $C_v = \{c_v^i\}_{i=1}^{K} \in \mathbb{R}^{K \times d}$. What does this mean exactly? For example, for the class "dog", do you compute the mean over all "dog" samples together with their augmented images? Or do you randomly pick one "dog" sample along with its augmented images?
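The two readings in this question can be made concrete with a short sketch. It does not say which one GraphAdapter uses; the thread leaves that open. Everything here is hypothetical: an "image" is a flat list of pixel values, `augment` returns a toy augmentation group, and `encode_image` stands in for the visual encoder.

```python
import random
from typing import List

Vector = List[float]
Image = List[float]  # hypothetical: an image is just a flat list of pixel values


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]


def augment(image: Image) -> List[Image]:
    # Hypothetical augmentation group: original, flipped, and brightened copies.
    return [image, image[::-1], [p + 0.5 for p in image]]


def encode_image(image: Image) -> Vector:
    # Hypothetical stand-in for the visual encoder: (mean pixel, max pixel).
    return [sum(image) / len(image), max(image)]


def node_all_samples(images: List[Image]) -> Vector:
    """Reading (a): mean over every sample of the class and all of its augmentations."""
    feats = [encode_image(view) for img in images for view in augment(img)]
    return mean_vector(feats)


def node_one_sample(images: List[Image], rng: random.Random) -> Vector:
    """Reading (b): mean over one randomly chosen sample and its augmentations."""
    img = rng.choice(images)
    return mean_vector([encode_image(view) for view in augment(img)])


dog_images = [[0.0, 1.0], [2.0, 3.0]]
node_a = node_all_samples(dog_images)
node_b = node_one_sample(dog_images, random.Random(0))
```

Under reading (a) the node summarizes every training sample of the class, whereas under reading (b) it depends on which sample is drawn, so it varies with the random seed.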