muzairkhattak / ProText

[CVPRW 2024] Official repository of paper titled "Learning to Prompt with Text Only Supervision for Vision-Language Models".
https://muzairkhattak.github.io/ProText/
MIT License

Inquiry about Implementing Text Description Generation with LLM in Your Model #3

Closed SHIBOYA closed 4 months ago

SHIBOYA commented 6 months ago

Dear Muhammad,

I hope this message finds you well. I have been deeply engaged with your work on utilizing the CLIP model for visual tasks, and I am thoroughly impressed by your innovative approach and the results achieved. Specifically, I find the idea of using large language models (LLMs) like GPT-3 to generate textual descriptions for each category, and incorporating these descriptions into training, to be particularly intriguing. This method holds promising potential for enhancing model performance by learning and extracting rich contextual knowledge from LLM data.

In reviewing your code implementation, I tried to understand how this process was implemented but could not find the specific sections that detail it. It might be that I overlooked some details, or the process is represented in the code more subtly than I anticipated. Hence, I would like to ask a few questions:

  1. Did you directly utilize GPT-3 or other large language models to generate textual descriptions for each category within your model training workflow? If so, how was this process integrated into your model training?

  2. Regarding the generated textual descriptions, how did you integrate and utilize them for model training?

  3. If possible, could you share some practical examples or code snippets on implementing and utilizing these generated textual descriptions? It would be immensely helpful for understanding and applying your method to my own research projects.

I sincerely believe that grasping these details will enable me to better comprehend your work and explore the possibilities of applying this concept to my research. I look forward to your response and once again thank you for sharing this exciting piece of research.

Best regards!

SHIBOYA commented 6 months ago

During the training process, the input file consists of a label and a text description. How is this done? Have some functions in the dataloader been changed?

SHIBOYA commented 6 months ago

It seems that the templates folder is not used in the code.

muzairkhattak commented 6 months ago

Hi @SHIBOYA,

Thank you for showing interest in ProText!

Regarding your questions, kindly note the following.

Did you directly utilize GPT-3 or other large language models to generate textual descriptions for each category within your model training workflow? If so, how was this process integrated into your model training?

We do not directly utilize LLMs during the training workflow; instead, we use them offline. We first generate the textual descriptions for each training class with LLMs such as GPT-3 and store them in JSON files (saved here). If you want to regenerate the textual descriptions, you can use the code of the CuPL baseline here.
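For illustration, a minimal sketch of this offline step could look as follows; the OpenAI chat-completions client, the class names, the prompt templates, and the output file name are placeholders chosen for this example, and the actual prompts are in the CuPL code linked above:

```python
# Minimal sketch of offline description generation (not the exact CuPL script).
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

classnames = ["goldfish", "tabby cat", "airliner"]  # example classes
prompt_templates = [
    "Describe what a {} looks like.",
    "What are the distinguishing visual features of a {}?",
]

descriptions = {}
for name in classnames:
    outputs = []
    for template in prompt_templates:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",  # any capable LLM works here
            messages=[{"role": "user", "content": template.format(name)}],
            max_tokens=60,
            n=2,  # a few samples per prompt
        )
        outputs.extend(c.message.content.strip() for c in resp.choices)
    descriptions[name] = outputs

# Store the class -> list-of-descriptions mapping for the training stage.
with open("descriptions.json", "w") as f:
    json.dump(descriptions, f, indent=2)
```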

Regarding the generated textual descriptions, how did you integrate and utilize them for model training?

We use the textual descriptions to learn a mapping from "a photo of a CLS" to the detailed descriptions of that class obtained from the LLM. We have modified the data loader of the dassl library to produce such text-to-text pairs.

If possible, could you share some practical examples or code snippets on implementing and utilizing these generated textual descriptions? It would be immensely helpful for understanding and applying your method to my own research projects.

You can see how the CLIP text encoders are used in the forward pass at protext.py as shown here.
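As a rough conceptual sketch of that training step (not the actual protext.py code; the encode helper, the use of the frozen encoder on both sides, and the L1 loss are assumptions made only for this sketch):

```python
# Conceptual sketch of the text-to-text objective (see protext.py for the real
# forward pass).
import torch
import torch.nn.functional as F
import clip  # OpenAI's CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

def encode(texts):
    # Encode a list of strings with the (frozen) CLIP text encoder.
    tokens = clip.tokenize(texts, truncate=True).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)
    return F.normalize(feats, dim=-1)

# One text-to-text pair for the class "goldfish".
template_text = "a photo of a goldfish"
llm_description = "a small orange freshwater fish with shiny scales and flowing fins"

# In ProText the template side goes through learnable prompt vectors and so
# carries gradients; here both sides are frozen just to show the shape of the
# objective: map the template embedding onto the LLM-description embedding.
outputs = encode([template_text])
target_embed = encode([llm_description])

loss = F.l1_loss(outputs, target_embed)
print(loss.item())
```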

During the training process, the input file consists of a label and text description. How is this done? Have some functions in the dataloader been changed?

Yes, we have changed the data loader to get text-to-text examples. The PyTorch data loader is implemented here.
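A minimal sketch of such a text-to-text dataset might look like this, assuming the classname-to-descriptions JSON layout from the offline step; the file name and template string are placeholders, and this is not the modified dassl loader itself:

```python
# Minimal sketch of a text-to-text dataset built from the generated JSON file.
import json
from torch.utils.data import Dataset, DataLoader

class TextToTextPairs(Dataset):
    def __init__(self, json_path, template="a photo of a {}"):
        with open(json_path) as f:
            class_to_descriptions = json.load(f)
        # Flatten into (source prompt, target description, class index) triples.
        self.pairs = []
        for label, (name, descs) in enumerate(class_to_descriptions.items()):
            for desc in descs:
                self.pairs.append((template.format(name), desc, label))

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.pairs[idx]

loader = DataLoader(TextToTextPairs("descriptions.json"), batch_size=32, shuffle=True)
for source_texts, target_texts, labels in loader:
    pass  # tokenize both text lists and feed them to the model here
```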

I hope this is helpful. Feel free to ask if you have any further questions.

SHIBOYA commented 6 months ago

For the text generated by GPT, how do you ensure that the dimensions of target_embed and the outputs produced from multiple different text descriptions are equal?

SHIBOYA commented 6 months ago

The timely advice and professional guidance you offered are incredibly valuable to me.

SHIBOYA commented 6 months ago

Dear Muhammad,

I have encountered a challenge regarding the optimal way to simultaneously load both image and sentence data during training. Specifically, I noticed that in your implementation, when dealing with textual data, you've adopted a technique where each text entry is replicated via * len(input_text).

My question is, for a scenario where both images and their corresponding sentences need to be loaded concurrently, would you recommend applying a similar replication strategy for images to align with the sentence data in terms of tensor dimensions? Or is there a more efficient method or best practice that you could suggest for handling this type of multimodal data loading?

Best regards!

muzairkhattak commented 5 months ago

Dear Shiboya,

Thank you for your message.

I think you do not need to perform any replication (as done in the original code) if you already have image-text pairs as your multi-modal data.

The replication step is only needed for text-only data.
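As a rough illustration of the difference, with made-up tensors and strings rather than the repository's actual loader:

```python
# Illustrative contrast between the two settings (dummy data only).
import torch

# Multi-modal case: each image already comes with its own sentence, so the two
# lists align one-to-one and no replication is needed.
captions = [
    "a photo of a goldfish",
    "a photo of a tabby cat",
    "a photo of an airliner",
]
images = torch.randn(len(captions), 3, 224, 224)  # dummy image batch
assert images.shape[0] == len(captions)

# Text-only case: a single class template is paired with several LLM
# descriptions, so the template is replicated to align the two lists.
template = "a photo of a goldfish"
descriptions = [
    "a small orange fish with shiny scales",
    "an ornamental freshwater fish often kept in bowls",
    "a fish with flowing fins and a rounded body",
]
source_texts = [template] * len(descriptions)  # the replication step
assert len(source_texts) == len(descriptions)
```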

I hope this is helpful. Thank you and kind regards!

muzairkhattak commented 4 months ago

Hi @SHIBOYA,

I am closing this issue for now. Feel free to reopen if you have any further questions.

Thank you!