NielsRogge opened this issue 1 year ago
Hi @NielsRogge, can I work on it?
Sure @atturaioe feel free to start working on it
@NielsRogge, I would like to start working on this issue. Can I contribute?
@NielsRogge is this issue still open for contribution?
Yes
@NielsRogge If nobody is working on it, I would like to pick up the issue.
I would like to pick up the issue if it's still available.
@NielsRogge is this issue still open for contribution? I would like to work on it.
Support for BLIP in the image-to-text pipeline has been added in #21904. GIT can be added as explained in this comment, feel free to open a PR.
Support for the VQA pipeline still needs to be added for both models; contributions are welcome there as well.
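For reference, once a model is supported, usage through the pipeline looks roughly like this (a minimal sketch; the BLIP checkpoint name is an assumption):

from transformers import pipeline

# Image captioning with BLIP via the image-to-text pipeline (support added in #21904).
# The checkpoint name below is an assumption; any BLIP captioning checkpoint should work.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
print(captioner("https://www.wideopenpets.com/wp-content/uploads/sites/6/2021/12/Popular-Horse-Feature-Image.png"))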
@NielsRogge can I work on this issue??
Hello @NielsRogge!
I would like to work on this issue (adding VQA support to the GIT model) as a first contribution.
But before I start, I have a question: currently the only model implementing the VQA pipeline is ViltForQuestionAnswering, and it does the task using classification.
However, in the GIT paper they say:
For VQA, the input question is treated as a text prefix, and the answer is generated in an auto-regressive way. Furthermore, we present a new generation-based scheme for ImageNet classification, where the predicted labels come directly from our generative model without pre-defining the vocabulary.
So I wonder: should I implement it as a classifier, or should I follow the paper?
Thanks
Hi @marechaux, we will need to implement the 2 different approaches in the VQA pipeline. ViLT and GIT indeed solve VQA entirely differently (ViLT is a classifier, whereas GIT is a generative GPT-like model).
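To make the difference concrete, here is a rough sketch of the two approaches outside of the pipeline (checkpoint names are assumptions; the GIT part follows the question-as-prefix generation scheme from the paper):

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, GitForCausalLM, ViltForQuestionAnswering, ViltProcessor

url = "https://www.wideopenpets.com/wp-content/uploads/sites/6/2021/12/Popular-Horse-Feature-Image.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
question = "how many horses are there?"

# ViLT: VQA as classification over a fixed answer vocabulary.
vilt_processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
vilt_model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
vilt_inputs = vilt_processor(image, question, return_tensors="pt")
logits = vilt_model(**vilt_inputs).logits
print("ViLT:", vilt_model.config.id2label[logits.argmax(-1).item()])

# GIT: VQA as generation, with the question used as a text prefix.
git_processor = AutoProcessor.from_pretrained("microsoft/git-base-vqav2")
git_model = GitForCausalLM.from_pretrained("microsoft/git-base-vqav2")
pixel_values = git_processor(images=image, return_tensors="pt").pixel_values
question_ids = git_processor(text=question, add_special_tokens=False).input_ids
input_ids = torch.tensor([[git_processor.tokenizer.cls_token_id] + question_ids])
generated_ids = git_model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50)
print("GIT:", git_processor.batch_decode(generated_ids, skip_special_tokens=True)[0])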
Hey @NielsRogge, took a shot at this. Am I correct in understanding that the ideal implementation of "microsoft/git-base" in the image-to-text pipeline would look something like this?
from transformers import AutoProcessor, GitForVision2Seq, pipeline
processor = AutoProcessor.from_pretrained("microsoft/git-base")
model = GitForVision2Seq.from_pretrained("microsoft/git-base")
pipe = pipeline("image-to-text", model=model, image_processor=processor.image_processor, tokenizer=processor.tokenizer)
print(pipe("https://www.wideopenpets.com/wp-content/uploads/sites/6/2021/12/Popular-Horse-Feature-Image.png"))
If so, I got this to work by adding a GITImageProcessor whose preprocess method ends with this:
data = {"pixel_values": images}
return_data = BatchFeature(data=data, tensor_type=return_tensors)
return_data['input_ids'] = None
return return_data
rather than the CLIPImageProcessor.preprocess method returning this:
data = {"pixel_values": images}
return BatchFeature(data=data, tensor_type=return_tensors)
Curious to hear your thoughts on this approach. How would this affect other GIT image processing workflows (e.g. VQA)? Could we use a conditional to account for those?
Thanks for taking a stab at this. I'm fine with adding a GitForVision2Seq (as proposed by @Narsil), however it'd be great not to have to add a custom GITImageProcessor. What's the reason this is added? Is it only to include "input_ids" which are set to None?
Exactly this - 'only to include "input_ids" which are set to None?'
I see how adding an entirely new GITImageProcessor seems excessive when all it would do is add the input_ids: None key-value pair to the data returned from the .preprocess method.
As you describe here, https://github.com/huggingface/transformers/issues/21514#issuecomment-1446359970, once we hit the preprocess method in ImageToTextPipeline and map the model to GIT, the model_inputs are returned (via the CLIPImageProcessor through the GITProcessor in processing_git.py) without the input_ids key. So AFAIK, the best we can do is modify the return value of the CLIPImageProcessor.preprocess method, without changing the CLIPImageProcessor class itself, by replicating CLIPImageProcessor, rebranding it as GITImageProcessor, and modifying its .preprocess method.
Let me know if that works or if you feel there is a better approach. Is the idea that there would be some way to do this within GitForVision2Seq?
As an aside, I read some best practices for working in the transformers library (https://huggingface.co/transformers/v4.10.1/add_new_model.html#general-overview-of-transformers). Would it be preferable to copy the entire CLIPImageProcessor class as GITImageProcessor within processing_git.py, or to do something more like this within processing_git.py?
from transformers import CLIPImageProcessor

class GITImageProcessor(CLIPImageProcessor):
    def preprocess(self, *args, **kwargs):
        # Call the original CLIPImageProcessor.preprocess method
        return_data = super().preprocess(*args, **kwargs)
        # Add an 'input_ids' key set to None to the returned data
        return_data['input_ids'] = None
        return return_data
Hmm, I don't get why input_ids need to be set to None. Could you clarify?
This example shows that you only need to pass pixel_values to the generate method to do image captioning.
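For instance, something along these lines should be enough (a minimal sketch; the checkpoint name is an assumption):

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# GIT image captioning: generate() only needs pixel_values, no input_ids.
processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")

url = "https://www.wideopenpets.com/wp-content/uploads/sites/6/2021/12/Popular-Horse-Feature-Image.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])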
Hello, it seems that BLIP support for the image-to-text pipeline has been completed, but the VQA pipeline for both BLIP & GIT is not complete, nor is the image-to-text pipeline for GIT. @marechaux how is the VQA work for GIT going?
Hi! I'm also interested in helping out if we can divide the work :)
Hey @NielsRogge, I was working on the VQA pipeline for BLIP but I am confused about how I can pass pixel_values to the _forward method in VisualQuestionAnsweringPipeline (src), because BLIP requires pixel values and those are generated by the preprocessor. Sorry if this is a silly question; this is my first open source contribution.
Hi @Tanmaypatil123 there's already this PR: #23348. Feel free to take it over/improve it
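Roughly speaking, whatever dict preprocess returns is handed to _forward as model_inputs, so pixel_values just need to be part of that dict. A simplified sketch (not the actual VisualQuestionAnsweringPipeline code; method bodies and attribute names are illustrative):

# Simplified sketch of the preprocess -> _forward hand-off inside a pipeline.
def preprocess(self, image, question=None, **kwargs):
    # The processor turns the raw image (and question) into tensors, including pixel_values.
    return self.processor(images=image, text=question, return_tensors="pt")

def _forward(self, model_inputs):
    if self.model.can_generate():
        # Generative models (BLIP, GIT) produce the answer auto-regressively.
        return self.model.generate(**model_inputs)
    # Classification models (ViLT) just need a single forward pass.
    return self.model(**model_inputs)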
Hello, can I work on this?
Hi team, can I start working on it?
Hi @NielsRogge, I would like to try to add GIT support for VQA as my first contribution, is that ok? I looked at #23348 and I want to know if it is fine to return the full generated text. I made it work locally, so I could prepare a PR if no one else is working on this.
I believe the input_ids, or their length, could be used in the postprocess of VisualQuestionAnsweringPipeline to remove the prompt/prefix, like in TextGenerationPipeline, but it would require refactoring _forward in VisualQuestionAnsweringPipeline to also return the input_ids. Something like this, as a rough sketch (names are illustrative, not the actual pipeline code):
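def strip_prompt(generated_ids, input_ids, tokenizer):
    # generate() on GIT returns prompt + answer tokens, so dropping the first
    # input_ids.shape[1] tokens keeps only the generated answer.
    answer_ids = generated_ids[:, input_ids.shape[1]:]
    return tokenizer.batch_decode(answer_ids, skip_special_tokens=True)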
Is this still open for contribution? Would love to help out.
Hi @astern21, I started a draft PR, but didn't get to finish it, and now I'm not really working on it.
Feature request
BLIP and GIT are two recent additions to the library, providing state-of-the-art performance for tasks like image captioning and visual question answering (VQA). GIT is even capable of video captioning and video QA.
Hence it makes sense to support them in our image-to-text and VQA pipelines.
Motivation
Having support for better models in the pipelines is highly desirable!
See also a request for it here: https://discuss.huggingface.co/t/support-for-different-models-in-text-to-image-pipeline/29504
Your contribution
I can assist in adding support; see #18446 for a very similar case.