elena-soare20 opened 11 months ago
Hi @elena-soare20, thanks for raising this issue!
Yes, at the moment InstructBLIP isn't compatible with the pipeline because of the specific processing it does, which differs from many other models. Specifically, it has two tokenizers that create qformer_input_ids and input_ids to be passed to the model. There's some ongoing work to unify our processors so that hopefully more models like this can be quickly integrated.
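For illustration, here is a minimal sketch of what the InstructBLIP processor produces in a single call (the key names follow the InstructBlipProcessor documentation; the checkpoint and prompt are just examples), compared to the single input_ids a generic pipeline prepares:

import requests
from PIL import Image
from transformers import InstructBlipProcessor

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-flan-t5-xl")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The processor runs the image processor and both tokenizers in one call.
inputs = processor(images=image, text="describe the following image", return_tensors="pt")
print(inputs.keys())
# Expected keys: pixel_values, input_ids, attention_mask,
# qformer_input_ids, qformer_attention_mask

A generic image-to-text pipeline only prepares pixel_values and input_ids, so the qformer inputs are never created.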
Happy to review any PRs for anyone in the community who would like to enable this. See also: #21110
Hey @amyeroberts, I would be happy to work on this.
@nakranivaibhav Awesome! Feel free to ping me for review when you have a PR ready 🤗
@amyeroberts Give me some time on this. The models are very large, which makes it hard to reproduce the error locally. I'm figuring out where I can reproduce it so I can start working on a fix.
@nakranivaibhav If all you need is a model to test functionality (i.e. a randomly initialized model that outputs nonsense is fine), then the small model used during tests might help here. The config to build the model and the test inputs can be found here.
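As a rough sketch of that idea, you can build a tiny, randomly initialized InstructBLIP purely from configs rather than downloading the full checkpoint. The sizes below are illustrative assumptions, not the values from the actual test config linked above:

from transformers import (
    InstructBlipConfig,
    InstructBlipForConditionalGeneration,
    InstructBlipQFormerConfig,
    InstructBlipVisionConfig,
    T5Config,
)

# Tiny illustrative sizes so the model builds in seconds; the real test
# config uses different values.
vision_config = InstructBlipVisionConfig(
    hidden_size=32, intermediate_size=64, num_hidden_layers=2,
    num_attention_heads=4, image_size=30, patch_size=15,
)
qformer_config = InstructBlipQFormerConfig(
    hidden_size=32, intermediate_size=64, num_hidden_layers=2,
    num_attention_heads=4, encoder_hidden_size=32,  # must match the vision hidden size
)
text_config = T5Config(d_model=32, d_ff=64, num_layers=2, num_heads=4)

config = InstructBlipConfig.from_vision_qformer_text_configs(
    vision_config, qformer_config, text_config
)
model = InstructBlipForConditionalGeneration(config)  # random weights, nonsense outputs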
@amyeroberts Yes, that is what I need. Thank you for pointing it out.
System Info

transformers version: 4.36.0.dev0

Who can help?

@Narsil @amyeroberts
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
import requests
from PIL import Image
from transformers import InstructBlipProcessor, pipeline

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-flan-t5-xl")
pipe = pipeline(
    "image-to-text",
    model="Salesforce/instructblip-flan-t5-xl",
    processor=processor.image_processor,
    tokenizer=processor.tokenizer,
    device=0,
)
prompt = "describe the following image"
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

pipe(images=image, prompt=prompt)
Expected behavior
The pipeline should return a textual description of the image. Instead, I get an error:
TypeError: ones_like(): argument 'input' (position 1) must be Tensor, not NoneType
I suspect this is caused by ImageToTextPipeline.preprocess(), where we should have custom behaviour for InstructBlip models to process the image and text in one go:

inputs = processor(images=image, text=prompt, return_tensors="pt")
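As a rough illustration of that idea, here is a sketch of a pipeline subclass whose preprocess() runs the full InstructBlipProcessor on the image and text together, so qformer_input_ids is populated alongside input_ids and pixel_values. Note that InstructBlipImageToTextPipeline is a hypothetical name for this sketch, not an existing class, and this is not the fix that would land upstream:

from transformers import ImageToTextPipeline

class InstructBlipImageToTextPipeline(ImageToTextPipeline):
    # Hypothetical subclass: keep a reference to the full processor and
    # use it in preprocess() instead of the separate tokenizer/image processor.
    def __init__(self, instructblip_processor=None, **kwargs):
        super().__init__(**kwargs)
        self.instructblip_processor = instructblip_processor

    def preprocess(self, image, prompt=None, timeout=None):
        # Run the image processor and both tokenizers in a single call.
        return self.instructblip_processor(
            images=image, text=prompt, return_tensors=self.framework
        )

# Usage sketch, assuming model and processor are loaded as in the reproduction above:
# pipe = InstructBlipImageToTextPipeline(
#     model=model,
#     tokenizer=processor.tokenizer,
#     image_processor=processor.image_processor,
#     instructblip_processor=processor,
# )
# pipe(images=image, prompt=prompt)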