huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add Visual Question Answering (VQA) pipeline #17208

Closed · NielsRogge closed this issue 2 years ago

NielsRogge commented 2 years ago

Feature request

We currently have ViLT in the library, which, among other tasks, is capable of performing visual question answering (VQA).

It would be great to have a pipeline for this task, with the following API:

from transformers import pipeline

pipe = pipeline("vqa")
pipe("cats.png", "how many cats are there?")

This pipeline could default to the https://huggingface.co/dandelin/vilt-b32-finetuned-vqa checkpoint. Also check out the Space that showcases the model.

This can be implemented similarly to other pipelines. For an example of a PR that added a pipeline, see #11598.
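To make the shape of such a PR concrete, here is a minimal sketch of the three-stage call flow (preprocess → forward → postprocess) that transformers pipelines follow. The class names and the toy "model" below are purely illustrative assumptions, not the actual transformers `Pipeline` API, and no real ViLT model is loaded:

```python
import math


class SimplePipeline:
    """Stand-in for the call flow of a transformers pipeline."""

    def __call__(self, image, question):
        model_inputs = self.preprocess(image, question)
        model_outputs = self._forward(model_inputs)
        return self.postprocess(model_outputs)


class VqaSketchPipeline(SimplePipeline):
    """Hypothetical VQA pipeline; a real one would wrap ViLT."""

    def preprocess(self, image, question):
        # A real pipeline would run an image processor + tokenizer here.
        return {"image": image, "question": question}

    def _forward(self, model_inputs):
        # A real pipeline would call ViltForQuestionAnswering here;
        # we return fixed logits over a toy answer vocabulary instead.
        return {"logits": [2.0, 1.0, 0.1], "labels": ["2", "1", "3"]}

    def postprocess(self, model_outputs):
        # Softmax over the answer logits, return the best answer.
        logits = model_outputs["logits"]
        exps = [math.exp(x) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        return {"answer": model_outputs["labels"][best], "score": probs[best]}


pipe = VqaSketchPipeline()
result = pipe("cats.png", "how many cats are there?")
```

The point of the split is that `preprocess` and `postprocess` stay model-agnostic at the API surface, which is exactly what makes a shared pipeline hard when backbones disagree on the task format.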

Motivation

A pipeline is required in order to have inference widgets + a task defined at hf.co/tasks.

Moreover, it would be great to do VQA in two lines of code.

Your contribution

I can definitely assist in this, together with @Narsil, who's the pipeline expert.

Narsil commented 2 years ago

Tagging @mishig25 for the widget

LysandreJik commented 2 years ago

LXMERT should also be able to handle this task, but it likely has a very different API.

mishig25 commented 2 years ago

This sounds amazing. Happy to contribute in any way I can.

sijunhe commented 2 years ago

I'd love to pick this up!

sabarish-srinivasan commented 2 years ago

Hey @sijunhe, I'm just starting out in open-source, but I'd like to help out however I can!

sijunhe commented 2 years ago

@sabarish-srinivasan appreciate the help but I saw this a little late and I am almost done with the PR.

sabarish-srinivasan commented 2 years ago

@sijunhe No problem, thanks for letting me know!

sijunhe commented 2 years ago

@LysandreJik I looked at both ViLT and LXMERT and I don't think it's possible to combine these two into a single pipeline for the following reasons:

  1. ViLT formats VQA as a classification task, while LXMERT formats it as a SQuAD-like extractive QA task. It would be hard to write a common post-processing step for both.
  2. ViLT is self-contained within transformers, but LXMERT expects a separate Faster R-CNN model to generate the visual features that go into the model.
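The classification formulation in point 1 can be sketched as follows: the model emits one logit per answer in a fixed vocabulary, and post-processing is just a softmax plus a top-k lookup through an id-to-label map. The function name and the tiny vocabulary here are illustrative assumptions, not ViLT's actual code:

```python
import math


def vqa_classification_postprocess(logits, id2label, top_k=3):
    """Softmax over answer-vocabulary logits, return the top-k answers.

    `id2label` is assumed to map class index -> answer string, mirroring
    how ViLT treats VQA as classification over a fixed answer set.
    """
    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank answer indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [{"answer": id2label[i], "score": probs[i]} for i in ranked[:top_k]]
```

An extractive, SQuAD-style head instead predicts start/end positions in a text span, so its post-processing shares essentially nothing with the above, which is the crux of the incompatibility.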

NielsRogge commented 2 years ago

Yes, I don't think we should support LXMERT in the pipeline, since it isn't entirely self-contained within the Transformers library.

LysandreJik commented 2 years ago

Sounds good, let's go with ViLT then!

sijunhe commented 2 years ago

Now that #17286 is merged, should this issue be closed?

LysandreJik commented 2 years ago

Yes :) Thank you for your contribution @sijunhe!