lastmile-ai / aiconfig

AIConfig is a config-based framework to build generative AI applications.
https://aiconfig.lastmileai.dev
MIT License

Add visual-question-answering / multimodal support to gradio notebook tasks #1392

Open Bedrovelsen opened 6 months ago

Bedrovelsen commented 6 months ago

Enjoying the recent gradio notebook stuff!

Was curious about when/if support for an additional Hugging Face task option, "visual question answering", is planned?

If this isn't currently planned, could you give a quick overview of how to add a new task category to the gradio notebook codebase? (I can of course read over the current gradio notebook code myself and figure it out on my own, but guidance from the team on best practices for contributing would be preferred.)

saqadri commented 6 months ago

Thanks @Bedrovelsen! Would love your help adding that. I'll message you on Discord so our team can work with you and make sure you can get this set up!

Bedrovelsen commented 6 months ago

Sounds good

rholinshead commented 5 months ago

Just copying over the quick implementation overview from Discord here:

  1. A new HuggingFaceVisualQuestionAnsweringRemoteInference ModelParser under the https://github.com/lastmile-ai/aiconfig/tree/main/extensions/HuggingFace/python/src/aiconfig_extension_hugging_face/remote_inference_client folder. This parser should look pretty similar to the existing HuggingFaceImage2TextRemoteInference model parser, with the following changes (a rough sketch follows the list):
    • serialize implementation will do the same image/attachment data handling, but the constructed PromptInput will also need a data string representing the 'question' value from the data passed to serialize
    • refine_completion_params implementation can be the same, but should have a comment pointing to the visual_question_answering API code: https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/inference/_client.py#L1785
    • deserialize implementation can be mostly the same, except we will need to add 'question' to the completion_data from the prompt data: completion_data["question"] = prompt["data"]
    • run implementation will be similar as well; it just needs to call client.visual_question_answering with the completion_data and handle the response as desired. It looks like the response will be a list of VisualQuestionAnsweringOutputElement objects; we'll want to serialize those as ExecuteResult outputs in whatever format you think is best. For example, data could be the answer string, with the score stored in metadata
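
To make the steps concrete, here is a rough sketch of the run-side flow with the parser plumbing stripped away. The InferenceClient.visual_question_answering call and the answer/score fields follow the current huggingface_hub client, and ExecuteResult is assumed to come from aiconfig.schema; verify both against the existing HuggingFaceImage2TextRemoteInference parser rather than treating this as final code.

```python
# Rough sketch (not the final parser): the run-side flow described above.
from typing import List, Optional

from aiconfig.schema import ExecuteResult
from huggingface_hub import InferenceClient


def run_vqa(image_path: str, question: str, model: Optional[str] = None) -> List[ExecuteResult]:
    client = InferenceClient()  # picks up the HF token from the environment
    # completion_data, as assembled by deserialize(), boils down to image + question
    elements = client.visual_question_answering(
        image=image_path, question=question, model=model
    )
    # Each element carries an answer string and a confidence score; per the list
    # above, store the answer as the output data and keep the score in metadata.
    return [
        ExecuteResult(
            output_type="execute_result",
            execution_count=i,
            data=element.answer,
            metadata={"score": element.score},
        )
        for i, element in enumerate(elements)
    ]
```

Folding this into the parser is then mostly a matter of moving the pieces into the deserialize/run methods alongside the existing attachment handling.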

I believe the helpers for validating/retrieving the image from attachments can just be kept the same.

With the parser implemented, we can expose it in the extension here: https://github.com/lastmile-ai/aiconfig/blob/main/extensions/HuggingFace/python/src/aiconfig_extension_hugging_face/__init__.py
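
Assuming the parser lands in a new module next to the other remote-inference parsers (the file name below is hypothetical), the __init__.py change mirrors the existing exports:

```python
# extensions/HuggingFace/python/src/aiconfig_extension_hugging_face/__init__.py
# The module name is hypothetical; match whatever file the parser lands in.
from .remote_inference_client.visual_question_answering import (
    HuggingFaceVisualQuestionAnsweringRemoteInference,
)
```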

For testing the extension, please see the README instructions - https://github.com/lastmile-ai/aiconfig/blob/main/extensions/HuggingFace/python/README.md

Then, I would recommend importing and registering the new parser in https://github.com/lastmile-ai/aiconfig/blob/main/cookbooks/Gradio/aiconfig_model_registry.py with the id "Visual Question Answering", and then following the Getting Started instructions in https://github.com/lastmile-ai/aiconfig/blob/main/cookbooks/Gradio/README.md to open the huggingface.aiconfig.json file with the new parser registered.
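
For reference, the registry addition would look roughly like the entries already in that file. This is a sketch; the register_model_parser call and the import path (which assumes the __init__.py export above) should be checked against the current cookbook code:

```python
# Sketch of the aiconfig_model_registry.py addition, following the pattern of
# the parsers the Gradio cookbook already registers:
from aiconfig import AIConfigRuntime
from aiconfig_extension_hugging_face import (
    HuggingFaceVisualQuestionAnsweringRemoteInference,
)


def register_model_parsers() -> None:
    parser = HuggingFaceVisualQuestionAnsweringRemoteInference()
    # "Visual Question Answering" is the id the notebook UI will surface
    AIConfigRuntime.register_model_parser(parser, "Visual Question Answering")
```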

On the UI side, we will need to add a new PromptSchema to the client for rendering the parser's input and settings nicely. I can implement that shortly.

rholinshead commented 5 months ago

Whoops, linked #1396, which has the schema changes, and it auto-closed this. This issue is still open.