wagtail / wagtail-ai

Get help with your Wagtail content using AI superpowers.
https://wagtail-ai.readthedocs.io/latest
MIT License
127 stars 22 forks

Spike: Investigate AI-powered alt tags #32

Open tomusher opened 8 months ago

tomusher commented 8 months ago

We have discussed bringing computer-vision powered image alt-tagging/captioning into Wagtail AI.

Questions:

tm-kn commented 8 months ago

I assume there's no overlap with the Wagtail Vector Index on this one. Is this just uploading an image to the cloud service and getting an alt-tag for an individual image?

tomusher commented 8 months ago

@tm-kn Right. While there may be a need to generate an embedding of an image for captioning in some cases (e.g. for local models), we're more likely to use existing computer vision APIs for this, which will generally take the image directly.

tm-kn commented 8 months ago

Conventional services:

There are other services like this outside of the cloud ecosystem, but those three look most popular.

It does appear that they all only work in English. https://github.com/marteinn/wagtail-alt-generator also implements a call to Google Translate. I don't know if we want to include that in the scope of this.

We could also consider using GPT-4 with vision (https://platform.openai.com/docs/api-reference/chat/create). I assume we could then ask for a description in any language we like, or possibly even allow administrators to manage the prompt.

```python
from openai import OpenAI

client = OpenAI()

# Ask the vision model to describe a remote image.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    # The API expects the URL wrapped in an object.
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0])
```

Another consideration here is the UI. I think we could implement a button in JavaScript that runs on request, rather than making a request in a signal handler each time an image is uploaded. That could save money on API services and also make image uploads more resilient, since the upload wouldn't depend on a third-party API call.
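To illustrate the cost argument for the on-request approach, here's a minimal, framework-free sketch (all names hypothetical, not part of wagtail-ai): the button handler calls a function that invokes the paid vision backend only on demand and caches the result, so repeated clicks don't incur repeated API charges.

```python
def describe_on_request(image_url, backend, cache):
    """Call the paid vision backend only when the editor asks, and
    cache the result so repeated requests for the same image are free.

    `backend` is any callable taking an image URL and returning a
    description string; `cache` is any dict-like object.
    """
    if image_url in cache:
        return cache[image_url]
    description = backend(image_url)
    cache[image_url] = description
    return description
```

Compared with a signal handler, nothing runs at upload time at all; the API is only hit when an editor explicitly requests a description.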

llm still does not support images in prompts, but probably will: https://github.com/simonw/llm/issues/331 and https://github.com/simonw/llm/issues/325.

We could implement an OpenAI-specific image recognition backend for now.
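An OpenAI-specific backend could be a thin wrapper around the chat completions call shown above. This is just a sketch of one possible shape (the class and method names are illustrative, not wagtail-ai's actual API); injecting the client makes it easy to stub in tests.

```python
from dataclasses import dataclass


@dataclass
class OpenAIImageDescriptionBackend:
    """Hypothetical OpenAI-specific image description backend.

    `client` is an openai.OpenAI instance (or a stub in tests); the
    prompt could later come from admin-managed configuration.
    """

    client: object
    model: str = "gpt-4-vision-preview"
    prompt: str = "Describe this image in one sentence, for use as alt text."

    def describe_image(self, image_url: str, max_tokens: int = 300) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": self.prompt},
                        {"type": "image_url", "image_url": {"url": image_url}},
                    ],
                }
            ],
            max_tokens=max_tokens,
        )
        return response.choices[0].message.content.strip()
```

If llm gains image support later, this class could be swapped out behind the same interface without touching the callers.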

tm-kn commented 7 months ago

I've done a quick test spike here: https://github.com/wagtail/wagtail-ai/pull/47.

It seems to work beautifully with GPT-4.

There are some major issues here:

I'm unsure how we'd implement the UI for this.

Also, how do we configure prompts?

Something else to consider would be support for custom alt text fields, besides the title field.
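For context, Wagtail already supports custom image models, which is how sites typically add a dedicated alt text field. A sketch of what such a model looks like (the `alt_text` field name is illustrative; the generator would then need to know which field to write into, rather than assuming the title):

```python
# Assumes a Wagtail project; requires WAGTAILIMAGES_IMAGE_MODEL to point
# at this model in settings, per Wagtail's custom image model docs.
from django.db import models
from wagtail.images.models import AbstractImage, AbstractRendition, Image


class CustomImage(AbstractImage):
    # Dedicated alt text field, separate from the title.
    alt_text = models.TextField(blank=True)

    admin_form_fields = Image.admin_form_fields + ("alt_text",)


class CustomRendition(AbstractRendition):
    image = models.ForeignKey(
        CustomImage, on_delete=models.CASCADE, related_name="renditions"
    )

    class Meta:
        unique_together = (("image", "filter_spec", "focal_point_key"),)
```

Any configuration we design would presumably need a setting mapping the generated description onto a field like this.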

Another blocker is establishing the technical architecture for how this is configured. I know that others may have opinions about this, so it may be worth connecting first before the proper implementation is done.

Do we want to wait for the llm package's support?