swissspidy / media-experiments

WordPress media experiments
GNU General Public License v2.0

Alt text generation using AI #350

Open Illyism opened 5 months ago

erikyo commented 5 months ago

I'm curious: in what way? It's a good idea, but how would you implement it?

erikyo commented 5 months ago

I tried MobileNet, and I must say that although the results are impressive for how little it weighs, unfortunately it is far from what we need here.

Maybe it would also need some kind of training on the images already in the WordPress media library, but that assumes they are there and are (well) tagged.

Illyism commented 5 months ago

Fair - I think it's fairly difficult without an external or bigger API

erikyo commented 5 months ago

To tell you the truth, I was amazed that MobileNet, at just 2-3 MB, can do this: https://storage.googleapis.com/tfjs-examples/mobilenet/dist/index.html

Maybe with a custom model and some "homemade" training it's not so impossible.
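To make the gap concrete, here is a minimal sketch of turning classifier output into a rough alt text. It assumes predictions shaped like the ones the TensorFlow.js MobileNet model returns (`className` and `probability`); the function name and the 0.2 threshold are made up for illustration. The result is a list of labels, not a sentence, which is exactly the limitation described above.

```python
from typing import List, TypedDict


class Prediction(TypedDict):
    className: str      # label string, may contain comma-separated synonyms
    probability: float  # classifier confidence in [0, 1]


def labels_to_alt_text(predictions: List[Prediction], min_probability: float = 0.2) -> str:
    """Join the confident labels into a rough alt text; "" if none qualify."""
    confident = sorted(
        (p for p in predictions if p["probability"] >= min_probability),
        key=lambda p: p["probability"],
        reverse=True,
    )
    # ImageNet-style labels often list synonyms ("tabby, tabby cat"); keep the first.
    names = [p["className"].split(",")[0].strip() for p in confident]
    return ", ".join(names)


predictions: List[Prediction] = [
    {"className": "tabby, tabby cat", "probability": 0.62},
    {"className": "Egyptian cat", "probability": 0.25},
    {"className": "lynx, catamount", "probability": 0.05},
]
print(labels_to_alt_text(predictions))  # tabby, Egyptian cat
```

Returning an empty string when nothing is confident enough seems safer than guessing, since a wrong alt text is arguably worse than none.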

swissspidy commented 5 months ago

Automatic alt text generation has been on my to-do list for a while; it's even in the readme:

https://github.com/swissspidy/media-experiments/tree/29c6d473d149a5cb08c379ac94cb779a96ce13e9#alt-text-generation

I think it would be really cool to have a simple on-device implementation for such a feature, even just to demonstrate that it's possible.

As for custom models, the WP photo directory or Openverse would make for excellent sources of training data. As @erikyo mentioned, sites could also train models on their own media library, which would also be really cool from a privacy perspective, as the model would never leave their site.
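As a rough sketch of what collecting training data from a site's own library could look like: the function below filters media items, shaped like the objects returned by the WordPress REST API `/wp/v2/media` endpoint (`media_type`, `source_url`, `alt_text`), down to image and alt text pairs. The function name and the three-word minimum are hypothetical, just a crude stand-in for "well tagged".

```python
from typing import Dict, List, Tuple


def training_pairs(media_items: List[Dict], min_words: int = 3) -> List[Tuple[str, str]]:
    """Return (image URL, alt text) pairs for images whose alt text looks usable."""
    pairs = []
    for item in media_items:
        if item.get("media_type") != "image":
            continue  # skip videos, PDFs and other non-image attachments
        alt = (item.get("alt_text") or "").strip()
        url = item.get("source_url")
        # Skip missing or very short alt text; the word threshold is an arbitrary heuristic.
        if url and len(alt.split()) >= min_words:
            pairs.append((url, alt))
    return pairs


library = [
    {"media_type": "image", "source_url": "https://example.com/a.jpg",
     "alt_text": "A tabby cat sleeping on a sofa"},
    {"media_type": "image", "source_url": "https://example.com/b.jpg", "alt_text": ""},
    {"media_type": "file", "source_url": "https://example.com/c.pdf",
     "alt_text": "Annual report cover page"},
]
print(training_pairs(library))
```

On a real site, most of the effort would likely go into judging alt text quality, which a word count cannot do.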

If someone wants something more powerful, they can always use an external service. The same goes for the video captioning and the like.

erikyo commented 5 months ago

I've done some research in this direction. Generating "interesting" captions for photos with a homebrew method is not impossible, but it can be highly resource-intensive and may yield lower quality than other models. This is especially true if the desired outcome is a diverse set of descriptions for the images in a media library (which is mainly why we care, for SEO).

I came across an interesting approach that I'd like to mention. To address the challenge of generating varied image descriptions, you can create an API around the blip2-opt-2.7b model. I implemented this by following a guide found here, with some additional modifications to meet specific requirements, such as the ability to add a prompt. The result can be accessed at the following link:

blip-api by erikyo

Put the URL of an image in the first input and press submit. It takes about 60 seconds and then returns a description of the image. The results seem generally very good to me but, as @swissspidy pointed out, this method requires something external, because the model weighs 15 GB and needs a lot of computing power to run.
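For anyone who wants to call such a service from code instead of the form, here is a minimal client sketch. The endpoint URL, the JSON field names (`image_url`, `prompt`, `caption`) and both function names are assumptions for illustration, not the actual interface of the API linked above. The timeout is set well above the roughly 60 seconds a request takes.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical endpoint; the real URL and payload shape are not given in this thread.
API_URL = "https://example.com/caption"


def build_caption_request(image_url: str, prompt: Optional[str] = None) -> urllib.request.Request:
    """Build a POST request carrying the image URL and an optional prompt."""
    payload = {"image_url": image_url}
    if prompt:
        payload["prompt"] = prompt
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_caption(image_url: str, prompt: Optional[str] = None, timeout: float = 120.0) -> str:
    """Send the request and return the caption; blocks until the model responds."""
    request = build_caption_request(image_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["caption"]  # assumed response field
```

Given the latency, a plugin would probably want to run this in a background job after upload and not block the editor on it.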

swissspidy commented 2 months ago

https://developers.googleblog.com/en/gemma-family-and-toolkit-expansion-io-2024/ looks veeery promising
