Closed by Illyism 3 months ago
I tried MobileNet, and I must say that although the result is impressive for how little the model weighs, it is unfortunately far from what we would need here.
It would probably also need some kind of training on the images already in the WordPress media library, but that assumes the images are there and are (well) tagged.
Fair - I think it's fairly difficult without an external or bigger API
To tell you the truth, I was amazed by what MobileNet can do at just 2-3 MB: https://storage.googleapis.com/tfjs-examples/mobilenet/dist/index.html
Maybe with a custom model and some "homemade" training it's not so impossible.
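To make the gap concrete, here is a minimal TypeScript sketch of what turning classifier output into alt text could look like. It assumes predictions shaped like the `{ className, probability }` objects that the `@tensorflow-models/mobilenet` package returns from `classify()`. The helper name `altTextFromPredictions`, the 0.2 threshold, and the sentence template are made up for illustration.

```typescript
// Shape of one classifier prediction, mirroring what MobileNet's classify() returns.
interface Prediction {
  className: string; // may list synonyms, e.g. "tabby, tabby cat"
  probability: number; // between 0 and 1
}

// Hypothetical helper: builds a short alt text from the most confident labels.
// Returns an empty string when nothing clears the threshold, so the caller
// can leave the alt text for a human to fill in.
function altTextFromPredictions(
  predictions: Prediction[],
  minProbability = 0.2, // assumed cut-off, not a recommended value
  maxLabels = 3
): string {
  const labels = predictions
    .filter((p) => p.probability >= minProbability)
    .sort((a, b) => b.probability - a.probability)
    .slice(0, maxLabels)
    // Keep only the first synonym of each class name.
    .map((p) => (p.className.split(",")[0] ?? "").trim())
    .filter((label) => label.length > 0);

  // Drop duplicate labels while keeping their order.
  const unique = Array.from(new Set(labels));
  return unique.length > 0 ? `Image of ${unique.join(", ")}` : "";
}
```

With "tabby, tabby cat" at 0.62 and "Egyptian cat" at 0.25 this produces "Image of tabby, Egyptian cat": a list of labels rather than a description, which is roughly the limitation discussed above.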
Automatic alt text generation has been on my to-do list for a while; it's even in the readme:
I think it would be really cool to have a simple on-device implementation for such a feature, even just to demonstrate that it's possible.
As for custom models, the WP photo directory or Openverse would make for excellent sources for training data. As @erikyo mentioned, sites could also train models on their own media library for example, which would be really cool also from a privacy perspective as the model would never leave their site.
If someone wants something more powerful, they can always use an external service. The same goes for the video captioning and the like.
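For the idea of training on a site's own media library, a rough TypeScript sketch of the data preparation step might look like this. It assumes media items with the `media_type`, `source_url`, and `alt_text` fields that the WordPress REST API exposes under `/wp/v2/media`. The helper `toTrainingPairs` and the three-word minimum are hypothetical stand-ins for "(well) tagged", not an existing API.

```typescript
// Subset of the fields returned by the WordPress REST API for /wp/v2/media.
interface MediaItem {
  id: number;
  media_type: string; // "image" or "file"
  source_url: string;
  alt_text: string;
}

// One (image, description) example for a hypothetical training set.
interface TrainingPair {
  imageUrl: string;
  text: string;
}

// Hypothetical helper: keeps only images whose alt text looks usable as a label.
// minWords is an assumed heuristic to skip empty or filename-like alt texts.
function toTrainingPairs(items: MediaItem[], minWords = 3): TrainingPair[] {
  return items
    .filter((item) => item.media_type === "image")
    .map((item) => ({ imageUrl: item.source_url, text: item.alt_text.trim() }))
    .filter((pair) => pair.text.split(/\s+/).filter(Boolean).length >= minWords);
}
```

Everything here runs on the site's own data, so nothing has to leave the site to build the training set.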
I've done some research in this direction. Generating "interesting" captions for photos through a homebrew method is not impossible, but it can be highly resource-consuming and may result in lower quality compared to other models. This is especially true if the desired outcome is a diverse set of descriptions for the images in a media library (which is what we care about, mainly for SEO).
I came across an interesting approach that I'd like to mention. To address the challenge of generating varied image descriptions, you can create an API using the blip2-opt-2.7b model. I successfully implemented this by following a guide found here, and I made some additional modifications to meet specific requirements, such as the ability to add a prompt. The result of this implementation can be accessed at the following link:
In the first input you have to put the URL of an image and then press submit. It takes about 60 seconds and then returns the description for it. The result seems generally very good to me but, as @swissspidy was pointing out, it is a method that requires something external (because the model weighs 15 GB and you need a lot of computing power to run it).
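Calling such an external captioning service from a plugin could be sketched as follows in TypeScript. The actual API is not documented in this thread, so the field names `image_url`, `prompt`, and `caption` are assumptions, as are the helper names and the two-minute timeout (chosen only because a response takes about 60 seconds). The HTTP client is passed in as a function so the sketch does not depend on a particular one.

```typescript
// Hypothetical request shape; the real API may use different field names.
interface CaptionRequest {
  image_url: string;
  prompt?: string;
}

// Minimal transport signature: sends a JSON body and resolves with the raw response text.
type PostJson = (endpoint: string, jsonBody: string) => Promise<string>;

// Validates the image URL and attaches the optional prompt.
function buildCaptionRequest(imageUrl: string, prompt?: string): CaptionRequest {
  if (!/^https?:\/\//i.test(imageUrl)) {
    throw new Error(`Expected an http(s) image URL, got: ${imageUrl}`);
  }
  const trimmed = prompt?.trim();
  return trimmed ? { image_url: imageUrl, prompt: trimmed } : { image_url: imageUrl };
}

// Sends the request and gives up after timeoutMs (assumed default: 2 minutes).
async function requestCaption(
  endpoint: string,
  post: PostJson,
  imageUrl: string,
  prompt?: string,
  timeoutMs = 120000
): Promise<string> {
  const body = JSON.stringify(buildCaptionRequest(imageUrl, prompt));
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("Caption request timed out")), timeoutMs);
  });
  try {
    const raw = await Promise.race([post(endpoint, body), timeout]);
    // Assumed response shape: { "caption": "..." }.
    const parsed = JSON.parse(raw) as { caption?: string };
    return (parsed.caption ?? "").trim();
  } finally {
    clearTimeout(timer); // avoid a dangling timer once the race is settled
  }
}
```

Because generation is this slow, a call like this would more likely run in a background task than block the upload flow.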
https://developers.googleblog.com/en/gemma-family-and-toolkit-expansion-io-2024/ looks veeery promising
I got this working now at https://github.com/swissspidy/ai-experiments using Transformers.js but only when I disable cross-origin isolation 😫 Need to figure out how to resolve it.
I'm curious: in what way? It's a good idea, but how would you implement it?