gitmylo / audio-webui

A webui for different audio related Neural Networks
MIT License
964 stars 90 forks

[FEATURE REQUEST] Image2SFX #209

Open Merzmensch opened 5 months ago

Merzmensch commented 5 months ago

Would it be possible to implement Image2SFX (https://huggingface.co/spaces/fffiloni/Image2SFX-comparison), ideally with the option to compare different models? Perhaps even with a multiple-choice UX where you can select the models you would like to use.

Thank you!

d8ahazard commented 3 months ago

So, basically, all this does is use some Kosmos API to caption the image, and then feed that caption to one of the audiogen models.

As such, this feels like a great opportunity for an extension that leverages one or more LLMs to create the caption, similar to my smartprocess extension for Auto1111.

Load the image, pick an LLM to do the captioning, feed the caption into one of the MusicGen models...
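A rough sketch of that flow, just to make the shape concrete. Everything here is hypothetical: `image_to_sfx`, `Captioner`, `AudioGenerator`, and `SfxResult` are illustrative names, not part of audio-webui or of any captioning or audio library. The captioner and the generators are passed in as plain callables, so a real extension would wrap whichever captioning model and audio models it actually loads. Accepting several generators at once is one possible way to get the model comparison asked for in the original request.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Mapping

# Hypothetical interfaces: a captioner maps raw image bytes to a text caption,
# and an audio generator maps a text prompt to encoded audio bytes.
Captioner = Callable[[bytes], str]
AudioGenerator = Callable[[str], bytes]


@dataclass
class SfxResult:
    caption: str                      # caption produced from the image
    audio_by_model: Dict[str, bytes]  # one audio clip per selected model


def image_to_sfx(
    image_path: str,
    captioner: Captioner,
    generators: Mapping[str, AudioGenerator],
) -> SfxResult:
    """Caption an image once, then feed the caption to every selected model."""
    image_bytes = Path(image_path).read_bytes()  # 1. load the image
    caption = captioner(image_bytes)             # 2. caption it
    # 3. run each chosen audio model on the same caption so outputs are comparable
    audio = {name: generate(caption) for name, generate in generators.items()}
    return SfxResult(caption=caption, audio_by_model=audio)
```

The UI side would then only need a multi-select of model names that builds the `generators` mapping, plus one audio player per entry in `audio_by_model`.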