Open Merzmensch opened 5 months ago
So basically, all this does is call a Kosmos API to caption the image, then feed that caption to one of the AudioGen models.
As such, this feels like a great opportunity for an extension that leverages one or more LLMs to create the caption, similar to my smartprocess extension for Auto1111.
Load the image, pick an LLM to do the captioning, feed the caption into one of the MusicGen models...
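To make that flow concrete, here is a rough sketch of how such an extension could be wired up. Everything in it is hypothetical: `image_to_sfx`, `Image2SfxResult`, and the `Captioner`/`AudioModel` callables are placeholder names for illustration, not the Image2SFX space's code or any real library's API. The captioner and the audio models are plain callables, so any LLM or audio backend could be plugged in behind them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

# Hypothetical plug-in points: any captioning LLM and any text-to-audio model
# can be wrapped to match these signatures.
Captioner = Callable[[bytes], str]   # image bytes -> caption text
AudioModel = Callable[[str], bytes]  # caption text -> audio bytes


@dataclass
class Image2SfxResult:
    caption: str
    audio_by_model: Dict[str, bytes]  # one clip per selected model


def image_to_sfx(
    image: bytes,
    captioner: Captioner,
    audio_models: Dict[str, AudioModel],
    selected: Iterable[str],
) -> Image2SfxResult:
    """Caption the image once, then run the caption through each selected model."""
    names = list(selected)
    unknown = [name for name in names if name not in audio_models]
    if unknown:
        raise KeyError(f"unknown audio model(s): {unknown}")

    # The caption is computed a single time so every model gets the same
    # prompt, which keeps a side-by-side comparison fair.
    caption = captioner(image)
    clips = {name: audio_models[name](caption) for name in names}
    return Image2SfxResult(caption=caption, audio_by_model=clips)
```

In a UI, `selected` would come from whatever model picker the extension exposes, and each entry in `audio_by_model` could be rendered as its own audio player for comparison.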
Would it be possible to implement Image2SFX (https://huggingface.co/spaces/fffiloni/Image2SFX-comparison)? It would be especially useful to be able to compare different models, perhaps with a multi-select UI where you can pick the models you would like to use.
Thank you!