teticio / audio-diffusion

Apply diffusion models using the new Hugging Face diffusers package to synthesize music instead of images.
GNU General Public License v3.0

Music generation conditioned on text and music #46

Closed · favazmuhammed closed this issue 8 months ago

favazmuhammed commented 8 months ago

Is it possible to generate music conditioned on both text and music?

Also, I'm having an issue running conditional_generation.ipynb: while loading `audio_diffusion = AudioDiffusion(model_id="teticio/conditional-latent-audio-diffusion-512")`, I get the error `ValueError: mel/audio_diffusion.py as defined in model_index.json does not exist in teticio/conditional-latent-audio-diffusion-512 and is not a module in 'diffusers/pipelines'`.

teticio commented 8 months ago

Oh, thanks for pointing this out! My config was still pointing to the Mel class in audiodiffusion from before the migration to diffusers. I presume a recent change in diffusers has restricted custom pipeline modules to Hugging Face packages. I have fixed it.
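For what it's worth, here is a rough sketch (not the actual file contents) of the kind of change involved, assuming the standard diffusers convention of listing each pipeline component in model_index.json as a ["library", "class"] pair:

```python
# A sketch of the relevant model_index.json entry, shown as Python dicts.
# The exact values are assumptions inferred from the error message above.
broken = {"mel": ["audio_diffusion", "Mel"]}  # points at a module in this repo, which diffusers now refuses to import
fixed = {"mel": ["diffusers", "Mel"]}         # points at the Mel class that was upstreamed into diffusers
```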

teticio commented 8 months ago

Ah, and regarding your question: yes, it is possible, but you would have to train a model yourself. I purposefully made the "encodings" generic: they are just a bunch of numbers associated with each audio file. I gave an example of how you could use AudioEncoder, but you could just as easily associate a text with each audio and pass it through some model (e.g., SentenceTransformers) to get an embedding (a bunch of numbers) that you use for conditional generation, as sketched below.
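To make that concrete, here is a minimal sketch of the text route, assuming the sentence-transformers package; the model name, file names, and captions are hypothetical, and the resulting vectors would simply take the place of the AudioEncoder outputs as the per-audio encodings used during training:

```python
# A sketch, not the repo's actual API: compute one embedding per audio
# file from an associated caption, to serve as its conditional "encoding".
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice

# Hypothetical captions keyed by audio file name.
captions = {
    "track_001.wav": "melancholic piano over soft rain",
    "track_002.wav": "upbeat synthwave with a driving bassline",
}

# Each embedding is a fixed-length vector of numbers (384-dimensional
# for this model), i.e. exactly the kind of generic encoding described above.
encodings = {path: model.encode(text) for path, text in captions.items()}
```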