[🌟 New Model] ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

Bai-YT commented 5 months ago

Model/Pipeline/Scheduler description

ConsistencyTTA, introduced in the paper Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation, is an efficient text-to-audio (TTA) generation model. Relative to a comparable diffusion-based TTA model, ConsistencyTTA achieves a 400x generation speed-up while retaining generation quality and diversity.

Due to its high generation quality and fast inference, we believe integrating this model into diffusers will make diffusers more appealing to text-to-audio generation researchers and users! Thank you very much.

Open source status

[X] The model implementation is available.
[X] The model weights are available (Only relevant if addition is not a scheduler).

Provide useful links for the implementation

The open-source code implementation can be found at https://github.com/Bai-YT/ConsistencyTTA.

There is also a simplified implementation for inference only: https://github.com/Bai-YT/ConsistencyTTA/tree/main/easy_inference.

The model checkpoints can be found at https://huggingface.co/Bai-YT/ConsistencyTTA.

I am the main author of the code, and am more than happy to assist with the integration.

sayakpaul commented 5 months ago

@sanchit-gandhi @Vaibhavs10 FYI.

a-r-r-o-w commented 5 months ago

@Bai-YT Thank you for your awesome work! I just finished understanding the paper and think that I have a good grasp of the modeling and inference code to convert to diffusers.

@sayakpaul Could I pick this up if no one's working on it?

sayakpaul commented 5 months ago

Yeah for sure.

yiyixuxu commented 5 months ago

@a-r-r-o-w cool! but let's put it in the community folder to start with

a-r-r-o-w commented 5 months ago

Sure, sounds good.

Bai-YT commented 5 months ago

Appreciate everyone's time for helping!!! Massive thanks.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Bai-YT commented 1 month ago

Hi everyone,

Thank you for the effort in adding ConsistencyTTA to diffusers! I just wanted to check in to see if there has been any update. If there's anything I can help with, please feel free to let me know!

Sincerely, Yatong

a-r-r-o-w commented 1 month ago

Hi @Bai-YT, thanks for your awesome work! We do have a PR open here, but we also had different plans on how to support it (relevant discussion in the PR). The pipeline works and one can run inference, but I haven't found the time to implement what was discussed in the PR yet. I will try giving it a shot in the near future.

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

a-r-r-o-w commented 3 weeks ago

@sayakpaul Not stale. At some point, it would be nice for Diffusers community scripts to support modeling and pipeline code living in a single file. This is currently not supported: my PR keeps the modeling code in a separate file on the Hub and the pipeline code in the Diffusers community folder, but YiYi raised concerns about this approach, so we didn't proceed with it.

Bai-YT commented 3 weeks ago

Hi Aryan, sorry I just saw the message. Thank you very much for handling this!

I took a look at the PR and it looks awesome! From an algorithmic perspective, I just wanted to mention two things:

1. When ConsistencyTTA is used as a one-step/few-step model (which is what it is designed for), setting the conventional CFG guidance_scale to a value other than 1 will likely not perform well. CFG is instead handled internally by the model through guidance_scale_cond, which should be much more powerful and perform much better.
2. The model was not trained/distilled with negative prompts, so I'm not sure how it will behave with them.
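
To illustrate the first point, here is a minimal sketch of the conventional CFG combination, assuming the standard classifier-free guidance formulation (unconditional prediction plus guidance_scale times the difference to the text-conditioned prediction). The helper `apply_cfg` and the toy values are hypothetical and are not taken from the ConsistencyTTA or diffusers code. The internal guidance_scale_cond mechanism is not modeled here.

```python
from typing import List, Sequence


def apply_cfg(
    uncond: Sequence[float], cond: Sequence[float], guidance_scale: float
) -> List[float]:
    """Conventional classifier-free guidance (hypothetical helper).

    Extrapolates from the unconditional prediction toward the
    text-conditioned prediction by a factor of guidance_scale.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy stand-ins for the two model predictions (made-up values).
uncond_pred = [0.0, 1.0, -2.0]
cond_pred = [0.5, 3.0, -1.0]

# guidance_scale == 1 returns the conditional prediction unchanged,
# i.e. conventional CFG is effectively switched off.
neutral = apply_cfg(uncond_pred, cond_pred, guidance_scale=1.0)

# guidance_scale > 1 pushes the result past the conditional prediction.
amplified = apply_cfg(uncond_pred, cond_pred, guidance_scale=3.0)

print(neutral)
print(amplified)
```

Under this formulation, leaving guidance_scale at 1 means only the conditional prediction is used, so any guidance effect would come solely from guidance_scale_cond.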

I absolutely understand that these options are there for compatibility with other models in the API, and it's very nice to have them. I'm not requesting any code changes, but it might be worth mentioning these points in the documentation so that users are aware of them.

I wish you a nice rest of the day!
