AnyText: Multilingual Visual Text Generation And Editing

huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.

https://huggingface.co/docs/diffusers

Apache License 2.0


AnyText: Multilingual Visual Text Generation And Editing #6407

Open. sayakpaul opened this issue 10 months ago

sayakpaul commented 10 months ago

Model/Pipeline/Scheduler description

From the repository:

AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former uses inputs like text glyph, position, and masked image to generate latent features for text generation or editing. The latter employs an OCR model for encoding stroke data as embeddings, which blend with image caption embeddings from the tokenizer to generate texts that seamlessly integrate with the background. We employed text-control diffusion loss and text perceptual loss for training to further enhance writing accuracy.
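A minimal sketch of the "blend" step in the text embedding module, to make the description above concrete. It assumes the blending works by substitution: each text line in the caption is represented by a placeholder token, and the embedding at that position is replaced with the embedding derived from the OCR model. The function name `blend_glyph_embeddings`, the `placeholder_id` parameter, and the toy values are hypothetical and not taken from the AnyText code.

```python
from typing import List, Sequence

Vector = List[float]


def blend_glyph_embeddings(
    token_ids: Sequence[int],
    token_embeddings: Sequence[Vector],
    glyph_embeddings: Sequence[Vector],
    placeholder_id: int,
) -> List[Vector]:
    """Swap each placeholder token's embedding for the next glyph embedding.

    Assumption: one placeholder token per text line, in reading order.
    """
    if len(token_ids) != len(token_embeddings):
        raise ValueError("token_ids and token_embeddings must have equal length")
    n_placeholders = sum(1 for tid in token_ids if tid == placeholder_id)
    if n_placeholders != len(glyph_embeddings):
        raise ValueError("need exactly one glyph embedding per placeholder token")

    glyph_iter = iter(glyph_embeddings)
    blended: List[Vector] = []
    for tid, emb in zip(token_ids, token_embeddings):
        if tid == placeholder_id:
            glyph = next(glyph_iter)
            if len(glyph) != len(emb):
                raise ValueError("glyph embedding must match the token dimension")
            blended.append(list(glyph))
        else:
            # Ordinary caption tokens keep their tokenizer-derived embedding.
            blended.append(list(emb))
    return blended


# Toy example: token id 9 is the (hypothetical) placeholder.
token_ids = [5, 9, 7, 9]
token_embeddings = [[0.1, 0.2], [0.0, 0.0], [0.3, 0.4], [0.0, 0.0]]
glyph_embeddings = [[1.0, 1.0], [2.0, 2.0]]
blended = blend_glyph_embeddings(
    token_ids, token_embeddings, glyph_embeddings, placeholder_id=9
)
```

In a real pipeline the glyph embeddings would first be projected to the text encoder's token dimension; that projection is left out here.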

Open source status

[X] The model implementation is available.
[X] The model weights are available (Only relevant if addition is not a scheduler).

Provide useful links for the implementation

Repository: https://github.com/tyxsspa/AnyText

Paper: https://arxiv.org/abs/2311.03054

Weights and inference code: https://modelscope.cn/models/damo/cv_anytext_text_generation_editing/summary

sayakpaul commented 10 months ago

Yes, sure! Feel free to let us know if you need any help.

For starters, I think it might be better to add this to research_projects, similar to ControlNetXS.

We might not be able to add it to community because AnyText has modelling components.

Does this make sense? If we see enough usage, we can include it in the core.
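For context on the community option: a community pipeline in diffusers is a single pipeline file that users load through the `custom_pipeline` argument of `DiffusionPipeline.from_pretrained`, which is why extra modelling components are awkward to ship that way. A rough sketch of the loading side follows. The helper names are hypothetical, and the `diffusers` import is deferred so the block can be read and run without the library installed.

```python
from typing import Any, Dict


def community_pipeline_kwargs(pipeline_name: str, **extra: Any) -> Dict[str, Any]:
    """Build keyword arguments that select a community pipeline by name."""
    return {"custom_pipeline": pipeline_name, **extra}


def load_community_pipeline(checkpoint_id: str, pipeline_name: str, **extra: Any):
    """Load `checkpoint_id` with the pipeline class defined in a community file.

    Both arguments are placeholders here; no AnyText community pipeline
    is assumed to exist.
    """
    # Imported lazily: requires `pip install diffusers` and network access.
    from diffusers import DiffusionPipeline

    return DiffusionPipeline.from_pretrained(
        checkpoint_id, **community_pipeline_kwargs(pipeline_name, **extra)
    )
```

A pipeline under research_projects is instead imported from its own folder, so it can carry additional model classes next to the pipeline code.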

a-r-r-o-w commented 10 months ago

Hi @coding-famer. Have you been able to make progress on this? I'd very much like to be able to use this with diffusers, and would like to help where I can. From the pipeline perspective, I understand most of the code and have made some significant progress. From the modelling perspective, I'm not too sure about what new additions need to be made as I'm still navigating the codebase.

This is a link to the converted AnyText model on Hugging Face, which might be of help. Downloading it from the ModelScope hub servers, which I assume are located in China, took a very long time (~18 hours). I'm hoping the conversion to diffusers format was correct. I'm still looking into it and don't have the full picture yet, but it seems that different weights are used in the CLIP encoder depending on the embedding type. The ocr and vit variants only seem to be useful for text editing, which could probably be done sometime in the future; for now, replicating the text-generation part would be great. The relevant code is here:

https://github.com/tyxsspa/AnyText/blob/cd8924720896462ad61e2adaf086b669340207e0/cldm/embedding_manager.py#L75

a-r-r-o-w commented 9 months ago

Hi, I'm still working on this. Happy to do it together.

Hey, sorry for the late response. I got caught up with other PRs and looking into other interesting work. Would Discord be okay for communication if you're still progressing on this?

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

sayakpaul commented 8 months ago

Contributions are still welcome.

tuanh123789 commented 8 months ago

@sayakpaul can I work on this?

sayakpaul commented 8 months ago

Sure, we can start with a community pipeline :)

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

bghira commented 7 months ago

not stale

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

bghira commented 6 months ago

not stale

tolgacangoz commented 5 months ago

Can I work on this community pipeline?

Edit 1: I have been busy for the past several weeks because of personal issues. From now on, I am fully focused on this. Sorry for holding up this pipeline so far.

Edit 2: I now largely understand the pipeline and am trying to convert the checkpoint into diffusers format. It has a ControlNet model and several other special components.
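As an illustration of the first step in such a conversion, the sketch below groups the keys of a flat checkpoint by component prefix so each group can be mapped to its diffusers counterpart. The first four prefixes follow the naming used by the original latent-diffusion and ControlNet checkpoints; whether the AnyText checkpoint uses exactly these names is an assumption, and `embedding_manager.` is only guessed from the module name linked earlier in this thread. The sample keys are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed prefix -> component mapping (not verified against the AnyText checkpoint).
COMPONENT_PREFIXES: Dict[str, str] = {
    "model.diffusion_model.": "unet",
    "control_model.": "controlnet",
    "first_stage_model.": "vae",
    "cond_stage_model.": "text_encoder",
    "embedding_manager.": "embedding_manager",
}


def group_keys_by_component(
    keys: Iterable[str], prefixes: Dict[str, str] = COMPONENT_PREFIXES
) -> Dict[str, List[str]]:
    """Bucket state-dict keys by component and strip the matched prefix."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key in keys:
        for prefix, component in prefixes.items():
            if key.startswith(prefix):
                groups[component].append(key[len(prefix):])
                break
        else:
            # Keep unknown keys visible so nothing is dropped silently.
            groups["unmatched"].append(key)
    return dict(groups)


# Made-up keys, for illustration only.
sample_keys = [
    "model.diffusion_model.input_blocks.0.0.weight",
    "control_model.input_hint_block.0.weight",
    "embedding_manager.proj.weight",
    "some_other_module.bias",
]
grouped = group_keys_by_component(sample_keys)
```

With a real checkpoint, `keys` would be the state dict's keys after loading the file, and a non-empty `unmatched` group is a hint that a special component still needs its own conversion rule.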

sayakpaul commented 5 months ago

Yes, you can. Thank you :)

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.