irthomasthomas / undecidability

TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering #704

Open irthomasthomas opened 9 months ago

irthomasthomas commented 9 months ago

TITLE: unilm/textdiffuser-2/README.md at master · microsoft/unilm

DESCRIPTION:

"# TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering

TextDiffuser-2 exhibits enhanced capability powered by language models. In addition to generating text with remarkable accuracy, TextDiffuser-2 provides plausible text layouts and demonstrates a diverse range of text styles.

:star2: Highlights

:stopwatch: News

:hammer_and_wrench: Installation

Clone this repo:

git clone https://github.com/microsoft/unilm/
cd unilm/textdiffuser-2

Create a new environment and install the packages as follows:

conda create -n textdiffuser2 python=3.8
conda activate textdiffuser2
pip install -r requirements.txt

Also install versions of torch, torchvision, and xformers that match your system and CUDA version (refer to this link). Install flash-attention as well if you want to train the layout planner using FastChat. The list of packages used in the experiments is provided at link for reference.
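
To check which of these packages are already present in the active environment, a minimal sketch using only the standard library is shown below. The distribution names (`torch`, `torchvision`, `xformers`, `flash-attn`) are assumed to be the PyPI names; the helper `installed_versions` is hypothetical and not part of this repo.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Assumed PyPI distribution names; adjust if your setup differs.
PACKAGES = ("torch", "torchvision", "xformers", "flash-attn")


def installed_versions(names: Iterable[str] = PACKAGES) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

This only reports version strings; it does not verify that the installed builds match your CUDA version.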

To train the text inpainting task, install the diffusers package using the command pip install git+https://github.com/JingyeChen/diffusers_td2.git. Note that the U-Net architecture has been modified to accept additional input features.

If you encounter the error RuntimeError: expected scalar type float Float but found Half, triggered by diffusers/models/attention_processor.py, replace that file in the installed diffusers library with the provided attention_processor.py.

:floppy_disk: Checkpoint

We have uploaded the checkpoints to HuggingFace🤗.

Note that we provide the checkpoint with context length 77, as it yields better results when rendering general objects.

:books: Dataset

The data for training the layout planner is at link.

We employ the MARIO-10M dataset for training TextDiffuser-2. Please follow the Dataset section at TextDiffuser to download the dataset, including the train_dataset_index_file.

The train_dataset_index_file should be a .txt file, and each line should indicate an index of a training sample.

06269_062690093
27197_271975251
27197_271978467
...
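
As a quick sanity check on an index file, the sketch below reads it and splits each entry at the underscore. The two-part, digits-only format is inferred from the three sample lines above and is an assumption; the helper names `read_index_file` and `split_index` are hypothetical and not part of this repo.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def read_index_file(path: str) -> List[str]:
    """Return the sample indices listed in the file, one per non-empty line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def split_index(index: str) -> Tuple[str, str]:
    """Split an entry such as '06269_062690093' at the underscore.

    Assumption: both parts consist of digits only, as in the samples above.
    """
    prefix, sep, sample_id = index.partition("_")
    if not sep or not prefix.isdigit() or not sample_id.isdigit():
        raise ValueError(f"unexpected index format: {index!r}")
    return prefix, sample_id


if __name__ == "__main__":
    # Write the three sample lines to a temporary file and read them back.
    with tempfile.TemporaryDirectory() as tmp:
        index_path = Path(tmp) / "train_dataset_index.txt"
        index_path.write_text("06269_062690093\n27197_271975251\n27197_271978467\n")
        for index in read_index_file(str(index_path)):
            print(index, split_index(index))
```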

:steam_locomotive: Train

Train layout planner

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 --master_port=20003 fastchat/train/train_mem.py \
    --model_name_or_path lmsys/vicuna-7b-v1.5  \
    --data_path data/layout_planner_data_5k.json \
    --bf16 True \
    --output_dir experiment_result \
    --num_train_epochs 6 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

It is normal for the loss curve to look like a staircase (figure: Loss Curve).
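
For orientation, the effective batch size implied by the layout planner command can be worked out from its flags. The sketch below assumes one process per GPU (8, from `--nproc_per_node=8`). The dataset size of 5,000 is only a guess based on the file name `layout_planner_data_5k.json`, so the step counts are rough estimates and may differ from what the trainer reports.

```python
import math

# Values taken from the layout planner command above.
num_gpus = 8                      # --nproc_per_node=8
per_device_batch_size = 2         # --per_device_train_batch_size 2
gradient_accumulation_steps = 16  # --gradient_accumulation_steps 16
num_epochs = 6                    # --num_train_epochs 6

# Assumption: "5k" in the data file name means about 5,000 samples.
assumed_num_samples = 5000

effective_batch_size = num_gpus * per_device_batch_size * gradient_accumulation_steps
approx_steps_per_epoch = math.ceil(assumed_num_samples / effective_batch_size)
approx_total_steps = approx_steps_per_epoch * num_epochs

print(f"effective batch size: {effective_batch_size}")
print(f"approx. optimizer steps per epoch: {approx_steps_per_epoch}")
print(f"approx. total optimizer steps: {approx_total_steps}")
```

A staircase-shaped loss is often seen when a small dataset is revisited over several epochs, with drops near epoch boundaries. That is a common explanation rather than something stated here, so treat it as a hypothesis when reading your own curve.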

Train diffusion model

For full-parameter training:

accelerate launch train_textdiffuser2_t2i_full.py \
    --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
    --train_batch_size=18 \
    --gradient_accumulation_steps=4 \
    --gradient_checkpointing \
    --mixed_precision="fp16" \
    --num_train_epochs=6 \
    --learning_rate=1e-5 \
    --max_grad_norm=1 \
    --lr_scheduler="constant" \
    --lr_warmup_steps=0 \
    --output_dir="diffusion_experiment_result" \
    --enable_xformers_memory_efficient_attention \
    --dataloader_num_workers=8 \
    --index_file_path='/path/to/train_dataset_index.txt' \
    --dataset_path='/path/to/laion-ocr-select/' \
    --granularity=128 \
    --coord_mode="ltrb" \
    --max_length=77 \
    --resume_from_checkpoint="latest"

For LoRA training:

accelerate launch train_textdiffuser2_t2i_lora.py \
    --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
    --train_batch_size=18 \
    --gradient_accumulation_steps=4 \
    --gradient_checkpointing \
    --mixed_precision="fp16" \
    --num_train_epochs=6 \
    --learning_rate=1e-4 \
    --text_encoder_learning_rate=1e-5 \
    --lr_scheduler="constant" \
    --output_dir="diffusion_experiment_result" \
    --enable_xformers_memory_efficient_attention \
    --dataloader_num_workers=8 \
    --index_file_path='/path/to/train_dataset_index.txt' \
    --dataset_path='/path/to/laion-ocr-select/' \
    --granularity=128 \
    --coord_mode="ltrb" \
    --max_length=77 \
    --resume_from_checkpoint="latest"

If you encounter an "out-of-memory" error, please consider reducing the batch size appropriately.
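
One common way to lower memory use without changing the effective batch size is to reduce `--train_batch_size` and raise `--gradient_accumulation_steps` by the same factor. The helper below is a hypothetical illustration (the name `rebalance` is not part of this repo) using the values 18 and 4 from the commands above.

```python
def rebalance(train_batch_size: int, grad_accum_steps: int, new_batch_size: int) -> int:
    """Return the accumulation steps that keep batch_size * accum_steps unchanged.

    Raises ValueError if the per-process effective batch size is not divisible
    by the new batch size.
    """
    if new_batch_size <= 0:
        raise ValueError("new_batch_size must be positive")
    effective = train_batch_size * grad_accum_steps
    if effective % new_batch_size != 0:
        raise ValueError(
            f"{new_batch_size} does not divide the effective batch size {effective}"
        )
    return effective // new_batch_size


if __name__ == "__main__":
    # Values from the commands above: --train_batch_size=18, --gradient_accumulation_steps=4
    for candidate in (9, 6, 3):
        steps = rebalance(18, 4, candidate)
        print(f"--train_batch_size={candidate} --gradient_accumulation_steps={steps}")
```

Smaller batches with more accumulation steps generally trade memory for wall-clock time, so expect each optimizer step to take longer.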

:firecracker: Inference

For full-parameter inference:

accelerate launch inference_textdiffuser2_t2i_full.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --mixed_precision="fp16" \
  --output_dir="inference_results" \
  --enable_xformers_memory_efficient_attention \
  --resume_from_checkpoint="JingyeChen22/textdiffuser2-full-ft" \
  --granularity=128 \
  --max_length=77 \
  --coord_mode="ltrb" \
  --cfg=7.5 \
  --sample_steps=20 \
  --seed=43555 \
  --m1_model_path="JingyeChen22/textdiffuser2_layout_planner" \
  --input_format='prompt' \
  --input_prompt='a hotdog with mustard and other toppings on it'
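
To run the same configuration over several prompts, the command can be assembled programmatically. The sketch below is a hypothetical wrapper: it only reuses the flags from the full-parameter command above, assumes the script is in the current directory, and does not check how the script names its output files, so results from successive runs may need separate `--output_dir` values.

```python
import shlex
from typing import List

# Flags copied from the full-parameter inference command above.
BASE_ARGS = [
    "accelerate", "launch", "inference_textdiffuser2_t2i_full.py",
    "--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5",
    "--mixed_precision=fp16",
    "--output_dir=inference_results",
    "--enable_xformers_memory_efficient_attention",
    "--resume_from_checkpoint=JingyeChen22/textdiffuser2-full-ft",
    "--granularity=128",
    "--max_length=77",
    "--coord_mode=ltrb",
    "--cfg=7.5",
    "--sample_steps=20",
    "--m1_model_path=JingyeChen22/textdiffuser2_layout_planner",
    "--input_format=prompt",
]


def build_command(prompt: str, seed: int = 43555) -> List[str]:
    """Return the argument list for one inference run with the given prompt and seed."""
    return BASE_ARGS + [f"--seed={seed}", f"--input_prompt={prompt}"]


if __name__ == "__main__":
    prompts = ["a hotdog with mustard and other toppings on it", "a stamp of u.s.a"]
    for prompt in prompts:
        # Print a shell-safe command line; pass the list to subprocess.run(...) to execute it.
        print(shlex.join(build_command(prompt)))
```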

For LoRA inference:

accelerate launch inference_textdiffuser2_t2i_lora.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --mixed_precision="fp16" \
  --output_dir="inference_results" \
  --enable_xformers_memory_efficient_attention \
  --resume_from_checkpoint="JingyeChen22/textdiffuser2-lora-ft" \
  --granularity=128 \
  --coord_mode="ltrb" \
  --cfg=7.5 \
  --sample_steps=50 \
  --seed=43555 \
  --m1_model_path="JingyeChen22/textdiffuser2_layout_planner" \
  --input_format='prompt' \
  --input_prompt='a stamp of u.s.a'

:joystick: Demo

TextDiffuser-2 has been deployed on Hugging Face. Feel free to try it out! You can also run python gradio_demo.py to use the demo locally.

:love_letter: Acknowledgement

We sincerely thank AK and hysts for helping set up the demo. We are also grateful for the available code, APIs, and demos of SDXL, PixArt, Ideogram, DALLE-3, and GlyphControl.

:exclamation: Disclaimer

Please note that the code is intended for academic and research purposes ONLY. Any use of the code for generating inappropriate content is strictly prohibited. The responsibility for any misuse or inappropriate use of the code lies solely with the users who generated such content, and this code shall not be held liable for any such use.

:envelope: Contact

For help or issues using TextDiffuser-2, please email Jingye Chen (qwerty.chen@connect.ust.hk), Yupan Huang (huangyp28@mail2.sysu.edu.cn) or submit a GitHub issue.

For other communications related to TextDiffuser-2, please contact Lei Cui (lecu@microsoft.com) or Furu Wei (fuwei@microsoft.com).

:herb: Citation

If you find TextDiffuser-2 useful in your research, please consider citing:


@article{chen2023textdiffuser,
  title={TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering},
  author={Chen, Jingye and Huang, Yupan and Lv, Tengchao and Cui, Lei and Chen, Qifeng and Wei, Furu},
  journal={arXiv preprint arXiv:2311.16465},
  year={2023}
}
"

URL: [GitHub Repository](https://github.com/microsoft/unilm/blob/master/textdiffuser-2/README.md?plain=1)

irthomasthomas commented 9 months ago

Related content

625 - Similarity score: 0.89

706 - Similarity score: 0.87

715 - Similarity score: 0.87

552 - Similarity score: 0.87

627 - Similarity score: 0.86

499 - Similarity score: 0.86