microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

quality of text diffuser-2 #1416

Open lwb2099 opened 10 months ago

lwb2099 commented 10 months ago

Describe

Model I am using (Text diffuser-2):

I am running inference on text diffuser-2. My inference command is:

CUDA_VISIBLE_DEVICES=6 python inference_textdiffuser2_t2i_full.py \
  --pretrained_model_name_or_path="/path/to/stable-diffusion-v1-5" \
  --mixed_precision="fp16" \
  --enable_xformers_memory_efficient_attention \
  --resume_from_checkpoint="/path/to/textdiffuser-2" \
  --granularity=128 \
  --max_length=77 \
  --coord_mode="ltrb" \
  --cfg=7.5 \
  --sample_steps=20 \
  --seed=43555 \
  --vis_num 16 \
  --m1_model_path="/path/to/layout_planner" \
  --input_format='prompt' \
  --input_prompt 'A picture of a bruised apple with the text apples are good for you' \
  --output_dir="."

The generated results look like this:

[image]

It looks like something is going wrong. Further tests on some data from MarioEval:

[image] [image]

JingyeChen commented 10 months ago

Did you specify the path of the checkpoints?
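
One way to rule this out before launching inference is to check that every path argument points at something that exists and is not an empty directory. The following is a minimal sketch, not part of the repository: the `check_paths` helper is hypothetical, and the `/path/to/...` values are the placeholders from the command above.

```python
from pathlib import Path


def check_paths(paths):
    """Map each argument name to a problem description, for paths that look wrong.

    Hypothetical helper: it only checks that the path exists and, if it is a
    directory, that it contains at least one entry. It does not validate the
    checkpoint contents.
    """
    problems = {}
    for name, value in paths.items():
        p = Path(value)
        if not p.exists():
            problems[name] = "does not exist"
        elif p.is_dir() and not any(p.iterdir()):
            problems[name] = "is an empty directory"
    return problems


if __name__ == "__main__":
    # Placeholder values copied from the command in the first comment.
    args = {
        "pretrained_model_name_or_path": "/path/to/stable-diffusion-v1-5",
        "resume_from_checkpoint": "/path/to/textdiffuser-2",
        "m1_model_path": "/path/to/layout_planner",
    }
    for name, problem in check_paths(args).items():
        print(f"--{name}: {problem}")
```

An empty result means all paths passed this shallow check; it says nothing about whether the weights inside are the right ones.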

haberchr commented 10 months ago

@JingyeChen Are there any checkpoints of the TextDiffuser-2 models available based on SD 2.1? If not, are there significant modifications to the code required to support the higher-resolution SD model? And, if so, is the training code to support SD 2.1 training released?

IngLP commented 8 months ago

I would love to test a checkpoint based on SD 2.1 too. The paper already mentions that results based on SD 2.1 are better.

5RJ commented 5 months ago

I get similar results. This is the command I ran, and the results are as follows:

export CUDA_VISIBLE_DEVICES=4
accelerate launch inference_textdiffuser2_t2i_full.py \
  --pretrained_model_name_or_path="/home/jovyan/wrj/workspace/project/tools/stable-diffusion-v1-5" \
  --mixed_precision="fp16" \
  --output_dir="inference_results" \
  --enable_xformers_memory_efficient_attention \
  --resume_from_checkpoint="/home/jovyan/wrj/workspace/project/unilm/textdiffuser-2/ckpt/JingyeChen22/textdiffuser2-full-ft" \
  --granularity=128 \
  --max_length=77 \
  --coord_mode="ltrb" \
  --cfg=7.5 \
  --sample_steps=20 \
  --seed=43555 \
  --m1_model_path="/home/jovyan/wrj/workspace/project/unilm/textdiffuser-2/ckpt/JingyeChen22/textdiffuser2_layout_planner" \
  --input_format='prompt' \
  --input_prompt='the log for "ABC"'

[image]

Is it working normally?

And the log is as follows:

(textdiffuser2) jovyan@nb-big-dz-mxfw-1-0:~/wrj/workspace/project/unilm/textdiffuser-2$ bash inference_textdiffuser2_t2i_full.sh
/opt/conda/envs/textdiffuser2/lib/python3.8/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
GPU name: NVIDIA A100 80GB PCIe
Number of GPUs: 1
Namespace(cache_dir=None, cfg=7.5, checkpointing_steps=500, checkpoints_total_limit=5, coord_mode='ltrb', dataloader_num_workers=0, drop_caption=False, enable_xformers_memory_efficient_attention=True, granularity=128, hub_model_id=None, hub_token=None, input_file=None, input_format='prompt', input_prompt='the log for "ABC"', local_rank=-1, logging_dir='logs', m1_model_path='/home/jovyan/wrj/workspace/project/unilm/textdiffuser-2/ckpt/JingyeChen22/textdiffuser2_layout_planner', max_length=77, mixed_precision='fp16', output_dir='inference_results', pretrained_model_name_or_path='/home/jovyan/wrj/workspace/project/tools/stable-diffusion-v1-5', prompts_txt_file=None, push_to_hub=False, report_to='tensorboard', resolution=512, resume_from_checkpoint='/home/jovyan/wrj/workspace/project/unilm/textdiffuser-2/ckpt/JingyeChen22/textdiffuser2-full-ft', revision=None, sample_steps=20, seed=43555, vis_num=16)
/opt/conda/envs/textdiffuser2/lib/python3.8/site-packages/accelerate/accelerator.py:401: UserWarning: log_with=tensorboard was passed but no supported trackers are currently installed.
warnings.warn(f"log_with={log_with} was passed but no supported trackers are currently installed.")
Detected kernel version 5.4.160, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
05/31/2024 09:18:32 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

49408 51583
{'scaling_factor', 'force_upcast'} was not found in config. Values will be initialized to default values.
{'addition_time_embed_dim', 'reverse_transformer_layers_per_block', 'transformer_layers_per_block', 'dropout', 'attention_type'} was not found in config. Values will be initialized to default values.
Resuming from checkpoint textdiffuser2-full-ft
05/31/2024 09:18:46 - INFO - accelerate.accelerator - Loading states from /home/jovyan/wrj/workspace/project/unilm/textdiffuser-2/ckpt/JingyeChen22/textdiffuser2-full-ft
05/31/2024 09:18:46 - INFO - accelerate.checkpointing - All model weights loaded successfully
05/31/2024 09:18:46 - INFO - accelerate.checkpointing - All optimizer states loaded successfully
05/31/2024 09:18:46 - INFO - accelerate.checkpointing - All scheduler states loaded successfully
05/31/2024 09:18:46 - INFO - accelerate.checkpointing - All dataloader sampler states loaded successfully
05/31/2024 09:18:46 - INFO - accelerate.checkpointing - GradScaler state loaded successfully
05/31/2024 09:18:46 - INFO - accelerate.checkpointing - Could not load random states
05/31/2024 09:18:46 - INFO - accelerate.accelerator - Loading in 0 custom states
detect existing output_dir, removing the contained jpg/txt files ...
rm: cannot remove 'inference_results/.jpg': No such file or directory
rm: cannot remove 'inference_results/.txt': No such file or directory
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:25<00:00, 8.50s/it]
there are 1 samples for generation
[Human] Given a prompt that will be used to generate an image, plan the layout of visual text for the image. The size of the image is 128x128. Therefore, all properties of the positions should not exceed 128, including the coordinates of top, left, right, and bottom. All keywords are included in the caption. You dont need to specify the details of font styles. At each line, the format should be keyword left, top, right, bottom. So let us begin.
Prompt: the log for "ABC"
[Assistant] ABC 22,24,114,79

the number of samples: 1
user_prompt the log for "ABC"
current_ocr ['ABC 22,24,114,79', '']
/opt/conda/envs/textdiffuser2/lib/python3.8/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
{'clip_sample_range', 'prediction_type', 'timestep_spacing', 'thresholding', 'variance_type', 'dynamic_thresholding_ratio', 'sample_max_value'} was not found in config. Values will be initialized to default values.
100%|███████████████████████████████████████████| 20/20 [00:07<00:00, 2.76it/s]
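
For reading the layout planner's reply in the log (`ABC 22,24,114,79`), it can help to convert the box from the 128x128 planning grid into output pixels. The sketch below assumes the grid maps linearly onto the 512-pixel output (`granularity=128` and `resolution=512` in the Namespace line); the inference script may handle this differently. `parse_layout_line` and `scale_box` are hypothetical names, not functions from the repository.

```python
def parse_layout_line(line):
    """Split a 'keyword left,top,right,bottom' line into (keyword, box)."""
    # rsplit so that keywords containing spaces stay intact.
    keyword, coords = line.rsplit(" ", 1)
    left, top, right, bottom = (int(v) for v in coords.split(","))
    return keyword, (left, top, right, bottom)


def scale_box(box, granularity=128, resolution=512):
    """Scale a box from the planning grid to pixels, assuming a linear mapping."""
    factor = resolution / granularity
    return tuple(round(v * factor) for v in box)


keyword, box = parse_layout_line("ABC 22,24,114,79")
print(keyword, box, scale_box(box))
# prints: ABC (22, 24, 114, 79) (88, 96, 456, 316)
```

Under that assumption, the planner asked for "ABC" to occupy a wide box in the upper-middle of the image, which gives a reference region to compare against the generated image.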