tyxsspa / AnyText

Official implementation of the paper "AnyText: Multilingual Visual Text Generation And Editing"
Apache License 2.0

Performance drop compared to the demo #103

Closed: myungkyuKoo closed this issue 4 months ago

myungkyuKoo commented 4 months ago
[image: generated results from the local run]

When I run the code from this repository with the 'anytext_v1.1.ckpt' file provided by ModelScope, the results are very different from the HuggingFace demo. What could be the cause?
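For reference, the startup log below comes from launching the Gradio demo (demo.py). The programmatic equivalent, a minimal sketch adapted from the inference example in the repository's README, is shown here; the `draw_pos` path, the `save_images` helper, and the exact parameter names are assumptions that may differ between versions of the repo:

```python
# Minimal sketch of an inference call with the ModelScope pipeline
# (adapted from the repository's README; paths and kwargs may differ locally).
from modelscope.pipelines import pipeline
from util import save_images  # helper shipped with this repository (assumption)

pipe = pipeline('my-anytext-task',
                model='damo/cv_anytext_text_generation_editing',
                model_revision='v1.1.2',
                use_fp16=True)

params = {"show_debug": True, "image_count": 4, "ddim_steps": 20}
input_data = {
    "prompt": 'A raccoon stands in front of the blackboard with the words "Deep Learning" written on it',
    "seed": 33789703,
    "draw_pos": 'example_images/gen1.png',  # placeholder: mask marking where the text should be rendered
}

# Depending on the repo version, the pipeline returns (images, code, warning, debug_info).
results, rtn_code, rtn_warning, debug_info = pipe(input_data, mode='text-generation', **params)
if rtn_code >= 0:
    save_images(results, 'SaveImages')
```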

2024-06-08 17:15:16,028 - modelscope - INFO - PyTorch version 2.0.1 Found.                                                                                                                                                                            
2024-06-08 17:15:16,029 - modelscope - INFO - TensorFlow version 2.13.0 Found.                                                                                                                                                                        
2024-06-08 17:15:16,029 - modelscope - INFO - Loading ast index from /home/myungkyu/.cache/modelscope/ast_indexer                                                                                                                                     
2024-06-08 17:15:16,058 - modelscope - INFO - Loading done! Current index file version is 1.10.0, with md5 82882da0bb7031809a4831af2febb197 and a total number of 946 components indexed                                                              
2024-06-08 17:15:19,076 - modelscope - INFO - Use user-specified model revision: v1.1.2                                                                                                                                                               
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 18.3k/18.3k [00:00<00:00, 412kB/s]                                                                                                  
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 7.01k/7.01k [00:00<00:00, 22.4MB/s]                                                                                                  
2024-06-08 17:15:22,247 - modelscope - WARNING - ('PIPELINES', 'my-anytext-task', 'anytext-pipeline') not found in ast index file                                                                                                                     
ControlLDM: Running in eps-prediction mode                                                                                                                                                                                                            
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 768 and using 8 heads.                                                                                                                                                     
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 768 and using 8 heads.                                                                                                                                                     
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 768 and using 8 heads.                                                                                                                                                     
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 768 and using 8 heads.                                                                                                                                                     
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.                                                                                                                                                   
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.                                                                                                                                                   
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.                                                                                                                                                   
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.                                                                                                                                                   
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.                                                                                                                                                   
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.                                                                                                                                                   
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 768 and using 8 heads.                                                                                                                                                     
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 768 and using 8 heads.                                                                                                                                                     
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 768 and using 8 heads.                                                                                                                                                     
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 768 and using 8 heads.                                                                                                                                                     
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 768 and using 8 heads.                                                                                                                                                     
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 768 and using 8 heads.                                                                                                                                                     
DiffusionWrapper has 859.52 M params.                                                                                      
making attention of type 'vanilla-xformers' with 512 in_channels                                                           
building MemoryEfficientAttnBlock with 512 in_channels...                                                                  
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.                                                                  
making attention of type 'vanilla-xformers' with 512 in_channels                                                           
building MemoryEfficientAttnBlock with 512 in_channels... 
Some weights of the model checkpoint at /home/myungkyu/.cache/modelscope/hub/damo/cv_anytext_text_generation_editing/clip-vit-large-patch14 were not used when initializing CLIPTextModel: [all vision_model.* parameters (embeddings, encoder layers 0-23, pre/post layer norms), plus 'visual_projection.weight', 'text_projection.weight', and 'logit_scale'; full list omitted]
- This IS expected if you are initializing CLIPTextModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CLIPTextModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 768 and using 8 heads.                                                                                                                                                     
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 768 and using 8 heads.                                                                                                                                                     
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 768 and using 8 heads.                                                                                                                                                     
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 768 and using 8 heads.                                                                                                                                                     
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.                                                                                                                                                   
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.                                                                                                                                                   
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.                                                                                                                                                    
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 8 heads.                                                                                                                                                   
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 768 and using 8 heads.                                                                                                                                                    
Loaded model config from [models_yaml/anytext_sd15.yaml]                                                                   
Loaded state_dict from [/home/myungkyu/.cache/modelscope/hub/damo/cv_anytext_text_generation_editing/anytext_v1.1.ckpt]                                                                                                                               
2024-06-08 17:15:45,715 - modelscope - INFO - initiate model from /home/myungkyu/.cache/modelscope/hub/damo/cv_anytext_text_generation_editing/nlp_csanmt_translation_zh2en
2024-06-08 17:15:45,715 - modelscope - INFO - initiate model from location /home/myungkyu/.cache/modelscope/hub/damo/cv_anytext_text_generation_editing/nlp_csanmt_translation_zh2en.
2024-06-08 17:15:45,716 - modelscope - INFO - initialize model from /home/myungkyu/.cache/modelscope/hub/damo/cv_anytext_text_generation_editing/nlp_csanmt_translation_zh2en
{'hidden_size': 1024, 'filter_size': 4096, 'num_heads': 16, 'num_encoder_layers': 24, 'num_decoder_layers': 6, 'attention_dropout': 0.0, 'residual_dropout': 0.0, 'relu_dropout': 0.0, 'layer_preproc': 'layer_norm', 'layer_postproc': 'none', 'share
d_embedding_and_softmax_weights': True, 'shared_source_target_embedding': True, 'initializer_scale': 0.1, 'position_info_type': 'absolute', 'max_relative_dis': 16, 'num_semantic_encoder_layers': 4, 'src_vocab_size': 50000, 'trg_vocab_size': 50000
, 'seed': 1234, 'beam_size': 4, 'lp_rate': 0.6, 'max_decoded_trg_len': 100, 'device_map': None, 'device': 'cuda'}
2024-06-08 17:15:45,720 - modelscope - WARNING - No val key and type key found in preprocessor domain of configuration.json file.                                                                                                                     
2024-06-08 17:15:45,721 - modelscope - WARNING - Cannot find available config to build preprocessor at mode inference, current config: {'src_lang': 'zh', 'tgt_lang': 'en', 'src_bpe': {'file': 'bpe.zh'}, 'model_dir': '/home/myungkyu/.cache/modelsc
ope/hub/damo/cv_anytext_text_generation_editing/nlp_csanmt_translation_zh2en'}. trying to build by task and model information.
2024-06-08 17:15:45,721 - modelscope - WARNING - No preprocessor key ('csanmt-translation', 'translation') found in PREPROCESSOR_MAP, skip building preprocessor.
WARNING:tensorflow:From /home/myungkyu/anaconda3/envs/anytext/lib/python3.10/site-packages/modelscope/models/nlp/csanmt/translation.py:81: calling RandomNormal.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be re
moved in a future version.
Instructions for updating:                                   
Call initializer instance with the dtype argument instead of passing it to the constructor                                                                                                                                                            
WARNING:tensorflow:From /home/myungkyu/anaconda3/envs/anytext/lib/python3.10/site-packages/modelscope/models/nlp/csanmt/translation.py:793: calling while_loop_v2 (from tensorflow.python.ops.while_loop) with back_prop=False is deprecated and will 
be removed in a future version.                              
Instructions for updating:                                   
back_prop=False is deprecated. Consider using tf.stop_gradient instead.                                                    
Instead of:                                                  
results = tf.while_loop(c, b, vars, back_prop=False)                                                                       
Use:                                                         
results = tf.nest.map_structure(tf.stop_gradient, tf.while_loop(c, b, vars))                                                                                                                                                                          
2024-06-08 17:15:54,056 - modelscope - INFO - loading model from /home/myungkyu/.cache/modelscope/hub/damo/cv_anytext_text_generation_editing/nlp_csanmt_translation_zh2en/tf_ckpts/ckpt-0
/home/myungkyu/AnyText/demo.py:382: GradioDeprecationWarning: 'scale' value should be an integer. Using 0.3 will cause issues.                                                                                                                        
  run_gen = gr.Button(value="Run(运行)!", scale=0.3, elem_classes='run')                                                   
/home/myungkyu/AnyText/demo.py:430: GradioDeprecationWarning: 'scale' value should be an integer. Using 0.4 will cause issues.                                                                                                                        
  ori_img = gr.Image(label='Ori(原图)', scale=0.4)                                                                         
/home/myungkyu/AnyText/demo.py:442: GradioDeprecationWarning: 'scale' value should be an integer. Using 0.3 will cause issues.                                                                                                                        
  run_edit = gr.Button(value="Run(运行)!", scale=0.3, elem_classes='run')                                                  
Running on local URL:  http://127.0.0.1:7860                                                                               

Thanks for being a Gradio user! If you have questions or feedback, please join our Discord server and chat with us: https://discord.gg/feTf9x3ZSB                                                                                                     

To create a public link, set `share=True` in `launch()`.                                                                   
IMPORTANT: You are using gradio version 3.50.0, however version 4.29.0 is available, please upgrade.                                                                                                                                                  
--------                                                     
Global seed set to 33789703                                  
Data shape for DDIM sampling is (4, 4, 64, 64), eta 0.0                                                                    
Running DDIM Sampling with 20 timesteps                                                                                    
DDIM Sampler: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:04<00:00,  4.16it/s]                                                                                                  
Prompt: A raccoon stands in front of the blackboard with the words  "Deep Learning"  written on it                                                                                                                                                    
Done, result images are saved in: SaveImages
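A note on the long "Some weights ... were not used when initializing CLIPTextModel" warning above: it is expected here, because only the CLIP text encoder is loaded from the full clip-vit-large-patch14 checkpoint, so every vision-tower weight is skipped. A minimal reproduction with plain transformers (the public checkpoint name below is an assumption; the pipeline actually points at its cached ModelScope copy):

```python
# Loading just the text tower from a full CLIP checkpoint emits the same
# "weights were not used" warning; it is harmless for text-conditioned diffusion.
from transformers import CLIPTextModel, CLIPTokenizer

name = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(name)
text_encoder = CLIPTextModel.from_pretrained(name)  # warns about unused vision_model.* weights

tokens = tokenizer('"Deep Learning"', return_tensors="pt")
embeddings = text_encoder(**tokens).last_hidden_state
print(embeddings.shape)  # (1, sequence_length, 768), matching context_dim=768 in the log
```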
myungkyuKoo commented 4 months ago
[image: generated result for the prompt below]

Prompt: A blue traffic sign with the words "Welcome!" written on it

[image: generated result for the prompt below]

Prompt: A newspaper whose title is "Breaking News"

Not only are the results for the given example templates incorrect, but the general rendering accuracy is also poor (all parameters are left at the default values provided in the repository). Am I doing something wrong?

tyxsspa commented 4 months ago

That doesn't look normal. Are the versions of your dependency libraries (such as opencv, transformers, and xformers) consistent with those in environment.yaml? Are you using the Arial_Unicode.ttf font file?
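A quick way to dump the installed versions for comparison against environment.yaml (a sketch; the distribution names below, e.g. opencv-python, are assumptions and might be opencv-python-headless or similar in a given environment):

```python
# Print installed versions of the packages mentioned above so they can be
# compared line by line with environment.yaml.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("opencv-python", "transformers", "xformers", "torch", "gradio", "modelscope"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```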

myungkyuKoo commented 4 months ago
[images: glyph maps from the HuggingFace demo (left) and my server (right)]

Here are the glyph images obtained through debugging, from the HuggingFace demo (left) and my server (right), respectively. I believe I'm using the correct font file (Arial_Unicode.ttf), but the rendered font sizes are different, which might be causing the issue.

I'm still looking into possible causes, but do you have any ideas?
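One hypothetical way to compare glyph rendering between the two environments (this is not the repository's own glyph code, just a Pillow sketch; the font path is whatever the config points at):

```python
# Render the same string with the configured font file and print its measured
# bounding box, so the output can be compared between the demo machine and the
# local server.
from PIL import Image, ImageDraw, ImageFont

font_path = "font/Arial_Unicode.ttf"  # assumption: the path referenced by the repo's config
font = ImageFont.truetype(font_path, size=60)

img = Image.new("L", (512, 128), color=0)
draw = ImageDraw.Draw(img)
draw.text((10, 10), "Deep Learning", font=font, fill=255)

print("bbox:", draw.textbbox((10, 10), "Deep Learning", font=font))
img.save("glyph_check.png")
```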

Oh, I found that the font file I downloaded from the internet (arialuni.ttf) was not compatible. After I replaced it with the default font file on my Mac, the problem was solved :) Thanks for your kind help!
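For anyone hitting the same problem, the embedded font name is a quick check that the file on disk is really Arial Unicode MS rather than a differently built arialuni.ttf (a Pillow sketch; the expected family name is an assumption):

```python
# Print the family/style name embedded in the font file; an incompatible
# download will often report a different family or limited glyph coverage.
from PIL import ImageFont

font = ImageFont.truetype("font/Arial_Unicode.ttf", size=40)
print(font.getname())  # expected to be something like ('Arial Unicode MS', 'Regular')
```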