joanrod / ocr-vqgan

OCR-VQGAN, a discrete image encoder (tokenizer and detokenizer) for figure images in the Paper2Fig100k dataset. Implements an OCR perceptual loss for clear text-within-image generation. Forked from the VQGAN in CompVis/taming-transformers.
https://arxiv.org/abs/2210.11248
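The core idea named above, an OCR perceptual loss, penalizes reconstructions whose text regions look different from the target through the eyes of a pretrained text-recognition network. A minimal sketch of that idea follows; it is illustrative only, assuming a frozen OCR feature trunk supplied as a list of stages (the class name, arguments, and the toy conv stages below are hypothetical, not the repository's actual API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OCRPerceptualLoss(nn.Module):
    """Sum of L1 distances between feature maps of the reconstruction
    and the target, computed at each stage of a frozen OCR backbone."""

    def __init__(self, feature_stages):
        super().__init__()
        # feature_stages: sequential nn.Modules from a pretrained
        # text-recognition trunk (assumption: caller provides them).
        self.stages = nn.ModuleList(feature_stages)
        for p in self.parameters():
            p.requires_grad_(False)  # the loss network stays fixed

    def forward(self, recon, target):
        loss = recon.new_zeros(())
        x, y = recon, target
        for stage in self.stages:
            x, y = stage(x), stage(y)
            loss = loss + F.l1_loss(x, y)
        return loss

# Toy usage: random conv stages stand in for a real OCR trunk.
stages = [
    nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU()),
]
loss_fn = OCRPerceptualLoss(stages)
recon = torch.rand(1, 3, 32, 128)   # e.g. a text-line crop
target = torch.rand(1, 3, 32, 128)
loss = loss_fn(recon, target)
```

In a VQGAN training loop this term would be added to the usual reconstruction and codebook losses, so the decoder is pushed to preserve the fine strokes that OCR features are sensitive to.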

vqgan result #6

Open Winnie202 opened 1 year ago

Winnie202 commented 1 year ago

Thank you for sharing the code. I used taming-transformers to do image reconstruction on street-view images, but small text regions don't reconstruct well. If I train this model on that type of dataset, can it improve VQGAN's reconstruction of small text? For example:

[attached: three street-view reconstruction examples]

joanrod commented 1 year ago

Hi @Winnie202, awesome results!! I'm glad you could reconstruct the small texts. We could try generating synthetic scene-text as a follow-up

Winnie202 commented 1 year ago

> awesome results!! I'm glad you could reconstruct the small texts. We could try generating synthetic scene-text as a follow-up

Do you have any suggestions for improving the reconstruction of text in these scenes?