oh-my-ocr / text_renderer

https://oh-my-ocr.github.io/text_renderer/README.html
MIT License

Unable to create images and labels for non Latin languages correctly #69

Open asif-ca opened 9 months ago

asif-ca commented 9 months ago

I am unable to create images and labels correctly for non-Latin languages. For example, I want to create labels and images for Bengali or Gujarati.

Here is what is happening:

The tool generates data for Bengali (I set up the corpus, char_file.txt, and fonts for Bengali). However, the text rendered on the image does not match the text taken from the corpus, even though the label is correct.

The sequence of characters sometimes changes in the image. Example label line: 000347671.jpg থেকে বঞ্চিত

The label text is correct but the text rendered on the image is incorrect.

@ohmyocr, @Sanster, @mikeshi80, @ELanning: any solution? Thanks.

ELanning commented 9 months ago

Does this happen with every font file you've tried? In my experience, this kind of problem is generally caused by the font file. That said, I cannot tell what the issue is with your sample image. It looks like it matches the label to me, but that's probably because I cannot read Bengali or Gujarati.

EDIT: Never mind, I see the issue now. This could be a problem with LTR vs. RTL languages. I'm not sure whether the underlying library handles those. I'll take a look.

EDIT 2: It could still be the font file, or some kind of RTL (right-to-left) vs. LTR (left-to-right) change may be needed: https://gist.github.com/eamirgh/7f7dae86fcee4eda73f0b5fde4e1e630

I recommend experimenting with both.

asif-ca commented 9 months ago

Thanks @ELanning for your prompt reply.

Bengali is written from left to right (LTR). As you mentioned, it might be a font file issue, but I assume fonts should not change the sequence of characters in general ... what do you think?

Which file can I modify to experiment with image drawing in this script? Any workaround?

ELanning commented 9 months ago

> but I assume fonts should not change the sequence of characters in general ... what do you think?

Font files can do some crazy things.

Could you kindly try the Noto Sans Bengali font and let me know how it renders?

On the Google preview, it looks like it handles the label fine, so this will confirm whether the issue is font-related or library-related.

> Which file can I modify to experiment with image drawing in this script? Any workaround?

I didn't write that image drawing script, but it looks like it's not necessary now that I know Bengali is LTR.

asif-ca commented 9 months ago

I tried the above font in the script, and this is how it renders. It looks like the script changes the sequence of some specific characters no matter which font is used.
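
One way to narrow down which characters are affected is to list the code points of the label in their stored order. The sketch below uses only Python's standard `unicodedata` module; the helper names `describe` and `combining_marks` are made up for illustration. In Bengali, some vowel signs (for example U+09C7) are stored after their consonant but drawn to its left, so a renderer that does no text shaping can show them in the wrong position. Whether that is what happens here is an assumption, not something confirmed in this thread.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, general category) per character, in stored order."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
    ]


def combining_marks(text):
    """Return the characters whose general category is a mark (Mn, Mc, or Me)."""
    return [ch for ch in text if unicodedata.category(ch).startswith("M")]


label = "থেকে বঞ্চিত"  # label text from the report above
for code_point, name, category in describe(label):
    # Only ASCII is printed, so this works even on terminals without Bengali fonts.
    print(code_point, category, name)
```

Characters reported with category `Mc` or `Mn` are the ones that depend on shaping, so they are the first candidates to compare against the rendered image.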

ELanning commented 9 months ago

It sounds like this could be an issue in Pillow, in which case there's not much text_renderer can do.

You may want to try creating a minimal reproducible code snippet and posting it to the Pillow GitHub issues. I believe they'd be greatly interested in fixing something like this, or may be able to provide more in-depth troubleshooting advice. Sorry I couldn't be of more help.

asif-ca commented 9 months ago

Thanks for your guidance and direction.