mpcabd / python-arabic-reshaper

Reconstruct Arabic sentences to be used in applications that don't support Arabic
MIT License

python-arabic-reshaper does not recognize all UTF-8 Arabic characters? #15

Closed caleighm closed 6 years ago

caleighm commented 6 years ago

I'm trying to use python-arabic-reshaper to help make the Python library wordcloud (https://github.com/amueller/word_cloud/issues/70) work with Arabic text, but am having trouble.

Here's where I use arabic reshaper. Each key is UTF-8, and may use Latin, Arabic or other language characters. They display correctly outside of this wordcloud.

Situation 1

    print 'Reshaping words...'
    reshaped_words = {}
    for key in words.keys():
        decoded = key.decode('utf-8')
        reshaped = arabic_reshaper.reshape(decoded)
        reshaped_words[get_display(reshaped)] = words[key]

When I run the above (and create a wordcloud out of the reshaped_words frequencies), I get: [image: wcfeg3]

Situation 2

    print 'Reshaping words...'
    reshaped_words = {}
    for key in words.keys():
        reshaped_words[get_display(key)] = words[key]

When I run the above, I get: [image: wcfeg4]

You'll see that in the first situation the script is correct but some characters are missing (the question boxes). In the second situation the script is incorrect (the letters are all disjointed) but none of the characters are missing.

It seems like arabic reshaper is having trouble recognizing and encoding certain Arabic (or other) characters.

Any ideas? I am not sure what I'm doing wrong. I don't speak Arabic, so I have trouble debugging or recognizing what's wrong here - I just know that it is wrong!

mpcabd commented 6 years ago

Hi @caleighm

An interesting case indeed.

I suspect that the font has something to do with these missing glyphs, as I can run the code on the word ريتويت, which appears in the first image you posted, and I get it properly reshaped:

    import arabic_reshaper
    # had to use spaces to ensure it renders properly in GitHub
    assert arabic_reshaper.reshape('ر ي ت و ي ت'.replace(' ', '')) == 'ﺭ ﻳ ﺘ ﻮ ﻳ ﺖ'.replace(' ', '')
    assert arabic_reshaper.reshape('ر ي ت و ي ت'.replace(' ', '')) == '\ufead \ufef3 \ufe98 \ufeee \ufef3 \ufe96'.replace(' ', '')

http://www.fileformat.info/info/unicode/char/fead/index.htm
http://www.fileformat.info/info/unicode/char/fef3/index.htm
http://www.fileformat.info/info/unicode/char/fe98/index.htm
http://www.fileformat.info/info/unicode/char/feee/index.htm
http://www.fileformat.info/info/unicode/char/fef3/index.htm
http://www.fileformat.info/info/unicode/char/fe96/index.htm

In your issue you have two images, but unfortunately they don't contain the same words, so I suggest first running the code on the same data source with the same words and generating the two images again.

What I suspect is that the font is missing glyphs for the presentation forms of some letters that look the same in isolation as in other positions. Take ر for example: it looks the same on its own, as the first letter of a word, and when it is preceded by a letter that doesn't have a medial or initial form (see the code here). In this case the font probably has a glyph for U+0631 but not for U+FEAD. What font are you using? Try other fonts and see what you get.
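
One way to test this hypothesis is to compare the reshaped code points against what the font covers. The following is only a minimal sketch and not part of arabic_reshaper: `missing_codepoints`, `font_codepoints`, and the `hypothetical_supported` set are names made up for illustration, and `font_codepoints` assumes the third-party fontTools package is installed.

```python
import unicodedata


def missing_codepoints(text, supported):
    """Return the characters in `text` whose code points are not in `supported`."""
    return [ch for ch in text if ord(ch) not in supported]


def font_codepoints(font_path):
    """Collect the code points a font maps to glyphs.

    Assumption: fontTools is installed; it is imported lazily so the rest
    of this sketch runs without it.
    """
    from fontTools.ttLib import TTFont
    return set(TTFont(font_path).getBestCmap() or {})


# U+FEAD is the isolated presentation form of the base letter U+0631.
assert unicodedata.name('\ufead') == 'ARABIC LETTER REH ISOLATED FORM'
assert unicodedata.normalize('NFKC', '\ufead') == '\u0631'

# Hypothetical coverage: the base letter is present, its isolated form is not.
hypothetical_supported = {0x0631, 0xFEF3, 0xFE98, 0xFEEE, 0xFE96}
reshaped = '\ufead\ufef3\ufe98\ufeee\ufef3\ufe96'  # the reshaped word from above
print(['U+%04X' % ord(ch) for ch in missing_codepoints(reshaped, hypothetical_supported)])
# prints ['U+FEAD']
```

With a real font, passing `font_codepoints('/path/to/font.ttf')` (a placeholder path) as `supported` should list exactly the characters that would render as question boxes.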

Regards

caleighm commented 6 years ago

Thanks so much - you were exactly right. I was using an Arabic font but I think my corpus must have some glyphs that weren't in that font for whatever reason. I switched to http://unifoundry.com/unifont.html and, while admittedly less attractive, all of the glyphs now show correctly. Thank you!

mpcabd commented 6 years ago

Hi @caleighm,

In the latest version, v2.0.14, I added a new option, use_unshaped_instead_of_isolated, that should help you get around the problem. Please test it if you can and confirm that it works for your case as well.
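
A minimal sketch of how the option could be enabled, assuming arabic_reshaper 2.0.14 or later is installed and that options are passed through the `configuration` dictionary of `ArabicReshaper`. The helper name `reshape_with_unshaped_fallback` is made up for illustration.

```python
def reshape_with_unshaped_fallback(text):
    """Reshape `text`, keeping the base letter wherever an isolated form would be used.

    Assumption: arabic_reshaper >= 2.0.14 is installed; it is imported lazily
    so that defining this helper does not fail without it.
    """
    from arabic_reshaper import ArabicReshaper
    reshaper = ArabicReshaper(
        configuration={'use_unshaped_instead_of_isolated': True}
    )
    return reshaper.reshape(text)


if __name__ == '__main__':
    word = '\u0631\u064a\u062a\u0648\u064a\u062a'  # the word from the first image
    try:
        reshaped = reshape_with_unshaped_fallback(word)
    except ImportError:
        print('arabic_reshaper is not installed')
    else:
        # List the code points to check whether U+0631 replaced U+FEAD.
        print(['U+%04X' % ord(ch) for ch in reshaped])
```

With the option on, a font that has a glyph for U+0631 but not for U+FEAD should no longer show a question box for that letter.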

Regards