mpcabd / python-arabic-reshaper

Reconstruct Arabic sentences to be used in applications that don't support Arabic
MIT License

Arabic text reversed with connected letters not reshaped correctly #69

Open AnasAG opened 3 years ago

AnasAG commented 3 years ago

I have a script that extracts Arabic text from PDFs, using the pdfminer library for parsing. In the extracted text, the sentences come out reversed, but the letters in each word are still connected.

Original text in PDF: "وضح المقصود بكل من المصطلحات التالية"
Extracted text from PDF: "ﺓﻳﻼﺗﻼ ﺗﺎﺣﻠﻄﺼﻤﻼ ﻧﻢ ﻟﻜﺐ ﺩﻭﺻﻘﻤﻼ ﺣﻀﻮ"

When using arabic_reshaper I noticed a situation where the Arabic text is not formatted correctly.

Sample Code:

import arabic_reshaper
from bidi.algorithm import get_display

text = "ﺓﻳﻼﺗﻼ ﺗﺎﺣﻠﻄﺼﻤﻼ ﻧﻢ ﻟﻜﺐ ﺩﻭﺻﻘﻤﻼ ﺣﻀﻮ"

reshaped_text = arabic_reshaper.reshape(text)    # correct its shape
print(reshaped_text)
# result: ﺓﻳﻼﺗﻼ ﺗﺎﺣﻠﻄﺼﻤﻼ ﻧﻢ ﻟﻜﺐ ﺩﻭﺻﻘﻤﻼ ﺣﻀﻮ

bidi_text = get_display(reshaped_text)
print(bidi_text)
# result: ﻮﻀﺣ ﻼﻤﻘﺻﻭﺩ ﺐﻜﻟ ﻢﻧ ﻼﻤﺼﻄﻠﺣﺎﺗ ﻼﺗﻼﻳﺓ

However, when the input is similarly reversed Arabic text whose letters are isolated (not connected), arabic_reshaper works properly.

Original text in PDF: "على الترتيب (n-l-m-s) اكتب جميع اعداد الكم الاربعة"
Extracted text from PDF: "ﺐﻴﺗﺮﺘﻟﺍ ﻰﻠﻋ (n-l-m-s) ﺔﻌﺑﺭﻻﺍ ﻢﻜﻟﺍ ﺩﺍﺪﻋﺍ ﻊﻴﻤﺟ ﺐﺘﻛﺍ"

Sample code:

import arabic_reshaper
from bidi.algorithm import get_display

text = "ﺐﻴﺗﺮﺘﻟﺍ ﻰﻠﻋ (n-l-m-s) ﺔﻌﺑﺭﻻﺍ ﻢﻜﻟﺍ ﺩﺍﺪﻋﺍ ﻊﻴﻤﺟ ﺐﺘﻛﺍ"

reshaped_text = arabic_reshaper.reshape(text)    # correct its shape
print(reshaped_text)
# result:  ﺐﻴﺗﺮﺘﻟﺍ ﻰﻠﻋ (n-l-m-s) ﺔﻌﺑﺭﻻﺍ ﻢﻜﻟﺍ ﺩﺍﺪﻋﺍ ﻊﻴﻤﺟ ﺐﺘﻛﺍ

bidi_text = get_display(reshaped_text)
print(bidi_text)
# result: على الترتيب (n-l-m-s) اكتب جميع اعداد الكم الاربعة

I couldn't find out why it behaves this way. I also tried the ArabicReshaper class with a custom configuration, changing options such as use_unshaped_instead_of_isolated and support_ligatures, but the behavior was the same. The PDF font affects the extracted text, which might also be why the text is sometimes extracted with connected letters and sometimes with isolated ones. In general, I'm not sure whether this is a bug, something related to ligatures, or something else.
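One way to narrow this down is to check which Unicode blocks the extracted characters come from. The sketch below assumes (not confirmed in this thread) that pdfminer returned code points from the Arabic Presentation Forms blocks rather than base Arabic letters, which would be consistent with reshape() returning the string unchanged. The function name describe_arabic_chars is hypothetical, and only the standard library is used.

```python
def describe_arabic_chars(text):
    """Count characters by Unicode block: base Arabic vs. presentation forms.

    Hypothetical diagnostic helper, not part of arabic_reshaper.
    """
    counts = {"base": 0, "presentation": 0, "other": 0}
    for ch in text:
        cp = ord(ch)
        if 0x0600 <= cp <= 0x06FF:
            # Arabic block: the logical letters a reshaper normally expects
            counts["base"] += 1
        elif 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFC:
            # Arabic Presentation Forms-A and -B: pre-shaped glyph code points
            counts["presentation"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    # U+0627 is a base ALEF, U+FE8D is ALEF ISOLATED FORM
    print(describe_arabic_chars("\u0627\uFE8D a"))
```

If most characters land in "presentation", the text is already shaped, and any wrong letter forms would have been baked in before arabic_reshaper sees it.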

naourass commented 1 year ago

I'm running into this same issue. All my target text is in join format. Is it possible to isolate the letters when they're joined?
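A possible approach, offered as a sketch rather than a confirmed fix: if the joined letters are Arabic Presentation Forms code points, Unicode NFKC normalization maps each one back to its base letter. The helper name unshape below is hypothetical, and it normalizes only characters in the presentation-form ranges so that other text is left untouched.

```python
import unicodedata


def unshape(text):
    """Map Arabic presentation-form code points back to base letters.

    Hypothetical helper; assumes the joined letters are presentation forms.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFC:
            # NFKC applies the compatibility decomposition, e.g.
            # U+FEDF (LAM INITIAL FORM) becomes U+0644 (LAM)
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    shaped = "\uFE8E\uFEDF"  # ALEF FINAL FORM, LAM INITIAL FORM
    print([hex(ord(c)) for c in unshape(shaped)])
```

Two caveats. This does not fix character order, so reversed text still needs to be reordered separately. Ligatures such as lam-alef (U+FEFB) expand to two letters in logical order, so whether to normalize before or after reversing may depend on how the text was extracted.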

abdelmalek13 commented 3 months ago

I have the same problem when extracting data from a PDF.