py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Arabic text is extracted in the wrong order #1296

Closed baaziznasser closed 2 years ago

baaziznasser commented 2 years ago

There is a big problem with Arabic text extraction.

If we have a string that says (مرحبا هذه تجربة), the PyPDF2 extract_text function returns it as (ةبرجت هذه ابحرم).

Environment

$ python -m platform
Windows-10-10.0.19044-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3

Code + PDF

This is a minimal, complete example that shows the issue with file.pdf:

from PyPDF2 import PdfReader

reader = PdfReader("file.pdf")

text = ""
for page in reader.pages:
    text += page.extract_text()
    break
print(text)

It gives:

Ang-L1+sociology-globalisation: 

 :مﻗر ة رﻀﺎحمﻟا1              : ﺔمﻟوﻌﻟاglobalization : 

 : ﺔمﻟوﻌﻠﻟ ﺔ�خ�رﺎتﻟا ﺔ�فﻠخﻟا 
 : ﺢﻠطصمﻟا  ﺢﻠطصمﻟا ﺔمﻟوﻌﻟاmondialisation  ﺔمﻠكﻟا نﻤmonde  ﺔ�نیﺘﻼﻟا ﺔمﻠكﻟا نﻤ ةدمتسﻤmundus 

 ﻲنﻌﺘ ﻲتﻟاوunivers.  و ،globe  ﺔ�نیﺘﻼﻟا ﺔمﻠكﻟا نﻤglobus   .ءﻲشﻟا م�مﻌﺘ ﻲنﻌﺘو 
: ﺎﻬﻔ�رﻌﺘ 
  نودﺒ وأ دصﻘ� ﻰﻌسﺘ ﻲتﻟا تا روطتﻟا و تادجتسمﻟا " ﺎﻬﻨﺄ� ﺔمﻟوﻌﻟا بﺎت� ﻒﻟؤﻤ :زرﺘاو موكﻟﺎﻤ ﺎﻬﻓ رﻌ�.دﺤاو ﻲمﻟﺎﻋ ﻊمتجﻤ ﻲﻓ مﻟﺎﻌﻟا نﺎكﺴ ﺞﻤد ﻰﻟإ دصﻗ 
Globalization  ﻞكﻟا ﻞمش�ﻟ ﻪﺘ رﺌاد ﻊ�ﺴوﺘ و ءﻲشﻟا م�مﻌﺘ : ﻲنﻌﺘ ﻲﻬﻓ."ادﺤاو ﺎمﻟﺎﻋ مﻟﺎﻌﻟا ﻞﻌﺠ "يأ ﻞﺴا ر زیﺘ زﻟ رﺎشﺘ ﻲك� رﻤﻷا وﻫ ﺔمﻟوﻌﻠﻟ ﻰﻟوﻻا ئدﺎ�مﻟا ﻊﻀوﺒ مﺎﻗ نﻤ لوأ charles taze russell  
  " ﺢﻠطصﻤ ﻰﻟإ ﻞﺼوﺘ نﻤ لوأ دﻌ� ﻪﻨا ﺎم� ،تﺎ� رﺸ بﺤﺎﺼ و ﺎ�ﻟﺎمﺴأ ر نﺎ� نأ دﻌ� ﺎسﻗ ﺢ�ﺼأ يذﻟا  مﺎﻋ ﻲﻓ"ﺔﻗﻼمﻌﻟا تﺎ� رشﻟا1897 . 
ا ﻲﻓ كﻟذ� ﺢﻠطصمﻟا ﻞمﻌتﺴأ  نﺎﺘ ری�و� رﺎ�ﺒ ف رط نﻤ ة رﻤ لوﻷ ﺔ�سﻨ رﻔﻟpierre de coubertin  ﻲ ﻓ  ة د �ر ﺠle figaro
  م ﺎ ﻋ13  ربمس�د1904 . 
ﻲسﻨ رﻔﻟا ﻲﻓا رﻐجﻟا خ رؤمﻟا بﺎت� ﻲﻓ رﻬظ مﺜ    vincent copdepny  ت� رﺘأ لوﺒ ل بﺎت� ﻲﻓ رشﻨ ، paul otlet   ﺔنﺴ1916 "ﻲنیﺠ نﺎﻓ دﻟوﻨ رأ" دﯿ ﻰﻠﻋ مﺜ.arnold von gennep ,1933   ظ و ﻲﻓا رﻐجﻟا بﺎت� ﻲﻓ كﻟذ� ت رﻬle géographe laurent carroué    ﻰﻟإ ﺔﻓﺎﻀﻹﺎ� ،يوورﺎ� نا روﻟ   ﻪ�ﺸور ﻲﻏguy rocher     نﺎﻫوﻟ كﺎﻤ لﺎﺸ رﺎﻤ دﯿ ﻰﻠﻋ و ، ﺎﻬﻔ� رﻌﺘ لوﺎﺤ يذﻟاmarshall Mc luhan   ﻪﻔﻟؤﻤ ﻲﻓvillage Global "   ﻲﻨﺎ�ﺴﻹا ز� رﯿ ﺎمﻛmanuel castell’s    و ﺔ�دﺎصتﻗﻻا ﻞﻤاوﻌﻟا ﻰﻠﻋ ﻻا ز� ر�و ، ﺔ�عﺎمتﺠjohn urry
   ﺦﻟإ...،ﺔ�ﺴﺎ�سﻟا و ﺔ�فﺎﻘثﻟا ،ﺔ�دﺎصتﻗﻻا ، ﺔ�ﻨﺎسﻨﻹا تا رییﻐتﻟا ﻰﻠﻋ 
: ةﺄشنﻟا : ﺔمﻟوﻌﻟا روﻬظ بﺎ�ﺴأ 
: ﺔ�دﺎصتﻗﻹا ﻞﻤاوﻌﻟا 
 ﺔ�ﻨﺎط� ربﻟا ﺔ�ق رشﻟا دنﻬﻟا ﺔ� رﺸ س�ﺴﺄﺘ . 
 : ﺔ�ﻟﺎﻤو ﺔ� رﺎجﺘ ، ﺔ�عﺎنﺼ ، ﺔ�ﺠﺎتﻨإ ﻞﻤاوﻋ 
- ة روثﻟا و ﺔ�عﺎنصﻟا ة روثﻟا.ﺔ�ﺠوﻟونكتﻟا 
- .ﺔ�ﻟﺎمﻟا تﺎسﺴؤمﻟا لﻼﺨ نﻤ ﻲﻟﺎﻤ دﺎصتﻗا روﻠبﺘ ق� رط نﻋ لاوﻤﻷا سوؤر 
-.لاوﻤﻷا سوؤر لﺎﻘتﻨاو ﺔ� رﺎجتﻟا ﺔ� رحﻟا ةدﺎ� ز 

but it is partially reversed. For example, the beginning (image) should be:

‫‪:‬‬ ‫‪globalization‬‬ ‫‪:‬‬ ‫العولمة   ‫‪1:‬‬ ‫رقم‪:‬‬ ‫المحاضرة‬
‬
MartinThoma commented 2 years ago

Hi @baaziznasser ,

Thank you for letting us know that there is an issue. It would have been way more helpful if you had used the standard bug template. I've added the relevant parts to your post. Please fill out the TODOs.

baaziznasser commented 2 years ago

@MartinThoma Done. As I said in the example, the Arabic text is returned reversed. Please try to solve this as soon as you can, because I really need it before school starts. I am working on a text reader for blind users. Please help.

baaziznasser commented 2 years ago

I wrote this function to try to work around the problem, but if a word contains digits or non-Arabic characters, they get reversed too:

import re

def arabic_reverse(string):
    pattern = re.compile(r'[ا-يئءؤأإآةًٌٍَُِّْ]')
    if not pattern.findall(string):
        return string
    ar = ""
    for line in string.split("\n"):
        pattern = re.compile(r'[ا-يئءؤأإآةًٌٍَُِّْ]')
        if pattern.findall(line):
            line = line[::-1]
        arline = []
        for word in line.split():
            pattern = re.compile(r'[ا-يئءؤأإآةًٌٍَُِّْ]')
            if not pattern.findall(word):
                arline.append(word[::-1])
            else:
                arline.append(word)
        ar = f"{ar}\r\n{' '.join(arline)}"

    return ar

MartinThoma commented 2 years ago

Do you have an example pdf which you can upload that shows the issue?

baaziznasser commented 2 years ago

@MartinThoma here is a test PDF. Sorry, I wasn't able to upload it to the main post; I don't know why.

file.pdf

pubpub-zz commented 2 years ago

Under analysis. @baaziznasser, to confirm my understanding: a) the Arabic characters are reversed, am I right? I've tried to produce a very simple example, arab.pdf, using the 3 keys on the left of the keyboard. In the file, the character codes (purely arbitrary codes) are

<034e> <0321> <032d>. Those are then translated into "\u0626", "\u0634", "\u0636" = ('ئ', 'ش', 'ض'). The order could be reversed compared to print, but I do not know how to detect RTL (I would like to have the general rule). Any idea?

baaziznasser commented 2 years ago

@pubpub-zz thanks for your reply. I don't fully understand what you said, but if you mean how to detect RTL text, this is what I was using; you can get an idea from it:

import unicodedata as UD

texts = ['مرحبا', 'Hello']
for text in texts:
    # Share of characters whose bidi class is R or AL (right-to-left letters)
    x = len([ch for ch in text if UD.bidirectional(ch) in ('R', 'AL')]) / len(text)
    print('{t} => {c}'.format(t=text, c='RTL' if x > 0.5 else 'LTR'))

baaziznasser commented 2 years ago

Here is an example to make it clear. Say we have a line with this text: هذا مثال على المشكل الذي يواجهني and when I used PyPDF2 I got this: ينهجاوي يذلا لكشملا ىلع لاثم اذه

That is, each line is reversed from the first to the last character. I used this function to check and reverse:

import re
def arabic_reverse(string):
    pattern = re.compile(r'[ا-يئءؤأإآةًٌٍَُِّْ]')
    if not pattern.findall(string):
        return string
    ar = ""
    for line in string.split("\n"):
        pattern = re.compile(r'[ا-يئءؤأإآةًٌٍَُِّْ]')
        if pattern.findall(line):
            line = line[::-1]
        arline = []
        for word in line.split():
            pattern = re.compile(r'[ا-يئءؤأإآةًٌٍَُِّْ]')
            if not pattern.findall(word):
                arline.append(word[::-1])
            else:
                arline.append(word)
        ar = f"{ar}\r\n{' '.join(arline)}"

    return ar

But here I found a big problem. For example, if the PDF text is محمد21, I get دمحم21 from the library, and if I use this method to reverse the text I get محمد12. I hope everything is clear now, and I hope you can help me solve the problem.
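
The digit problem described above can be avoided by reversing only the right-to-left runs inside each word instead of the whole word. The following is a minimal sketch, not the approach taken in PyPDF2. It assumes that the extracted line has its words in reversed order and the Arabic letters of each word reversed, as in the examples reported in this thread. The names is_rtl, fix_word and fix_line are hypothetical. Diacritics are not handled, and consecutive Latin words would end up swapped.

```python
import unicodedata


def is_rtl(ch):
    # Bidi classes R and AL cover Hebrew and Arabic letters.
    return unicodedata.bidirectional(ch) in ("R", "AL")


def fix_word(word):
    # Split the word into maximal runs of RTL / non-RTL characters
    # and reverse only the RTL runs, so digits keep their order and place.
    out = []
    run = ""
    for ch in word:
        if run and is_rtl(ch) != is_rtl(run[-1]):
            out.append(run[::-1] if is_rtl(run[-1]) else run)
            run = ""
        run += ch
    if run:
        out.append(run[::-1] if is_rtl(run[-1]) else run)
    return "".join(out)


def fix_line(line):
    # Lines without any RTL character are returned untouched.
    if not any(is_rtl(ch) for ch in line):
        return line
    # Restore the word order, then repair each word.
    return " ".join(fix_word(w) for w in reversed(line.split()))
```

With this sketch, fix_line("دمحم21") keeps the digits in order and returns محمد21.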

pubpub-zz commented 2 years ago

@baaziznasser, what I mean is: how does a PDF viewer (Acrobat Reader or any other) know that the text is written right to left and not left to right? I'm having a look at PDF.js.

MartinThoma commented 2 years ago

There is https://en.m.wikipedia.org/wiki/Right-to-left_mark, but I have no clue what pdf uses.

Is that symbol maybe present and we would need to "reset" the writing direction?

baaziznasser commented 2 years ago

@MartinThoma yes, I think it exists. Here is a link from Microsoft that also talks about this: https://docs.microsoft.com/en-us/dynamics365/fin-ops-core/dev-itpro/user-interface/bidirectional-support

@pubpub-zz I also searched some HTML and JavaScript projects; I found this example and converted it to Python:

rtl_check

# This example was tested with Arabic and Hebrew.
import re

def rtl_check(txt):
    rtl_chars = '\u0591-\u07FF\uFB1D-\uFDFD\uFE70-\uFEFC'
    ltr_chars = ('A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02B8\u0300-\u0590\u0800-\u1FFF'
                 '\u2C00-\uFB1C\uFDFE-\uFE6F\uFEFD-\uFFFF')
    # The text is RTL if an RTL character appears before any LTR character.
    regexp = '^[^' + ltr_chars + ']*[' + rtl_chars + ']'
    return re.match(regexp, txt) is not None

I also read this PDF, Understanding Bidirectional (BIDI) Text in Unicode.pdf, in which the writer tries to explain the idea of the RLM.

I hope you can find a way to do that.

pubpub-zz commented 2 years ago

@baaziznasser thanks for the RTL checks. They are very similar to what I've found and started to implement.

I've produced this first draft of the PR. I'm facing lots of issues:

Applied to the beginning of the file, these are my first results:

Ang-L1+sociology-globalisation: 

 :مرﻗ ة راﻟمحﺎﻀ1              : اﻟﻌوﻟمﺔglobalization : 

 : ﻟﻠﻌوﻟمﺔ ﺔ�خ�راﻟتﺎ ﺔ�فاﻟخﻠ 

Do not hesitate to use _debug_for_extract() to get the raw code from the PDF and the translated code.

If you have some ideas, or if you can produce some very simple test samples, they will be welcome.

pubpub-zz commented 2 years ago

There is https://en.m.wikipedia.org/wiki/Right-to-left_mark, but I have no clue what pdf uses.

Is that symbol maybe present and we would need to "reset" the writing direction?

I did not find this character, which would have helped. 😥 Nothing has been found, neither in the text nor in the font, to tell us what to do...
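
A quick way to check an extracted string for such direction marks is to scan it for the Unicode bidirectional control characters. This is only a sketch for experimenting; the name find_bidi_controls is hypothetical, and nothing here is part of PyPDF2.

```python
import unicodedata

# Explicit direction marks, embeddings/overrides and isolates.
BIDI_CONTROLS = {
    "\u200e", "\u200f",                                # LRM, RLM
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}


def find_bidi_controls(text):
    # Return (index, Unicode name) for every bidi control found in text.
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]
```

Calling find_bidi_controls on the string returned by page.extract_text() gives an empty list when no such mark is present.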

pubpub-zz commented 2 years ago

@baaziznasser

Can you have a look at the output from pdfminer.six: is it good?

baaziznasser commented 2 years ago

pdfminer: no, it has the same problem. I tried converting the page to an image and sending it to the tesseract Python library, and it extracted the text without problems. I am now making a sample PDF: I am writing it in Word and will save it as PDF, just to help us with tests.

baaziznasser commented 2 years ago

pdf_test.pdf @pubpub-zz here is a sample PDF. I will also test with it and give you the result.

pubpub-zz commented 2 years ago

@baaziznasser thanks for the test file.

To get rid of display issues, I'm using pyperclip to copy the text and paste it into Notepad.

baaziznasser commented 2 years ago

@pubpub-zz here are the first words as you said

ت ج ر ب ة ا س ت خ ل ا ص ن ص م ن م ل ف

When tested with Adobe Acrobat, the text is almost correct.

baaziznasser commented 2 years ago

@pubpub-zz here is the text that I wrote in Word:

تجربة استخلاص نص من ملف pdf حيث تهتم هذه التجربة بتقييم جودة التحويل الخاص بمكتبة pypdf2 وكيف تستخلص اللغة العربية. في هذا السطر ستكون الكتابة كلها عربية، وسوف أستخدم علامات الترقيم العربية أيضا، فيا ترى كيف ستكون الجودة هنا؟ • أما هذا السطر سوف أحاول أن أدخل بعض الكلمات بلغات أخرى مثل enter باللغة الإنجليزية english language والتي يقصد بها دخول. وفي هذا السطر سوف أحاول أن أكتب كلمات عربية ملتصقة مع أرقام على سبيل المثال محمد21 وناصر58 و123محمود و 54سامي85 أما هنا سوف تكون الفواصل علامات وليس أرقام كمثلا محمد-العربي وأحمد_الجزائري كما هنالك ناصر_abcd. at the end hope this pdf sample is good for testing the rtl language order.

pubpub-zz commented 2 years ago

Things seem to improve: the results in Notepad (comparing Adobe Reader / PyPDF2) (image)

I've pasted an image because the result is different when pasted into GitHub: تجربة استخالص نص من ملفpdf حيث تهتم هذه التجربة بتقييم جودة التحويل الخاص بمكتبة pypdf2 .وكيف تستخلص اللغة العربية في هذا السطر ستكون الكتابة كلها عربية، وسوف أستخدم عالمات الترقيم العربية أيضا، فيا ترى كيف ستكون الجودة هنا؟ • أما هذا السطر سوف أحاول أن أدخل بعض الكلمات بلغات أخرى مثل enter باللغة اإلنجليزية english language والتي يقصد .بها دخول وفي هذا السطر سوف أحاول أن أكتب كلمات عربية ملتصقة مع أرقام على سبيل المثال محمد21 وناصر58 و123 محمود و 54سامي85 أما هنا سوف تكون الفواصل عالمات وليس أرقام كمثال محمد- _الجزائري كما هنالك ناصر_العربي وأحمد abcd . at the end hope this pdf sample is good for testing the rtl language order .

What I've noticed:

Can you also check this proposal? And, for my understanding: am I right in the way I read the sentence? (image)

baaziznasser commented 2 years ago

image

@pubpub-zz take a look at this image. It is from Notepad. The first line is normal Arabic text laid out in LTR direction; this is not a problem, because we can at least read it. The second line is from PyPDF2. As you can see, the text is also LTR, but unlike the first line the words cannot be read, because the order is totally reversed: مرحبا becomes ابحرم. I think the only approach we can use is to check for RTL characters and reverse them. I know it will take more time to load, but it could be an optional parameter of the extract_text function. I searched a lot for the best way to do this, and I could not find anything that solves the problem without checking characters one by one or word by word.

pubpub-zz commented 2 years ago

@baaziznasser, have you tried to deploy PR #1305? If you are not using git, just replace _page.py with https://raw.githubusercontent.com/py-pdf/PyPDF2/5ec7f67d6a7cda4b989128eed4c93b3ef6632a75/PyPDF2/_page.py

baaziznasser commented 2 years ago

@pubpub-zz I'm really sorry. I just tested the PR with my test file: it worked very, very well, and every word is in its place.

baaziznasser commented 2 years ago

Give me some time; I'll test it with more files to see if there are any problems.

pubpub-zz commented 2 years ago

Good!!!! 🎉🎉🎉🎊

While I clean up, can you do some further testing? Can you also give your opinion on what I've noticed?

What I've noticed:

* on line 1, "pdf" is on the right and not on the left (compared with direct PDF viewing), and at the correct position when pasted into a web page: the text is in a separate group. No solution found yet, as the text is already appended to the buffer...

* on line 3, the dot is on the left, not on the right: because it is the first character on the line, it is considered LTR and then stays on the left. No solution identified currently.

* some spaces are introduced (due to text group readjustments).

Can you also check this proposal? And, for my understanding: am I right in the way I read the sentence?

Thanks
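
For the leading dot noted above, one possible direction is to decide the direction of a line from its first strong character, skipping neutral characters such as punctuation, digits and spaces. This is a sketch of that idea only, under the assumption that one direction per line is enough; the name base_direction is hypothetical and this is not what the PR does.

```python
import unicodedata


def base_direction(line, default="LTR"):
    # Walk the line until the first strongly directional character.
    for ch in line:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return "LTR"
        if bidi in ("R", "AL"):
            return "RTL"
    # Only neutral or weak characters (punctuation, digits, spaces) were found.
    return default
```

A line that starts with "." followed by Arabic letters is then classified as RTL, because the dot is not a strong character.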

baaziznasser commented 2 years ago

@pubpub-zz sorry, I forgot to reply to your message. About your reading: yes, you are reading the text the correct way. About your notes: on line 1, the word (pdf) is as you said. On line 3, I made an error: I wrote a long line, so when it was converted to PDF, Adobe Acrobat DC split the line, and with the word wrapping the (.) ended up on the left.

baaziznasser commented 2 years ago

@pubpub-zz to be honest, right now PyPDF2 gives the same result as Adobe Acrobat and Google Chrome: I tested the library with 10 PDFs and got the same result. There are some things that, if you do them, will make PyPDF2 better than all readers. I know it is not easy, but at least we can try. In RTL languages, and specifically in Arabic, there are no vowels; we use something called tashkil, and the symbols are (ًٌٍَُِّْ). When a PDF contains these symbols, the text is full of errors in PDF viewers. I will change the pdf_test file to add these symbols and see what happens. I also found a library named "arabic_reshaper" that you can take a look at; it can deal with these symbols. But as I said, the library's result is now the same as what we get from Adobe Reader and Google Chrome.

here is the pdf test

test_pdf.pdf

baaziznasser commented 2 years ago

@pubpub-zz here is the result from Adobe Acrobat: (image)

and here is the result from PyPDF2: (image)

I found that PyPDF2 is better than Adobe here. To make the test more complete, here is the result from Google Chrome:

(image)

pubpub-zz commented 2 years ago

This is what I get. Viewer: (image) Copy and paste from Acrobat Reader: (image) And with PyPDF2: (image)

It is too difficult for me to know if it is good or bad...

pubpub-zz commented 2 years ago

Can you clarify in what way Google Chrome (PDF.js actually) is better? Is it also better after copy/paste?

baaziznasser commented 2 years ago

@pubpub-zz I'm really sorry to bother you, but I want this very good library to work as well with Arabic as it does with English. There is a problem with the tashkil (َ ً ُ ٌٌ ْ ِ ٍ ّ ـ). If you noticed, in your image from the viewer the word is كَيْفَ, and in the other image it is (ﻛَﯾْ ﻓِﻲ); the correct one is كَيْفَ. I hope we can find a way to deal with these symbols.
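
The extracted text quoted in this thread appears to contain Arabic presentation forms (positional glyph variants and ligatures in the U+FB50 to U+FDFF and U+FE70 to U+FEFF blocks) rather than the base letters. As a possible post-processing step, and not something PyPDF2 is stated to do, Unicode NFKC normalization folds these forms back to base letters. The name fold_presentation_forms is hypothetical, and NFKC also changes other compatibility characters, so this is a sketch to experiment with rather than a fix.

```python
import unicodedata


def fold_presentation_forms(text):
    # NFKC replaces compatibility characters, including Arabic
    # presentation forms, with their base-letter equivalents.
    return unicodedata.normalize("NFKC", text)


# U+FEDB is KAF in its initial form; it folds to the base letter U+0643.
print(fold_presentation_forms("\ufedb") == "\u0643")
# U+FEFB is the LAM-ALEF ligature; it folds to the two letters LAM + ALEF.
print(fold_presentation_forms("\ufefb") == "\u0644\u0627")
```

This does not reorder characters or reattach diacritics; it only changes which code points represent the letters.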

pubpub-zz commented 2 years ago

No problem. Can you check what the results are when you copy/paste from Google Chrome? The point is that viewing and pasting use 2 different sources of transcoding, and we cannot use the way "Display" works.

baaziznasser commented 2 years ago

@pubpub-zz the text displayed by Google Chrome is good, but when we select and copy the text, it is not well organized, at least for screen readers. This problem only occurs when tashkil symbols are used.

I think that if we cannot deal with tashkil, we could add an option to replace tashkil with nothing; maybe that would solve the problem. Of course that is not recommended, because some text needs these symbols to be understandable.

pubpub-zz commented 2 years ago

I may have an explanation: I started some tests with " ْ " (sukun?).
When I paste this character into the GitHub page editor and then type a character (even a Roman character), the letter is drawn at the same vertical position, for example " ْd ", but when I do the same in Notepad the characters are disjoined. (image)

pubpub-zz commented 2 years ago

I think that if we cannot deal with tashkil, we could add an option to replace tashkil with nothing; maybe that would solve the problem. Of course that is not recommended, because some text needs these symbols to be understandable.

I dislike the idea of changing the extraction: this can be done with a simple replace afterwards, no? @MartinThoma, your opinion?
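
Such a replace after extraction could look like the following sketch. It assumes that the marks to drop are the Arabic diacritics U+064B to U+0652 (fathatan through sukun, including shadda) plus the tatweel U+0640 listed earlier in the thread; the name strip_tashkil is hypothetical.

```python
import re

# Fathatan..sukun (U+064B to U+0652) and tatweel (U+0640).
TASHKIL = re.compile("[\u064b-\u0652\u0640]")


def strip_tashkil(text):
    # Remove the diacritics and leave every other character untouched.
    return TASHKIL.sub("", text)
```

It would be applied to the string returned by extract_text, so the extraction itself stays unchanged.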

baaziznasser commented 2 years ago

@pubpub-zz yes, Notepad or any non-rich edit control will merge the Arabic character and the tashkil, because, as I said, tashkil marks like fatha, sukun and kasra are used like a, e, i, o and u with English letters, e.g. فَ is read fa, فِ = fi, فُ = fo ... So Notepad treats them like a single letter, but any rich edit control will separate them.

pubpub-zz commented 2 years ago

.... so Notepad treats them like a single letter...

I don't think so... These are 2 characters. It is not like French ^+e = ê ("\u00EA"), which is a single character.
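
The difference can be checked with Unicode normalization: NFC composes "e" plus a combining circumflex into the single code point U+00EA, while an Arabic letter followed by a fatha has no precomposed form and stays as two code points. A small sketch, with the hypothetical helper name nfc_length:

```python
import unicodedata


def nfc_length(text):
    # Number of code points left after canonical composition (NFC).
    return len(unicodedata.normalize("NFC", text))


print(nfc_length("e\u0302"))       # 1: composed into U+00EA
print(nfc_length("\u0641\u064e"))  # 2: FEH + FATHA stay separate
```

Whether an editor moves the cursor over the pair as one unit is then a display decision, not a property of the stored text.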

baaziznasser commented 2 years ago

@pubpub-zz yes, I am talking about the display only. For example, if you move with the left and right arrow keys or try to select a character, the tashkil is selected along with it; but if you try that in Microsoft Word or any rich edit control, the character is selected without the tashkil.