Closed baaziznasser closed 2 years ago
Hi @baaziznasser ,
Thank you for letting us know that there is an issue. It would have been much more helpful if you had used the standard bug template. I've added the relevant parts to your post. Please fill out the TODO
@MartinThoma done. As I said in the example, the Arabic text is returned reversed. Please try to solve that as soon as you can, because I really need it before school starts. I am working on a text reader for blind users. Help, please.
I made this function to try to solve the problem, but if there are numbers or non-Arabic characters within a word, they get reversed as well:
import re

ARABIC = re.compile(r'[ا-يئءؤأإآةًٌٍَُِّْ]')

def arabic_reverse(string):
    # Nothing to do if the text contains no Arabic characters.
    if not ARABIC.search(string):
        return string
    lines = []
    for line in string.split("\n"):
        if ARABIC.search(line):
            # The line comes out back to front: reverse it, then restore
            # the words that contain no Arabic characters.
            words = [w if ARABIC.search(w) else w[::-1] for w in line[::-1].split()]
            line = " ".join(words)
        lines.append(line)
    return "\r\n".join(lines)
Do you have an example pdf which you can upload that shows the issue?
@MartinThoma here is a test PDF. I'm sorry I wasn't able to upload it to the main post; I don't know why.
Under analysis. @baaziznasser, to confirm my understanding: the Arabic characters are reversed, am I right? I've tried to produce a very simple example using the three keys on the left of the keyboard: arab.pdf. In the file, the character codes (purely arbitrary codes) are
<034e> <0321> <032d>, which are then translated into "\u0626", "\u0634", "\u0636" = ('ئ', 'ش', 'ض'). The order could be reversed compared to print, but I do not know how to detect RTL (I would like to have the general rule). Any idea?
@pubpub-zz thanks for your reply. I don't fully understand what you said, but if you mean how to detect RTL text, I was using this; you can get an idea from it:
import unicodedata as UD
texts = ['مرحبا'.encode('utf-8').decode('utf-8'), 'Hello']
for text in texts:
    x = len([None for ch in text if UD.bidirectional(ch) in ('R', 'AL')])/float(len(text))
    print('{t} => {c}'.format(t=text.encode('utf-8'), c='RTL' if x>0.5 else 'LTR'))
Here is an example to make it clear for you. If we have a line with this text: هذا مثال على المشكل الذي يواجهني then when I used PyPDF2 I got this: ينهجاوي يذلا لكشملا ىلع لاثم اذه
It means each line is reversed from the first to the last character. I used this function to check and reverse it:
import re

ARABIC = re.compile(r'[ا-يئءؤأإآةًٌٍَُِّْ]')

def arabic_reverse(string):
    # Nothing to do if the text contains no Arabic characters.
    if not ARABIC.search(string):
        return string
    lines = []
    for line in string.split("\n"):
        if ARABIC.search(line):
            # The line comes out back to front: reverse it, then restore
            # the words that contain no Arabic characters.
            words = [w if ARABIC.search(w) else w[::-1] for w in line[::-1].split()]
            line = " ".join(words)
        lines.append(line)
    return "\r\n".join(lines)
But here I found a big problem. For example, if the PDF text is محمد21, I get دمحم21 from the library, and if I use this method to reverse the text I get محمد12. I hope everything is clear to you now, and I hope you can help me solve the problem.
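The digit problem described above comes from reversing every character the same way. Below is a minimal sketch of a run-based alternative, assuming the extracted line is in visual (left-to-right glyph) order: reverse the whole line, then reverse each run of ASCII letters and digits back. The names `has_rtl` and `visual_to_logical` are hypothetical, and this is only an approximation for simple lines, not the Unicode Bidirectional Algorithm.

```python
import re
import unicodedata

# Runs of ASCII letters/digits are treated as left-to-right (an assumption
# made for this sketch; real bidi handling covers many more characters).
LTR_RUN = re.compile(r"[0-9A-Za-z]+")


def has_rtl(text):
    """Return True if any character has bidi class R or AL."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def visual_to_logical(line):
    """Convert one visually ordered line to logical order (approximation)."""
    if not has_rtl(line):
        return line  # pure LTR line: leave untouched
    flipped = line[::-1]  # puts the RTL characters back in reading order
    # The flip also reversed numbers and Latin words, so flip those back.
    return LTR_RUN.sub(lambda m: m.group(0)[::-1], flipped)


# "21" followed by the reversed letters of the Arabic name (visual order).
visual = "21\u062f\u0645\u062d\u0645"
print(visual_to_logical(visual) == "\u0645\u062d\u0645\u062f21")
```

Several LTR words in a row (for example an English phrase inside an Arabic sentence) would still come out in swapped word order with this sketch; handling that would require the run to include the spaces between LTR words.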
@baaziznasser, what I mean is: how does the PDF viewer (Acrobat Reader or any other) know that the text is written right to left and not left to right? I'm having a look at PDF.js.
There is https://en.m.wikipedia.org/wiki/Right-to-left_mark, but I have no clue what pdf uses.
Is that symbol maybe present and we would need to "reset" the writing direction?
@MartinThoma yes, I think it exists. Here is a link from Microsoft that also talks about this: https://docs.microsoft.com/en-us/dynamics365/fin-ops-core/dev-itpro/user-interface/bidirectional-support
@pubpub-zz I also searched some HTML and JavaScript projects. I found this example and converted it to Python:
# This example was tested with Arabic and Hebrew.
import re
def rtl_check(txt):
    rtl_chars = u'\u0591-\u07FF\uFB1D-\uFDFD\uFE70-\uFEFC'
    ltr_chars = u'A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02B8\u0300-\u0590\u0800-\u1FFF'+'\u2C00-\uFB1C\uFDFE-\uFE6F\uFEFD-\uFFFF'
    regexp = '^[^'+ltr_chars+']*['+rtl_chars+']'
    match = re.match(regexp, txt)
    return match is not None
I also read this PDF, Understanding Bidirectional (BIDI) Text in Unicode.pdf, in which the writer tries to explain the idea of the RLM.
I hope you can find a way to do that.
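The regular expression in rtl_check effectively asks whether the first strongly directional character is an RTL one. The same idea can be sketched with the standard library's `unicodedata.bidirectional`, which avoids hand-written character ranges. The function name `first_strong_direction` is hypothetical; the logic follows the "first strong character" idea used for paragraph direction in the Unicode Bidirectional Algorithm, simplified here by ignoring isolates and embeddings.

```python
import unicodedata


def first_strong_direction(text):
    """Return "RTL", "LTR", or None based on the first strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew-like and Arabic letters
            return "RTL"
        if bidi_class == "L":  # Latin and other left-to-right letters
            return "LTR"
    return None  # only digits, spaces, punctuation: no strong character


samples = ["\u0645\u0631\u062d\u0628\u0627", "Hello", "123 \u05e9\u05dc\u05d5\u05dd", "42"]
for sample in samples:
    print(ascii(sample), first_strong_direction(sample))
```

Digits and punctuation are skipped because they are not strong characters, so a line that starts with a number followed by Hebrew or Arabic is still reported as RTL.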
@baaziznasser thanks for the RTL checks. It's very similar to what I've found and started to implement.
I've produced a first draft of the PR. I'm facing lots of issues:
Applied to the beginning of the document, these are my first results:
Ang-L1+sociology-globalisation:
:مرﻗ ة راﻟمحﺎﻀ1 : اﻟﻌوﻟمﺔglobalization :
: ﻟﻠﻌوﻟمﺔ ﺔ�خ�راﻟتﺎ ﺔ�فاﻟخﻠ
Do not hesitate to use _debug_for_extract() to get the raw code from the PDF and the translation code.
If you have some ideas, or if you can produce a very simple test sample, it would be welcome.
There is https://en.m.wikipedia.org/wiki/Right-to-left_mark, but I have no clue what pdf uses.
Is that symbol maybe present and we would need to "reset" the writing direction?
I did not find this character, which would have helped. 😥 Nothing has been found, either in the text or in the font, that tells us what to do...
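For anyone who wants to repeat that check on their own extracted text, here is a small sketch that scans a string for explicit Unicode directional formatting characters (RLM, LRM, embeddings, overrides, isolates). The helper name `find_direction_marks` is hypothetical; the code points themselves are standard Unicode.

```python
import unicodedata

# Explicit directional formatting characters defined by Unicode:
# LRM, RLM, ALM, embeddings/overrides (U+202A..U+202E), isolates (U+2066..U+2069).
DIRECTION_MARKS = set("\u200e\u200f\u061c") | {
    chr(cp) for cp in list(range(0x202A, 0x202F)) + list(range(0x2066, 0x206A))
}


def find_direction_marks(text):
    """Return (index, character name) for every directional mark in text."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in DIRECTION_MARKS
    ]


print(find_direction_marks("abc \u200f\u0645\u0631\u062d\u0628\u0627"))
print(find_direction_marks("plain text"))
```

An empty list means the text carries no explicit direction hints, which matches what was observed in this thread.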
@baaziznasser
Can you have a look at the output from pdfminer.six: is it good?
pdfminer
No, same problem. I tried converting the page to an image and sending it to the Tesseract Python library, and it extracted the text without problems. I am now making a sample PDF: I am writing it in Word and will save it as PDF, just to help us with tests.
pdf_test.pdf @pubpub-zz here is a PDF sample. I will also test with it and give you the result.
@baaziznasser thanks for the test file
in order to get rid of issues of display, I'm using pyperclip to copy the text and paste it into notepad
@pubpub-zz here are the first words as you said
ت ج ر ب ة ا س ت خ ل ا ص ن ص م ن م ل ف
When tested with Adobe Acrobat, the text is almost correct.
@pubpub-zz here is the text that i wrote on word
تجربة استخلاص نص من ملف pdf حيث تهتم هذه التجربة بتقييم جودة التحويل الخاص بمكتبة pypdf2 وكيف تستخلص اللغة العربية. في هذا السطر ستكون الكتابة كلها عربية، وسوف أستخدم علامات الترقيم العربية أيضا، فيا ترى كيف ستكون الجودة هنا؟ • أما هذا السطر سوف أحاول أن أدخل بعض الكلمات بلغات أخرى مثل enter باللغة الإنجليزية english language والتي يقصد بها دخول. وفي هذا السطر سوف أحاول أن أكتب كلمات عربية ملتصقة مع أرقام على سبيل المثال محمد21 وناصر58 و123محمود و 54سامي85 أما هنا سوف تكون الفواصل علامات وليس أرقام كمثلا محمد-العربي وأحمد_الجزائري كما هنالك ناصر_abcd. at the end hope this pdf sample is good for testing the rtl language order.
Things seem to improve: the results in Notepad (comparing Adobe Reader / PyPDF2)
I've paste an image as the results when pasting in github is different: تجربة استخالص نص من ملفpdf حيث تهتم هذه التجربة بتقييم جودة التحويل الخاص بمكتبة pypdf2 .وكيف تستخلص اللغة العربية في هذا السطر ستكون الكتابة كلها عربية، وسوف أستخدم عالمات الترقيم العربية أيضا، فيا ترى كيف ستكون الجودة هنا؟ • أما هذا السطر سوف أحاول أن أدخل بعض الكلمات بلغات أخرى مثل enter باللغة اإلنجليزية english language والتي يقصد .بها دخول وفي هذا السطر سوف أحاول أن أكتب كلمات عربية ملتصقة مع أرقام على سبيل المثال محمد21 وناصر58 و123 محمود و 54سامي85 أما هنا سوف تكون الفواصل عالمات وليس أرقام كمثال محمد- _الجزائري كما هنالك ناصر_العربي وأحمد abcd . at the end hope this pdf sample is good for testing the rtl language order .
What I've noticed:
Can you check this proposal as well? Also, for my understanding: am I right in the way I read the sentence?
@pubpub-zz take a look at this image from Notepad. The first line is normal Arabic text, but in LTR direction; this is not a problem, because we can at least read it. The second line is from PyPDF2. As you can see, the text is LTR but not like the first line: here the words cannot be read, because the order is totally reversed, so مرحبا becomes ابحرم. I think the only way is to check for RTL characters and reverse them. I know it will take more time to process, but it could be an optional parameter of the extract_text function. I searched a lot for the best way to do that, and I couldn't find anything that solves this problem without checking characters one by one or word by word.
@baaziznasser, have you tried to deploy PR #1305? If you are not using git, just replace _page.py with https://raw.githubusercontent.com/py-pdf/PyPDF2/5ec7f67d6a7cda4b989128eed4c93b3ef6632a75/PyPDF2/_page.py
@pubpub-zz I am really sorry, I have only just tested the PR. For my test it worked very, very well: every word is in its place.
Give me some time, I'll test it with more files to see if there are any problems.
Good!!!! 🎉🎉🎉🎊
While I clean up, can you do some further testing? Can you also give your opinion on what I've noticed?
What I've noticed:
* on line 1, "pdf" is on the right and not on the left (compared to direct PDF viewing), and at the correct position when pasting into a web page: the text is in a separate group; no solution found yet, as the text is already appended to the buffer...
* on line 3, the dot is on the left, not on the right: because it is the first character on the line, it is considered LTR and so stays on the left; currently no solution identified.
* some spaces (due to text group readjustments) are introduced.
Can you check this proposal as well? Also, for my understanding: am I right in the way I read the sentence?
Thanks
@pubpub-zz sorry, I forgot to reply to your message. About your reading: yes, you are reading the text the correct way. About your notes: on line 1, the word (pdf) is as you said. On line 3, I made an error: I wrote a long line, so when it was converted to PDF, Adobe Acrobat DC divided the line, and when it did the word wrapping the (.) ended up on the left.
@pubpub-zz to be honest, right now PyPDF2 gives the same result as Adobe Acrobat and Google Chrome: I tested the library with 10 PDFs and got the same result. There are some things that, if you do them, would make PyPDF2 better than all the readers. I know it is not easy, but at least we can try. In RTL languages, and specifically in Arabic, short vowels are not written as letters; we use something called tashkil, and the symbols are (ًٌٍَُِّْ). When the PDF has these symbols, the text is full of errors in PDF viewers. I will change the pdf_test file to add these symbols and see how it comes out. I also found a library named "arabic_reshaper" that you can take a look at; it can deal with these symbols. But as I said, the library's result is now the same as what we get from Adobe Reader and Google Chrome.
here is the pdf test
@pubpub-zz here is the result from adobe acrobat
and here is the result from the pypdf2
I found that PyPDF2 is better than Adobe here. To make the test more complete, here is a result from Google Chrome:
This is what I get: viewer, copy and paste from Acrobat Reader, and PyPDF2:
Too difficult for me to know if it is good or bad...
Can you clarify in what way Google Chrome (PDF.js actually) is better? Is it also better after copy/paste?
@pubpub-zz I am really sorry to be bothering you, but I want this very good library to work as well with Arabic as it does with English. There is a problem with the tashkil (َ ً ُ ٌٌ ْ ِ ٍ ّ ـ): if you noticed, in your image from the viewer it is كَيْفَ, and in the other image it is (ﻛَﯾْ ﻓِﻲ). The correct form is كَيْفَ. I hope we can find a way to deal with these symbols.
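The second form quoted above appears to be made of Arabic presentation-form code points (positional glyph variants) rather than the base letters. One possible post-processing step, sketched here with only the standard library, is to fold those code points back to base letters with NFKC compatibility normalization. The helper name `fold_presentation_forms` is hypothetical, and nothing in this thread confirms that PyPDF2 does this.

```python
import unicodedata


def fold_presentation_forms(text):
    """Map Arabic presentation forms to base letters; leave the rest as-is."""
    out = []
    for ch in text:
        # Arabic Presentation Forms-A (U+FB50..U+FDFF) and
        # Arabic Presentation Forms-B letters (U+FE70..U+FEFC).
        if "\ufb50" <= ch <= "\ufdff" or "\ufe70" <= ch <= "\ufefc":
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


# KAF INITIAL FORM + fatha becomes the base letter kaf + fatha.
folded = fold_presentation_forms("\ufedb\u064e")
print([f"U+{ord(ch):04X}" for ch in folded])
```

Normalization is applied per character and only inside the presentation-form ranges, so Latin text and ordinary Arabic letters pass through unchanged. This does not repair missing letters or stray spaces, only the code points used.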
No problem. Can you check what the results are when you copy/paste from Google Chrome? The point is that viewing and pasting use two different transcoding sources, and we cannot rely on the way display works.
@pubpub-zz Google Chrome displays the text well, but when we select and copy the text, it is not well organized, at least for screen readers. This problem occurs only when tashkil symbols are used.
I think if we cannot deal with tashkil, we can make an option to replace tashkil with nothing; maybe that will solve the problem. Of course that is not recommended, because some text needs these symbols to be understandable.
I may have an explanation :
I started some tests with " ْ " (sukun?)
When I paste this character into the GitHub page editor and then type a character (even a Roman character), the letter is drawn at the same position as the mark, e.g. " ْd "
But when I do the same in Notepad, the characters are separated.
I think if we cannot deal with tashkil, we can make an option to replace tashkil with nothing; maybe that will solve the problem. Of course that is not recommended, because some text needs these symbols to be understandable.
I dislike the idea of changing the extraction: this can be done with a simple replace afterwards, no? @MartinThoma, your opinion?
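The post-extraction replace suggested here could look like the following sketch, applied to the string returned by extract_text. The function name `strip_tashkil` and its `remove_tatweel` flag are hypothetical; the character range U+064B to U+0652 covers fathatan through sukun.

```python
import re

# Arabic tashkil: fathatan (U+064B) through sukun (U+0652).
TASHKIL = re.compile("[\u064b-\u0652]")
# Tatweel (U+0640) is an elongation stroke, not a vowel mark, so it is optional.
TATWEEL = "\u0640"


def strip_tashkil(text, remove_tatweel=False):
    """Remove Arabic vowel marks (and optionally tatweel) from text."""
    text = TASHKIL.sub("", text)
    if remove_tatweel:
        text = text.replace(TATWEEL, "")
    return text


# kaf+fatha, yeh+sukun, feh+fatha becomes plain kaf, yeh, feh.
vocalized = "\u0643\u064e\u064a\u0652\u0641\u064e"
print(strip_tashkil(vocalized) == "\u0643\u064a\u0641")
```

Keeping this as a separate step leaves the extraction itself untouched, so callers who need the marks for correct reading still get them.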
@pubpub-zz yes, Notepad or any non-rich-text editor will merge the Arabic character with the tashkil, because, as I said, tashkil marks like fatha, sukun and kasra are used like a, e, i, o and u in English, e.g. فَ reads fa, فِ = fi, فُ = fo ... So Notepad treats them as a single letter, but any rich-text editor will separate them.
.... so the notepad deal with them like a single letter...
I don't think so... These are 2 characters. It is not like French ^+e = ê ("\u00EA"), which is a single character.
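The difference can be checked with Unicode normalization. In this small sketch, "e" plus a combining circumflex composes into one code point under NFC, while an Arabic letter plus fatha stays as two code points because no precomposed form exists for that pair. The helper name `nfc_length` is only for illustration.

```python
import unicodedata


def nfc_length(text):
    """Number of code points after NFC (canonical composition)."""
    return len(unicodedata.normalize("NFC", text))


french = "e\u0302"       # e + COMBINING CIRCUMFLEX ACCENT
arabic = "\u0641\u064e"  # ARABIC LETTER FEH + ARABIC FATHA

print(nfc_length(french), unicodedata.normalize("NFC", french) == "\u00ea")
print(nfc_length(arabic))
```

So whatever an editor does with the cursor or the selection, the string itself still holds the letter and the mark as separate characters.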
@pubpub-zz yes, I am talking about the display only. For example, if you move with the left and right arrow keys or try to select a character, the tashkil is selected with it; but if you try that in Microsoft Word or any rich-text editor, the character is selected without the tashkil.
There is a big problem with arabic text extraction.
If we have a string that says (مرحبا هذه تجربة), the PyPDF2 extract_text function returns it like this: (ةبرجت هذه ابحرم).
Environment
Code + PDF
This is a minimal, complete example that shows the issue with file.pdf:
It gives:
but it's partially reversed, e.g. the beginning
should be