pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

TextWriter.writeText : Arabic texts are saved in the document from left to right instead of right-left #897

Closed ABanerji closed 3 years ago

ABanerji commented 3 years ago

The TextWriter.writeText function writes Arabic text left-to-right in the output PDF. My input text is in the correct order, i.e. right-to-left, but when I write it to the PDF, the whole string is reversed.

input_str = "دبي: طرح مجموعة جديدة من الإجراءات الاحترازية في دبي"

outfile shows: يبد يف ةيزارتحالا تاءارجإلا نم ةديدج ةعومجم حرط :يبد

Code snippet:

import fitz as fz  # PyMuPDF

def write_documents2(cord_list, translate_text, page):
    fill_rect = fz.Rect(cord_list)
    writer = fz.TextWriter(page.rect)
    lp = writer.lastPoint

    writer.fillTextbox(  # fill in above text
        fill_rect,  # keep text inside this
        translate_text,  # the text
        align=fz.TEXT_ALIGN_RIGHT,  # alignment
        warn=True,  # keep going if too much text
        fontsize=12,
        font=fz.Font("figo"),
        pos=fill_rect.top_right,
    )
    writer.writeText(page)

doc.save("Translated_doc.pdf", garbage=4, deflate=True)

JorjMcKie commented 3 years ago

It has not been explicitly mentioned, but none of the TextWriter functions support right-to-left or top-to-bottom writing modes yet.

ABanerji commented 3 years ago

Thanks for your reply. Can you please suggest if I can use any other writer in this library which can help to get the outfile having right-to-left text?

JorjMcKie commented 3 years ago

I want to support it! I am currently just undecided how best to do this. The problem is that, depending on the font used, you could write text consisting of an arbitrary mixture of Arabic, Chinese, Latin, ... pieces, e.g. using the font "FiraGO". And of course there is no way for PyMuPDF to analyze the text in that respect, so you will have to cooperate here 😉. If a text is purely right-to-left, however, a solution may be to simply reverse the characters ... per line. But that requires a new parameter to tell me so, e.g. ltr=True in the "normal" (Latin) case and ltr=False for Arabic, Hebrew, and others.

For the time being, you could try this out with TextWriter.append(pos, "".join(reversed(text)), font=...). That should output the text correctly. For TextWriter.fill_textbox() I would then have to make a similar change: this method is nothing more than a series of append() calls with the prepared text for each line. So, every time I have composed the text that goes into the next line, I will check whether ltr == False and reverse the character sequence in that case.

Maybe even TextWriter.fill_textbox() already works that way. I will try that out as well.

JorjMcKie commented 3 years ago

Good news - I hope at least: using text="".join(reversed(input_str)) the following worked fine without changing anything: [screenshot]

It will not work correctly if you specify a list of strings, or a string containing "\n" line breaks, instead of one continuous string. So this is the change I need to make. For the time being, however, you should be able to work by reversing the character sequence as discussed.
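
As a rough illustration of the per-line reversal discussed above, here is a minimal pure-Python sketch. The helper name reverse_per_line is hypothetical and not part of PyMuPDF, and the sketch assumes the text is purely right-to-left and that every line already fits the target width.

```python
def reverse_per_line(text):
    """Reverse the characters of every line, keeping the line order."""
    # Accept either one string with "\n" breaks or a list of strings.
    lines = text.split("\n") if isinstance(text, str) else list(text)
    return ["".join(reversed(line)) for line in lines]
```

Each returned line could then be written with its own TextWriter.append call at its own position. Note that a plain character reversal also moves combining marks in front of their base letters, so this is only a stopgap.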

ABanerji commented 3 years ago

Thanks, Jorj, for looking into this and giving a workaround. I certainly owe you one 👍 Also, as you mentioned, a new flag (or flags) to support languages that do not follow the left-to-right convention could be a new feature for this tool.

ABanerji commented 3 years ago

Hi Jorj, the string is now reversed, but the statement is distorted in the sense that you have to read it bottom-up.

Also, I'm getting the warnings below, and a lot of text is now missing, which was not the case earlier.

Warning: only fitting 1 of 11 total words.
Warning: only fitting 1 of 11 total words.
Warning: only fitting 1 of 10 total words.

ABanerji commented 3 years ago

Hi Jorj, another thing I wanted to highlight while using append: 1) I lose control of the alignment. 2) The output text now has a ? at the end of each line. Can you suggest whether I'm missing anything here? Please find the translated doc attached for reference: Translated_doc.pdf

JorjMcKie commented 3 years ago

Ok - not as simple a solution as I hoped it to be.

Hi Jorj, the string is now reversed, but the statement is distorted in the sense that you have to read it bottom-up.

You are right, of course. So I additionally need to cache the output lines in a list and reverse that list at the end, too.
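
The bottom-up effect and the fix of reversing the list of lines can be sketched in plain Python. This is only an illustration under simplifying assumptions: wrap_reversed is a hypothetical helper, line width is counted in characters rather than measured with a font, and the text is purely right-to-left.

```python
def wrap_reversed(text, max_chars):
    """Reverse the text, wrap it greedily, then restore top-to-bottom order."""
    reversed_text = text[::-1]
    lines, current = [], ""
    for word in reversed_text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    # Wrapping the reversed string puts the END of the text on the first
    # line, so the list of lines has to be reversed as well.
    lines.reverse()
    return lines
```

Without the final lines.reverse(), the first output line would hold the last words of the original text, which is the bottom-up reading order reported above.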

1) I lose control of the alignment. (append)

Yes, alignment is only supported with fill_textbox, in any language. To use append, you can work around this for now by first calculating the text length using text_length(text, fontsize=11) and then adjusting the insertion position accordingly. Later, once we have a new ltr parameter, the position parameter might have to be the right-hand coordinate. Hm, this needs to be thought through.
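
A small sketch of that position adjustment follows. The helper right_aligned_x and the stand-in measuring function are hypothetical; in real use the measure argument would be a font's text_length method, as mentioned above.

```python
def right_aligned_x(right_edge, text, measure, fontsize=11):
    """Return the x coordinate where `text` must start to end at `right_edge`."""
    return right_edge - measure(text, fontsize=fontsize)


# Stand-in for a font's text_length so the sketch runs without PyMuPDF:
# every character is assumed to be half the font size wide.
def fake_text_length(text, fontsize=11):
    return len(text) * fontsize * 0.5


x = right_aligned_x(500, "abcd", fake_text_length, fontsize=10)
```

With PyMuPDF, one would presumably pass fz.Font("figo").text_length as measure and use the resulting x together with the baseline y as the pos argument of append.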

2) The output text now has a ? at the end of each line.

I need to check this out.

JorjMcKie commented 3 years ago

@ABanerji - for developing a solution, can you please send me the text input you used for creating that PDF? Or even the script?

ABanerji commented 3 years ago

Please find the text below.

{'Page_1': ['دبي: إدخال مجموعة جديدة من الإجراءات الاحترازية في دبي دفعته موجة من المخالفات الأخيرة استدعت اتباع نهج وقائي للفنادق ذات الإشغال المرتفع لتجنب تكرارها كشف مسؤول رفيع لأخبار الخليج', ' في مقابلة خاصة يوم الخميس, حدد أحمد الفلاسي, الرئيس التنفيذي لخدمات الشركات والاستثمار, إدارة السياحة والتسويق التجاري (دبي للسياحة), مجموعة الإجراءات الاحترازية الجديدة ضد COVID-19 التي انطلقت في 2 فبراير', ' وستكون الإجراءات التي أعلنتها اللجنة العليا لإدارة الأزمات والكوارث في دبي سارية المفعول حتى 28 فبراير الجاري إشغال فندق الفلاسي قال جميع المنشآت الفندقية ستعمل بمستوى إشغال 70 في المائة', ' بالنسبة للفنادق القائمة التي تعمل على مستوى الإشغال فوق 70 في المائة لا يتم إجراء أي حجوزات أو تمديدات جديدة حتى تلتزم الفنادق بالحد المذكور بالإضافة إلى تأجيل أي نشاط يؤدي إلى تجمعات كبيرة مثل فطور نهاية الأسبوع المتأخر', ' وقال, تم اعتماد الإجراءات الاحترازية الجديدة نتيجة المخالفات الأخيرة', ' تم اتباع نهج وقائي للفنادق ذات الإشغالات العالية لتجنب حدوث المخالفات', ' وحذر المسؤول من أن عدم الالتزام بالإجراءات الاحترازية سيترتب عليه تحرك جاد ضد المكان', ' سلامة ورفاهية الجميع هي أولويتنا القصوى ونتطلع إلى استمرار دعم الفنادق في دبي في ضمان الالتزام بكافة الإجراءات الاحترازية وأضاف', ' هل يمكن لحانات وحانات دبي تقديم الطعام ؟ وقال الفلاسي, إنه وفقا للإجراءات الاحترازية الجديدة التي وضعتها اللجنة العليا لإدارة الأزمات والكوارث في دبي, فإن جميع الحانات والبارات \x0c'], 'Page_2': ['يجب إغلاقه مؤقتًا من 2 فبراير حتى نهاية فبراير 2021', ' متطلبات اختبار PCR لسياحي دبي وقال يجب على جميع الركاب القادمين إلى دبي من أي نقطة منشأ (دول مجلس التعاون شملت) حمل شهادة اختبار COVID 19 PCR سلبية لاختبار يتم إجراؤه قبل المغادرة بما لا يزيد عن 72 ساعة', ' وقال دبي من أكثر مدن العالم أمانا للزيارة بمجموعة واسعة من البروتوكولات المعمول بها لضمان سلامة السياح في كل مرحلة ونقطة لمس رحلة سفرهم من الوصول إلى المغادرة', ' الإجراءات الاحترازية الفعالة التي نفذتها دبي أقرها المجلس العالمي للسفر والسياحة (WTTC) الذي أعطى المدينة طابع رحلات آمنة', ' تغطي بروتوكولات السفر الجوي جميع أنواع السياح الدوليين 
وتتضمن شهادة اختبار PCR سلبية يجب عليهم تقديمها عند الوصول إلى مطارات دبي مع إجراء الاختبار قبل المغادرة بما لا يزيد عن 72 ساعة, حسبما ذكر', ' تعقيم الفنادق وقال الفلاسي نواصل إلى جانب الجهات الأخرى ذات العلاقة إعطاء الأولوية والمراقبة والتأكد من سلامة ورفاه الجميع والعمل بشكل وثيق مع الجهات المعنية لدينا لضمان الالتزام بالبروتوكولات الصارمة الصادرة', ' وجدد التأكيد على إلزام جميع المنشآت الفندقية بما فيها منشآتها ومطاعمها بتطبيق التباعد الاجتماعي بين المطعمين وصانيتي المناطق المشتركة بشكل متكرر وتشجيع المدفوعات اللاتلامسية وحمل جميع العاملين على ارتداء كماماتهم فيما يمكن للمطعمين إزالة كماماتهم عند الجلوس على مائدتهم', ' \x0c'], 'Page_3': ['وقال إن دبي قدمت أيضا خاتم دبي المؤمن في يوليو بعد منحها خاتم الرحلات الآمنة من قبل WTTC (المجلس العالمي للسفر والسياحة)', ' ويعد الطابع بمثابة هوية مرئية تطمئن النزلاء بأنه تم الالتزام بكافة إجراءات السلامة والنظافة المقررة من الجهات عبر كافة مستويات وفئات نقاط اللمس السياحية والمقيمة كالفنادق والمعالم السياحية والمطاعم', ' "خاتم دبي المؤمن" ساري المفعول لمدة 15 يوما ويجدد تماشيا مع عملية التفتيش المنتظمة', ' وسنواصل العمل عن كثب مع شركائنا والقيام بعمليات تفتيش منتظمة للمواقع للتأكد من الالتزام بجميع بروتوكولات السلامة ، وعدم الامتثال سيسفر عن تحرك جدي ضد المؤسسة ، حسب قوله', ' في نهاية المطاف جميعنا مسؤولون عن التغلب على هذا الوقت المليء بالتحديات وأضاف دوز ولا للمقيمين والسياح جميع السكان والسياح مطالبون بالالتزام بكافة الإجراءات الاحترازية والأهم ارتداء كماماتهم في الأماكن العامة والحفاظ على التباعد الاجتماعي', ' \x0c']}

JorjMcKie commented 3 years ago

I am back with a first cut of the new logic. Your example text was very valuable, because it contains a mixture of Arabic and Latin. It helped me realize that such a text is quite complex to handle, because the Latin text inside the surrounding Arabic is still written left-to-right! Must be a pain in the neck for you guys ... Anyway, can you please check whether the 3-page PDF I created is correct? arabian.pdf
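
To illustrate why mixed text is harder than a plain reversal, here is a deliberately simplified sketch. It is not the logic used in PyMuPDF and not the Unicode Bidirectional Algorithm: the helper visual_order and its regular expression are assumptions that only treat ASCII letters and digits (optionally joined by a space or hyphen) as left-to-right runs.

```python
import re

# Runs of ASCII letters/digits, optionally joined by single spaces or hyphens.
LTR_RUN = re.compile(r"[A-Za-z0-9]+(?:[ \-][A-Za-z0-9]+)*")


def visual_order(text):
    """Reverse a right-to-left line but keep embedded Latin runs readable."""
    reversed_text = text[::-1]
    # After the global reversal the Latin runs are backwards, so each run
    # is reversed a second time to restore its left-to-right order.
    return LTR_RUN.sub(lambda m: m.group(0)[::-1], reversed_text)
```

For example, a line containing "PCR 72" between Arabic words comes out with the Arabic words in reversed character order while "PCR 72" stays intact. Real documents need a full bidirectional algorithm to handle punctuation, numbers next to Arabic letters, and nested runs.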

ABanerji commented 3 years ago

Hi Jorj, the output looks promising. The alignment also looks good. So are there any major changes in the library, or has a new function been added? When can you make the first cut available?

JorjMcKie commented 3 years ago

Hi Jorj, the output looks promising.

Ha! Great! I had to change TextWriter.fill_textbox, of course. It will receive a new parameter, right_to_left, which defaults to False. TextWriter.append will get this new parameter as well. Internally, I rewrote fill_textbox. I am still missing support for TEXT_ALIGN_JUSTIFY; only left, center and right alignment work so far. Also missing is support for text containing words that are longer than the width of the textbox to be filled. Over the weekend, I can let you have a pre-release version, so you can start your own testing. I hope I can publish an official new version by the end of next week.

JorjMcKie commented 3 years ago

Please look here for Linux wheels, or here in 20 minutes for OSX wheels. I implemented the remaining features mentioned in the previous post (long words, justified text). Please provide any feedback as soon as you can!

ABanerji commented 3 years ago

Thanks Jorj! I will test this tomorrow and let you know the feedback.

JorjMcKie commented 3 years ago

this is the script I used: arabian.zip

ABanerji commented 3 years ago

Hi Jorj, can you also create a whl for Windows? I'm using win10; will the OSX whl work to test this version?

JorjMcKie commented 3 years ago

Yes to both questions. What is your Windows config? I will put a whl in this thread. 64bit? Python version?

ABanerji commented 3 years ago

Hi Jorj, just some feedback on this. I gave this PDF to our Arabic translator, and here are the comments:

1) The alignment is accurate, i.e. from left to right, and line positioning is also maintained.
2) The input Arabic text is not broken, i.e. it consists of words and not separate characters, while the rendered output shows Arabic characters that are not connected to form proper words.

I searched for this on the internet and found that arabic reshaper can be used for these issues. But now the output is a bit weird, with wild chars in it. Please find the output file attached: templateQA-Translated_doc.pdf
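
As background on the "not connected" characters: ordinary Arabic text stores base letters, and the joined shapes are either chosen by a text shaper or encoded explicitly as Unicode presentation forms, which is what a reshaping library is assumed to produce here. The following standard-library sketch only shows the difference between the two encodings; the helper names are hypothetical.

```python
import unicodedata

MEEM = "\u0645"          # base letter, as stored in ordinary Arabic text
MEEM_INITIAL = "\ufee3"  # presentation form used at the start of a joined word


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def to_base_letters(text):
    """Map presentation forms back to base letters via NFKC normalization."""
    return unicodedata.normalize("NFKC", text)
```

If reshaped text shows unexpected characters, one thing worth checking is whether the chosen font actually contains glyphs for the presentation-form code points.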

ABanerji commented 3 years ago

I'm using win10 64bit and Python - 3.8.3

JorjMcKie commented 3 years ago

PyMuPDF-1.18.9-cp38-cp38-win_amd64.whl.zip

ABanerji commented 3 years ago

Hi Jorj,

this does not seem to be working for me. In fact, it's not writing anything and it goes into an infinite loop. Below are the steps to reproduce:

1) I'm reading the uploaded file "template_QA.pdf" and extracting the page details as a "dict" -> page_dict = page.get_text("dict")
2) Then I extract the bbox, size and text for each block.
3) Then I translate the English text to Arabic using any public service.
4) Then I write the translated text back to the outfile with the same bbox, font_size, color, bold/not-bold etc.

My writer function:

def write_documents2(rect, text, page, pos, font_name, font_size, color):
    # rev_sentence() and translate_text are not defined in this snippet
    text = rev_sentence(translate_text)

    fill_rect = fz.Rect(rect)
    writer = fz.TextWriter(page.rect, color=color)

    # pick a FiraGO variant based on the original font name
    fnt_name = None
    if "bold" in font_name.lower():
        fnt_name = "figbo"
    elif "medi" in font_name.lower():
        fnt_name = "figbo"
    elif "regu" in font_name.lower():
        fnt_name = "figo"
    else:
        fnt_name = "figit"

    ar_text = arabic_reshaper.reshape(text)
    lt_text = [ar_text, "\n"]

    print("AR SHAPER TEXT", ar_text)
    print("LIST TEXT", lt_text)
    try:
        writer.fillTextbox(  # fill in above text
            fill_rect,  # keep text inside this
            ar_text,  # the text
            align=fz.TEXT_ALIGN_RIGHT,  # alignment
            warn=True,  # keep going if too much text
            fontsize=font_size,
            font=fz.Font(fnt_name),
            right_to_left=True,
            # pos=point
        )
        writer.writeText(page)
    except Exception as e:
        print("EXCEPTION DETAILS - ", e)
    return

Observations:

1) When I set the new flag, it doesn't seem to write anything. I'm getting the warning "Warning: Only fitting 0 of 2 lines." I also tried passing the text as a list, but that doesn't work either.

AR SHAPER TEXT ﻳﻈﻬﺮ ﺍﻟﺠﺪﻭﻝ 1 ﻗﺎﻟﺒﻴﻦ ﻟﻼﺳﺘﻌﻼﻡ ﻋﻦ ﺍﻟﻜﻠﻤﺎﺕ ﻳﺴﺘﺨﺪﻣﻬﻤﺎ Fader et
LIST TEXT ['ﻳﻈﻬﺮ ﺍﻟﺠﺪﻭﻝ 1 ﻗﺎﻟﺒﻴﻦ ﻟﻼﺳﺘﻌﻼﻡ ﻋﻦ ﺍﻟﻜﻠﻤﺎﺕ ﻳﺴﺘﺨﺪﻣﻬﻤﺎ Fader et', '\n']
Warning: Only fitting 0 of 2 lines.

template_QA.pdf

JorjMcKie commented 3 years ago

These issues seem not related to writing arabic text correctly as such.

  1. The script I sent back to you works? All characters in correct sequence, etc?
  2. If you see "Only fitting ..." then this means the provided rectangle is too small - nothing else. Obviously you cannot assume that the text translated to Arabic fits in the same space as the English before.
  3. Experiment with the fontsize: check font.text_length(text, fontsize=...) and choose a value such that text_length is less or equal the rectangle width you are providing.
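
The font size experiment in point 3 can be sketched as follows. This is a minimal illustration that assumes the measured text length scales linearly with the font size and that the text goes on a single line; fontsize_to_fit and the stand-in measuring function are hypothetical names, with measure meant to be replaced by a font's text_length method.

```python
def fontsize_to_fit(text, max_width, measure, fontsize=12.0):
    """Return a font size at which `text` is at most `max_width` wide."""
    width = measure(text, fontsize=fontsize)
    if width <= max_width:
        return fontsize
    # Assumes width is proportional to fontsize, so scale down linearly.
    return fontsize * max_width / width


# Stand-in for a font's text_length so the sketch runs without PyMuPDF:
# every character is assumed to be half the font size wide.
def fake_text_length(text, fontsize=11):
    return len(text) * fontsize * 0.5
```

For multi-line text boxes the same idea would have to be applied to the wrapped lines and the box height, which this sketch does not attempt.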

JorjMcKie commented 3 years ago

Any news from you? I will publish the new version 1.18.9 over this weekend. So there is only little time left to include more changes.

JorjMcKie commented 3 years ago

Fixed in v1.18.9 currently being uploaded.

ABanerji commented 3 years ago

Hi Jorj, I only read your message today; I was busy with work. I guess the changes work fine; the only thing is that you need to adjust either the fonts or the block size. One more thing: for documents with 2 columns of text, once translated, the columns should swap positions as per Arabic norms, i.e. the left becomes right and the right becomes left. At present I'm writing logic to shift the blocks based on the page's rect. Can you think of a better way to handle this in the library itself?

JorjMcKie commented 3 years ago

At present I'm writing a logic to shift the blocks by calculating the page's rect. Can you think of a better way to handle this in the library itself.

Laying out a page for text output is always the programmer's responsibility. The same is true for left-to-right text. Not to mention complications like headers and footers, which mostly take no part in the text columns, or images around which text has to flow ... And then your situation is special in the sense that you are translating from English original documents.
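
For the column swap that is left to the programmer, one possible approach is to mirror each block's bounding box about the vertical center line of the page. The sketch below is only an assumption about how such logic might look: mirror_rect is a hypothetical helper working on plain (x0, y0, x1, y1) tuples, and it ignores headers, footers and images.

```python
def mirror_rect(bbox, page_width):
    """Mirror a bounding box horizontally within a page of `page_width`."""
    x0, y0, x1, y1 = bbox
    # The old right edge becomes the new left edge and vice versa,
    # so the box keeps its width and vertical position.
    return (page_width - x1, y0, page_width - x0, y1)
```

Applied to every text block of a two-column page, this moves the left column to the right and the right column to the left while keeping the margins symmetric.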