pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

insert_htmlbox does not print out characters if there is a mix of non English characters and English characters #3605

Closed smithah closed 1 week ago

smithah commented 1 week ago

Description of the bug

I am trying to write the following text to a PDF, but the space where the English characters should appear is blank: "अधिक जानकारी के लिए customerservice@axismf.com। निवेशकों को केवल पंजीकृत म्यूचुअल फंड से ही लेनदेन करना चाहिए, जिसका विवरण www.sebi.gov.in पर उपलब्ध है -"

    page.insert_htmlbox(  # page is a PDF Page object
        (32.649353027344, 688.83270263672, 537.9233528971674, 703.54669189453),  # rectangle inside the page
        "अधिक जानकारी के लिए customerservice@axismf.com। निवेशकों को केवल पंजीकृत म्यूचुअल फंड से ही लेनदेन करना चाहिए, जिसका विवरण www.sebi.gov.in पर उपलब्ध है -",  # text string or a Story object
        css="body {font-size:7pt;font-family:Noto Sans Devanagari Regular;color:#373334;text-decoration: underline;}",
        scale_low=0,  # limit scaling down when fitting content
        archive=None,  # points to locations of fonts and images
        rotate=0,  # clockwise rotate content by this angle
        oc=0,  # assign xref of an OCG (conditional visibility)
        opacity=1,  # make content transparent (default: 1 = no)
        overlay=True,  # put in foreground (default) or background
    )

The text is not rendered. I tried opening the PDF in Acrobat Reader and in the Chrome browser as well; the text, though present, is not visible. Need help on this. Thanks

How to reproduce the bug

Run the insert_htmlbox call from the description above with the same text; the space where the English characters should appear comes out blank in both Acrobat Reader and the Chrome browser.

PyMuPDF version

1.24.5

Operating system

Windows

Python version

3.10

smithah commented 1 week ago

I have also called page.clean_contents(sanitize=True) after each insert_htmlbox call, but this line is still not printed properly.

smithah commented 1 week ago

"अधिक जानकारी के लिए customerservice@axismf.com निवेशकों को केवल पंजीकृत म्यूचुअल फंड से ही लेनदेन करना चाहिए, जिसका विवरण www.sebi.gov.in पर उपलब्ध है -"

It's because of the character highlighted in bold in the text string.

smithah commented 1 week ago

If the pipe character is present in the HTML text to be inserted, I replaced it with an HTML entity, following this advice: "Alternatively, use one of the HTML entities for the pipe character, e.g. &#124; (or the more meaningful &vert;)".

Now it prints the characters correctly. Is there any other way to handle special characters in the text passed to insert_htmlbox? Thanks

smithah commented 1 week ago

Using either of the two HTML entities (&#124; or &vert;) instead of the pipe symbol solves the issue for now: I find the symbol in the text and replace it with one of those entities before inserting.
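
The find-and-replace workaround described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper name to_numeric_entities and the sample string are made up for this example, only the pipe character is escaped by default because it is the only one reported in this thread, and the insert_htmlbox call is shown as a comment rather than executed.

```python
def to_numeric_entities(text: str, chars: str = "|") -> str:
    """Replace every character listed in `chars` with its numeric HTML entity.

    Hypothetical helper for illustration; which characters actually need
    escaping is an assumption (the thread only reports the pipe).
    """
    # Map each code point to a decimal character reference, e.g. "|" -> "&#124;".
    table = {ord(ch): f"&#{ord(ch)};" for ch in chars}
    return text.translate(table)


# Made-up sample text containing a pipe between two Latin-script fragments.
raw = "Email: customerservice@axismf.com | Site: www.sebi.gov.in"
safe = to_numeric_entities(raw)
print(safe)

# Assumed usage with an existing PyMuPDF page and rectangle (not run here):
# page.insert_htmlbox(rect, safe, css="body {font-size:7pt;}")
```

Passing a different chars argument escapes other characters the same way, so the replacement stays in one place if more problem characters turn up.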

JorjMcKie commented 1 week ago

I must confess that I am confused. Are you reporting an issue or not? In the end you seem to say that you found the correct way to achieve what you want.

In any case, I see no problem caused by either PyMuPDF or MuPDF. At best, this is a Discussions item; therefore, I am going to transfer this.