hoehermann / pypdf_strreplace

Search and replace text in PDF files with PyPDF.

Encoding / different language fails #4

Open guyromb opened 3 months ago

guyromb commented 3 months ago

Hello,

Replacing an English string with another English string currently works. However, I am trying to replace an English string with text in another language (e.g. Hebrew or Greek).

But then it fails:

  File "/Users/x/projects/pypdf_strreplace/pypdf_strreplace.py", line 287, in <module>
    total_replacements += replace_text(contents, charmaps, args.search, args.replace)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/projects/pypdf_strreplace/pypdf_strreplace.py", line 238, in replace_text
    text_maps = [op.get_text_map(charmaps) for op in operations]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/projects/pypdf_strreplace/pypdf_strreplace.py", line 165, in get_text_map
    return [MappedOperand(self, self.operands[0], charmaps[self.context.font].decode(self.operands[0]))]
                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/projects/pypdf_strreplace/pypdf_strreplace.py", line 38, in decode
    return "".join(text.decode(self.encoding).translate(str.maketrans(self.map)))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: decode() argument 'encoding' must be str, not dict

Any idea how to solve it? Thank you

Command: python3 pypdf_strreplace.py --input pdfs/test.pdf --search "hello" --replace "שלום" --output out.pdf
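The last frame of the traceback shows `bytes.decode` receiving a dict where it expects a codec name. A minimal sketch of a decoder that accepts both forms follows. It assumes a dict encoding maps byte values to characters; the function name `decode_operand` and the sample table are hypothetical and not taken from the tool.

```python
from typing import Dict, Union


def decode_operand(data: bytes, encoding: Union[str, Dict[int, str]]) -> str:
    """Decode raw string bytes using either a codec name or a byte table."""
    if isinstance(encoding, str):
        # A codec name such as "latin-1" can go straight to bytes.decode.
        return data.decode(encoding)
    # Assumption: a dict encoding maps each byte value to one character.
    # Unknown bytes become U+FFFD so that gaps in the table stay visible.
    return "".join(encoding.get(byte, "\ufffd") for byte in data)


# Hypothetical single-byte table covering only the letters used here.
table = {0x68: "h", 0x65: "e", 0x6C: "l", 0x6F: "o"}
print(decode_operand(b"hello", table))
print(decode_operand(b"hello", "latin-1"))
```

Branching on the type keeps the codec path unchanged and confines the table lookup to the case that raised the `TypeError`.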

hoehermann commented 3 months ago

PDF features a lot of different methods to represent text. This tool is limited to only a small number of those methods. I implement them as I need them. It looks like I did not cover whatever your file demands. Can you send me a sample document?

It is entirely possible that what you want cannot be done with this tool. As far as I know, Hebrew is written right to left. That is hard enough with plain text already, and I have no idea how PDF handles it. 😅

Please also keep in mind that for most documents (all those with sub-setted fonts), the glyphs you want to insert must already be present in the document. This tool cannot import glyphs from your system into the document. 😕
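Whether a font is sub-setted can usually be read off its name: the PDF specification prefixes subset fonts with a tag of six uppercase letters and a plus sign. A minimal sketch of that check, plus a test for missing glyphs, is given below. The helper names and the sample font names are hypothetical, and the set of available characters would have to come from the font's character map.

```python
import re
from typing import Set

# Subset fonts carry a six-letter uppercase tag and a plus sign before the
# real name, e.g. "/ABCDEF+ArialMT" (the tag letters are arbitrary).
SUBSET_TAG = re.compile(r"^/?[A-Z]{6}\+")


def is_subset_font(base_font: str) -> bool:
    """Return True if the font name carries a subset tag."""
    return SUBSET_TAG.match(base_font) is not None


def missing_characters(replacement: str, available: Set[str]) -> Set[str]:
    """Return the characters of the replacement the font cannot show."""
    return set(replacement) - available


print(is_subset_font("/ABCDEF+ArialMT"))
print(is_subset_font("/ArialMT"))
print(sorted(missing_characters("guy", set("koby"))))
```

In a real document the name would be the `/BaseFont` entry of a font dictionary. A non-empty result from `missing_characters` means the replacement needs glyphs the embedded subset does not contain.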

guyromb commented 3 months ago

Thank you for the quick reply.

Here is an example: https://drive.google.com/file/d/1mA4F4i0NDsP9FEgmvKDYmNs3Nbl-cvlu/view?usp=sharing

Command: python3 pypdf_strreplace.py --input pdfs/test.pdf --search "koby" --replace "קובי" --output out.pdf

I added a-z א-ת 0-9 at the bottom of the PDF page, so all of those should be recognised.

hoehermann commented 3 months ago

Thank you for providing the sample document. I was able to fix the decode issue, but that does not help much for your particular goal. The problem is that "koby" is set with the font ArialMT, while the Hebrew letters are set with MyriadHebrew-Regular. Although it is definitely possible to switch fonts when replacing, this is not implemented in this tool, and I do not expect to have enough spare time to implement it anytime soon. 🙁

In case you are in control of how the PDF document is designed, maybe you can select one font that covers both the Latin and the Hebrew letters. DejaVu perhaps? Then this tool has a chance of working. 🙂 If your design software allows it, disable "font subsetting" – that might help, too.

The font Assistant-SemiBold may stay since the text in the logo is not going to be manipulated.
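The font mismatch can be expressed as a coverage check: a replacement only has a chance if a single font covers both the search text and the replacement text. Below is a minimal sketch under that assumption. The function name `fonts_covering` and the character sets are hypothetical stand-ins for what the fonts' character maps would provide.

```python
from typing import Dict, List, Set


def fonts_covering(text: str, font_chars: Dict[str, Set[str]]) -> List[str]:
    """Return the names of all fonts that contain every character of text."""
    needed = set(text)
    return sorted(name for name, chars in font_chars.items() if needed <= chars)


# Hypothetical coverage: one Latin-only font and one Hebrew-only font.
font_chars = {
    "ArialMT": set("abcdefghijklmnopqrstuvwxyz0123456789"),
    # U+05D0 (alef) through U+05EA (tav), including final forms.
    "MyriadHebrew-Regular": {chr(code) for code in range(0x05D0, 0x05EB)},
}

search, replace = "koby", "קובי"
print(fonts_covering(search, font_chars))
print(fonts_covering(replace, font_chars))
print(fonts_covering(search + replace, font_chars))
```

With this sample data each string is covered by exactly one font and the combined string by none, which is why a single font covering both scripts is suggested.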

guyromb commented 3 months ago

Thank you so much! I'll give it a try :)

guyromb commented 2 months ago

Hey @hoehermann, I tried as you instructed. However, regardless of the font, I am getting this error:

python3 pypdf_strreplace.py --input pdfs/koby_levi2.pdf --search "koby" --replace "guy" --output outkoby.pdf
b'k'
Traceback (most recent call last):
  File "/Users/guyromb/projects/pypdf_strreplace/pypdf_strreplace.py", line 291, in <module>
    total_replacements += replace_text(contents, charmaps, args.search, args.replace)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guyromb/projects/pypdf_strreplace/pypdf_strreplace.py", line 239, in replace_text
    text_maps = [op.get_text_map(charmaps) for op in operations]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guyromb/projects/pypdf_strreplace/pypdf_strreplace.py", line 140, in get_text_map
    map.append(MappedOperand(self, operand, charmaps[self.context.font].decode(operand)))
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guyromb/projects/pypdf_strreplace/pypdf_strreplace.py", line 39, in decode
    return "".join(text.get_original_bytes().decode(self.encoding).translate(str.maketrans(self.map)))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/encodings/utf_16_be.py", line 16, in decode
    return codecs.utf_16_be_decode(input, errors, True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-16-be' codec can't decode byte 0x6b in position 0: truncated data
decoding with 'utf-16-be' codec failed

The file I am using: https://drive.google.com/file/d/1-8fZtBeLf-VFFZL45AN329AlOXo_ZvYP/view?usp=sharing
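The final error can be reproduced in isolation: the operand printed before the traceback is the single byte `b'k'`, while UTF-16-BE needs two bytes per code unit. The sketch below (the helper name `try_decode` is hypothetical) shows the difference. It hints that the string is stored with one byte per character while the character map declares a two-byte encoding, though the actual cause in this file may differ.

```python
def try_decode(data: bytes, encoding: str) -> str:
    """Decode data, returning a short failure note instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as error:
        return f"failed: {error.reason}"


# One byte cannot form a 16-bit code unit, so decoding fails.
print(try_decode(b"k", "utf-16-be"))
# Two bytes (0x00 0x6b) form one code unit and decode to "k".
print(try_decode(b"\x00k", "utf-16-be"))
```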

hoehermann commented 2 months ago

I put a couple of hours into this. Unfortunately, I could not get it to work. Eventually, I was able to read the text in https://github.com/hoehermann/pypdf_strreplace/tree/debug-gui, but I have no idea how to write it for your particular file. I am sorry.

guyromb commented 2 months ago

Would it be easier to support only Hebrew-to-Hebrew replacement?

hoehermann commented 2 months ago

Maybe, but I am really not sure.

I have an idea though. PDF is a file format that is notoriously hard to edit. It is specifically made not to be edited, but only printed. Perhaps we should re-think the overall approach to your goal. I assume you want to automate some process that generates badges. Instead of replacing the text on an existing PDF, you could generate the PDF so it has the correct text in the first place.

I attached a koby_levi.zip with a text-less variant of your badge design (sorry, I do not have an Adobe Illustrator license and used Inkscape) and a .tex file that is meant to be used with XeTeX (https://en.wikipedia.org/wiki/XeTeX). The .tex file can be edited with a text editor and can contain Latin and Hebrew text. The numbers designate where the text should be put on the background. It will probably take some time to get the positions right, but once that is done, the text should be easy to maintain. :)
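Generating the document from a template could be sketched as follows. This is only an illustration under stated assumptions: the template is a hypothetical stand-in and not the attached .tex file, the background graphic is omitted, and the font name, block width, and coordinates are made up. Right-to-left typesetting typically needs an additional package such as polyglossia or bidi, which is also omitted here.

```python
from string import Template

# Hypothetical XeLaTeX template; ${x_mm} and ${y_mm} place the text block.
BADGE_TEMPLATE = Template(r"""\documentclass{article}
\usepackage{fontspec}
\usepackage[absolute,overlay]{textpos}
\setmainfont{DejaVu Sans}
\pagestyle{empty}
\begin{document}
\begin{textblock*}{80mm}(${x_mm}mm,${y_mm}mm)
$name
\end{textblock*}
\null
\end{document}
""")

# Characters that must be escaped before they are placed in TeX source.
TEX_SPECIALS = {"&": r"\&", "%": r"\%", "#": r"\#", "_": r"\_",
                "$": r"\$", "{": r"\{", "}": r"\}"}


def render_badge(name: str, x_mm: int, y_mm: int) -> str:
    """Fill the template with an escaped name and a position in millimetres."""
    safe_name = "".join(TEX_SPECIALS.get(ch, ch) for ch in name)
    return BADGE_TEMPLATE.substitute(name=safe_name, x_mm=x_mm, y_mm=y_mm)


print(render_badge("Koby & Levi", 20, 40))
```

The result could be written to a file such as `badge.tex` and compiled with `xelatex badge.tex`, once per badge.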

I hope this helps.

guyromb commented 2 months ago

The above solution is a bit tricky. Let me send you an email with some more private details, and we can see if it's a match.