reingart / pyfpdf_googlecode

Automatically exported from code.google.com/p/pyfpdf
GNU Lesser General Public License v3.0

Adobe Acrobat does not show one special utf-8 character. Other PDF-viewers do it well. #41

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
1. Create a PDF with the attached file 'unicodeSimple.py' (this file is UTF-8 
encoded).
2. Open the created unicode-simple.pdf (also attached) with different 
PDF viewers.

All of these Adobe products show U+010A (Ċ) instead of U+010D (č):
- Adobe Reader X on Windows 7
- Adobe Reader XI on Windows 7
- Adobe Acrobat Pro X on Windows 7
- Adobe Acrobat Reader 5.0 on HP-UX

All of these products show U+010D as expected:
- Google Chrome 24.0 on Windows 7
- Foxit Reader 5.4 on Windows 7
- Evince on Linux
- GIMP 2.6 import on Linux

As far as I can see, every other Unicode character I tried (Latin-1 Supplement, 
Latin Extended-A) is shown as expected by all viewers.
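
A quick way to check that observation, assuming (as the later comments in this thread conclude) that the trigger is a raw carriage-return byte 0x0D inside the UTF-16BE encoded text. The sketch scans both Unicode blocks for code points whose encoding contains that byte. The helper name `codepoints_with_cr` is hypothetical and not part of pyfpdf, and the code targets Python 3 rather than the Python 2.7 used in the report.

```python
def codepoints_with_cr(first, last):
    """Return code points in [first, last] whose UTF-16BE bytes contain 0x0D."""
    hits = []
    for cp in range(first, last + 1):
        if b"\r" in chr(cp).encode("utf-16-be"):
            hits.append(cp)
    return hits


# Latin-1 Supplement (U+0080..U+00FF) and Latin Extended-A (U+0100..U+017F)
affected = codepoints_with_cr(0x0080, 0x017F)
print(["U+%04X" % cp for cp in affected])  # ['U+010D']
```

Within these two blocks only U+010D contains the byte, which is consistent with it being the single character that Adobe products render differently.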

I use pyfpdf 1.7 with Python 2.7 on Windows 7.
I got the same results with these fonts:
- DejaVuSansCondensed
- GNU FreeFont
- MS Arial 5.1

I checked the PDF file with the Acrobat Pro X preflight tool and did not find 
any problems or warnings.

I don't know: is it a bug in Acrobat or in pyfpdf?

Original issue reported on code.google.com by edwin.ce...@liebherr.com on 17 Jan 2013 at 3:10

GoogleCodeExporter commented 9 years ago
Problem solved.

Instead of 
    ...
    def _escape(self, s):
        #Add \ before \, ( and )
        return s.replace('\\','\\\\').replace(')','\\)').replace('(','\\(')
    ...

use
    ...
    def _escape(self, s):
        # Add \ before \, ( and ), and write a raw CR as the escape sequence \r
        return s.replace('\\', '\\\\').replace(')', '\\)').replace('(', '\\(').replace('\r', '\\r')
    ...

You may also compare this with the function _escape() in tFPDF 
(http://fpdf.org/en/script/script92.php).

Would you be so kind as to check this issue? Perhaps this modification could be 
merged into the sources.
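
To illustrate what the additional replace() changes at the byte level, here is a minimal standalone sketch. It assumes Python 3 and works on bytes, whereas the pyfpdf 1.7 method above works on Python 2 strings. The function name `escape_pdf_literal` is made up for this example.

```python
def escape_pdf_literal(data):
    """Escape bytes for use inside a PDF literal string: ( ... )."""
    return (
        data.replace(b"\\", b"\\\\")   # backslash first, so later escapes survive
        .replace(b")", b"\\)")
        .replace(b"(", b"\\(")
        .replace(b"\r", b"\\r")        # the proposed addition
    )


raw = "\u010d".encode("utf-16-be")     # U+010D -> bytes 0x01 0x0D
escaped = escape_pdf_literal(raw)
print(raw.hex(), escaped.hex())        # 010d 015c72
```

After escaping, the byte 0x0D no longer appears in the file. It is written as the two characters `\r`, which a reader converts back to 0x0D when it parses the string.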

Original comment by edwin.ce...@liebherr.com on 22 Jan 2013 at 1:45

GoogleCodeExporter commented 9 years ago
Are you sure that the \x0d\x0a pair in the source text leads to this issue?

Does anybody know a PDF reader for Linux to test this issue (other than Adobe 
Reader)?

Original comment by romiq...@gmail.com on 22 Jan 2013 at 2:00

GoogleCodeExporter commented 9 years ago
1. "Are you sure that the \x0d\x0a pair in the source text leads to this issue?"
Yes, really sure. You may create a PDF with only these symbols:
       """
       U+010A -- Ċ -- Latin Capital Letter C with dot above
       U+010D -- č -- Latin Small Letter C with caron
       U+010E -- Ď -- Latin Capital Letter D with caron
       """
Do it once with pyfpdf and once with tfpdf (under PHP). Compare the resulting 
PDF documents with a hex diff tool (I used gvim on Windows). I recommend 
turning compression off.

Then you may compare the sources of tfpdf's function _escape() to pyfpdf's def 
_escape().

Additionally you may search fpdf.py for "txt2 = self._escape(UTF8ToUTF16BE(txt, 
False))" and log the text before and after UTF8ToUTF16BE and after 
self._escape. You may do this once in pyfpdf and once in tfpdf.

I checked the new pdf (with \\r instead of \r) 
on Windows with 
    - Adobe Acrobat Reader XI, 
    - FoxIt Reader 5.4, 
    - Google Chrome, 
on Linux with
    - evince, 
    - gimp import,
    - xpdf
on HP-UX with
    - Adobe Acrobat Reader 5.0,
and on Android with
    - Polaris office
All readers show symbol U+010D as expected.

2. "Does anybody know a PDF reader for Linux to test this issue (other than 
Adobe Reader)?"
I used xpdf and evince.

Original comment by edwin.ce...@liebherr.com on 22 Jan 2013 at 3:25

GoogleCodeExporter commented 9 years ago
Edwin, I am still not sure whether escaping is the right way. 

But the references [1] (both 1.3 and 1.7) don't specify how to properly 
integrate a Unicode string literal into a text object (i.e. how to mix a 7-bit 
string with UTF-16BE).
The reference specifies a BOM only for a whole text string, not for a single 
text literal.

Reference 1.3 also complains about encrypted 8-bit literals, but who cares.

Please test this patch (a small refactoring plus the proposed change).

1. http://www.adobe.com/devnet/pdf/pdf_reference_archive.html

Original comment by romiq...@gmail.com on 23 Jan 2013 at 7:22

GoogleCodeExporter commented 9 years ago
Hello,

Unfortunately I worked on this issue not by forward engineering but by 
reverse engineering (just to get a solution in an acceptable amount of time). 
Therefore I did not check the 700-page 1.3 spec beforehand.

I generated the document with Python 2.7, pyfpdf 1.7 and your patch. Symbol 
U+010D (and also all symbols from Latin-1 Supplement and Latin Extended-A) is 
shown as expected with these viewers:

on Windows 7 with 
    - Adobe Acrobat Reader XI,
    - Adobe Acrobat Pro X,
    - FoxIt Reader 5.4,
    - Google Chrome,
on Linux with
    - evince,
    - gimp import,
    - xpdf,
on HP-UX with
    - Adobe Acrobat Reader 5.0,

Original comment by edwin.ce...@liebherr.com on 23 Jan 2013 at 11:10

GoogleCodeExporter commented 9 years ago
Edwin, the 1.7 reference is about 1300 pages, by the way :) I also haven't read 
the full spec carefully. Yet.

I tested this patch with the DejaVuSansCondensed, DroidSans and Ubuntu-R fonts in
 * evince (poppler library)
 * atril (evince fork)
 * Adobe Reader for android (This app also show U+010A instead of U+010D)
It looks good.

If there are no objections, I'll merge this patch.

Original comment by romiq...@gmail.com on 23 Jan 2013 at 11:27

GoogleCodeExporter commented 9 years ago
I appreciate it.

Thank you!

Original comment by edwin.ce...@liebherr.com on 23 Jan 2013 at 12:25

GoogleCodeExporter commented 9 years ago
Committed.

Edwin, thank you for the report and the proposed solution.

Original comment by romiq...@gmail.com on 25 Jan 2013 at 5:30