reingart / pyfpdf

Simple PDF generation for Python (FPDF PHP port)
https://code.google.com/p/pyfpdf/
GNU Lesser General Public License v3.0

Trying to add non-ASCII text, but getting error message about encoding. #86

Open sven-oly opened 7 years ago

sven-oly commented 7 years ago

UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201d' in position 10: ordinal not in range(256)

alexanderankin commented 7 years ago

try checking out the tests for a whole bunch of weird characters, or hello world in many languages.

maybe you could share a snippet of some code?

epalm commented 7 years ago

I'm seeing this as well:

UnicodeEncodeError 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)

The exception is raised in the cell function: pdf.cell(txt=name) # where name is José

Without the "é" works fine, of course: pdf.cell(txt=name) # where name is Jose

The django debug error page shows the name variable as: u'Jos\xe9'

epalm commented 7 years ago

I can get around the problem by doing the following:

import unicodedata

def unicode_normalize(s):
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

pdf.cell(txt=unicode_normalize(name)) # where name is José

# The approximate ascii "Jose" is printed on the PDF

Not ideal, as we're losing accents, but at least (a) it doesn't crash, and (b) we see something resembling the string we want.
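
For readers on Python 3, a minimal sketch of the same workaround follows. It assumes the input is a `str`, and the helper name `to_ascii` is made up for this illustration. Characters that have no ASCII base letter (such as the curly quote from the first error message) are dropped entirely, so the approach stays lossy.

```python
import unicodedata


def to_ascii(text):
    """Approximate text in ASCII: decompose accented letters, drop the rest."""
    decomposed = unicodedata.normalize("NFKD", text)
    # encode() returns bytes in Python 3, so decode back to str afterwards
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("José"))           # the accent is stripped
print(to_ascii("say \u201dhi"))   # the curly quote has no ASCII base and vanishes
```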

What is The Right Way to handle this?

Edit: By the way, if I use name.encode('utf-8'), no exception is raised, but "JosÃ©" is printed on the PDF.
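
One plausible explanation for garbled output after `name.encode('utf-8')`, sketched in Python 3 with only the standard library: UTF-8 stores "é" as two bytes, and a consumer that treats every byte as one Latin-1 character shows two characters instead of one. That the library reads the bytes as Latin-1 is an assumption here, not something the thread has confirmed at this point.

```python
name = "José"
utf8_bytes = name.encode("utf-8")        # 'é' becomes the two bytes 0xC3 0xA9

# Assumption: the consumer maps each byte to one Latin-1 character.
garbled = utf8_bytes.decode("latin-1")

print(repr(utf8_bytes))
print(len(name), len(garbled))           # the garbled text is one character longer
```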

alexanderankin commented 7 years ago

seems to work fine for me on version 2? lmk if this version works.

epalm commented 7 years ago

Ah I'm using 1.7.2. When will 2.0.0 be available on pypi?

alexanderankin commented 7 years ago

check link

alexanderankin commented 7 years ago

Oh, so I just looked at the file produced and it doesn't look like the character is actually making it through. I'll make some test cases and see how this works. I haven't spent time on this project in a while, so while that means it's easy for me to switch gears from ttf and image management, I'm also a bit busy with other things.

RomanKharin commented 7 years ago

epalm, currently we accept string for the py3 version and string/unicode for the py2 version. It shouldn't 'eat' utf-8 encoded sequences. Can you provide a stripped-down version of this problem?

There is only a roadmap for a massive update. The current policy is full compatibility.

epalm commented 7 years ago

I'm not exactly sure how to reproduce this. When the variable comes from my database (postgres) and application (django), I get the above UnicodeEncodeError.

When I do it in a shell, I get an AttributeError:

Python 2.7.12 (v2.7.12:d33e0cf91556, Jun 27 2016, 15:19:22) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import fpdf
>>> fpdf.FPDF_VERSION
'1.7.2'
>>> from fpdf import FPDF
>>> pdf = FPDF(format='letter')
>>> pdf.add_page()
>>> pdf.cell(0, txt='José') # same result with u'José' if that matters
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\project\lib\site-packages\fpdf\fpdf.py", line 150, in wrapper
    return fn(self, *args, **kwargs)
  File "C:\project\lib\site-packages\fpdf\fpdf.py", line 685, in cell
    txt = self.normalize_text(txt)
  File "C:\project\lib\site-packages\fpdf\fpdf.py", line 1099, in normalize_text
    if self.unifontsubset and isinstance(txt, str) and not PY3K:
AttributeError: 'FPDF' object has no attribute 'unifontsubset'
>>>

RomanKharin commented 7 years ago

Well, this is a programming error. We should add a more descriptive error, though. unifontsubset is not assigned until set_font is called.

alexanderankin commented 7 years ago

Would it be a good idea to initialize some defaults as part of constructor?
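
A minimal sketch of that idea, using hypothetical stand-in classes rather than the real FPDF code: `LazyDoc` mimics an attribute that only appears once `set_font` runs, while `EagerDoc` assigns defaults in the constructor and can therefore raise a descriptive error. All class names and the error text are invented for illustration.

```python
class LazyDoc:
    """Hypothetical stand-in (not FPDF): attributes appear only in set_font."""

    def set_font(self, family):
        self.font_family = family
        self.unifontsubset = False

    def cell(self, txt):
        # Raises AttributeError if set_font was never called
        if self.unifontsubset:
            return txt.encode("utf-8")
        return txt


class EagerDoc(LazyDoc):
    """Same stand-in, but with defaults assigned in the constructor."""

    def __init__(self):
        self.font_family = None
        self.unifontsubset = False

    def cell(self, txt):
        if self.font_family is None:
            raise RuntimeError("No font set: call set_font() before cell()")
        return super().cell(txt)


for doc_class in (LazyDoc, EagerDoc):
    try:
        doc_class().cell("José")
    except (AttributeError, RuntimeError) as exc:
        print(doc_class.__name__, type(exc).__name__, exc)
```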

epalm commented 7 years ago

Oops, sorry, forgot to initialize a font.

Python 2.7.12 (v2.7.12:d33e0cf91556, Jun 27 2016, 15:19:22) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import fpdf
>>> fpdf.FPDF_VERSION
'1.7.2'
>>> from fpdf import FPDF
>>> pdf = FPDF(format='letter')
>>> pdf.add_page()
>>> pdf.set_font('Arial', 'B', 14)
>>> pdf.cell(0, txt=u'José')
>>> pdf.output(name='file.pdf')

produces (this is expected): [screenshot: 2017-06-16 15_18_55-file pdf - microsoft edge]

However, if I use 'José' instead of u'José', I get: [screenshot: 2017-06-16 15_21_40-microsoft edge]

I'm still not sure how to trigger UnicodeEncodeError from a shell.

RomanKharin commented 7 years ago

OK, note to all: this is an example of a good report. In the meantime, let me reread pdf_reference_1-7.pdf.

RomanKharin commented 7 years ago

Eric, it seems to be due to an encoding issue. On Linux, Python 2.7, UTF-8:

>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> repr(u'José')
"u'Jos\\xe9'"
>>> repr('José')
"'Jos\\xc3\\xa9'"

I have no Windows access for now, but can you test the same?

epalm commented 7 years ago

Sure:

Python 2.7.11 (v2.7.11:6d1b6a68f775, Dec  5 2015, 20:32:19) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> repr(u'José')
"u'Jos\\xe9'"
>>> repr('José')
"'Jos\\x82'"
>>>

RomanKharin commented 7 years ago

That is, you are effectively sending bytes in cp437 (é is 0x82 there; its Unicode code point is U+00E9), but pyfpdf interprets them as win-1251 (default), where 0x82 is '‚' (U+201A). We have already added a setting for this case: pdf.set_doc_option("core_fonts_encoding", 'cp437'). But it's really simpler to use u"" strings (sounds lazy, but it's 2017, so maybe py3), and do not forget that the final code page for a non-Unicode font is "WinAnsiEncoding", i.e. it lacks some diacritics.
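
The byte-level mismatch can be reproduced with the standard library alone. This is a minimal Python 3 sketch, not pyfpdf's own code path: it encodes "é" the way a cp437 console would and then decodes the resulting byte with a few Windows code pages. cp1252 is included because WinAnsiEncoding is based on it.

```python
raw = "é".encode("cp437")    # the single byte a cp437 console produces for 'é'
print(raw)

decoded = {}
for codec in ("cp437", "cp1252", "cp1251"):
    ch = raw.decode(codec)
    decoded[codec] = ch
    # Show the code point each code page assigns to that byte
    print(codec, "U+%04X" % ord(ch))
```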

epalm commented 7 years ago

Sorry, I'm not sure what your conclusion is. I'm using python 2 on ubuntu 14.04 in production. Should I be calling pdf.set_doc_option in my code, with params that depend on my environment?

RomanKharin commented 7 years ago

We found out why the code behaves this way from the console, but we still have too few clues about your production environment. Currently I can recreate this variant: Django or some middleware returns 'José' as UTF-8 bytes, then .encode("latin-1") in pyfpdf gives this error. Is this correct?

tejasj654 commented 7 years ago

I am not really sure whether this is related. I started facing an error in this line p = self.pages[n].encode("latin1") if PY3K else self.pages[n]

The error was because I was trying to insert a sign. I didn't search through the code to figure out why utf-8 was not used here. My solution was to replace latin1 with windows-1252.

Latin-1 is basically equivalent to ascii I guess. Any character above the usual 7bit (128) starts to throw errors. windows-1252 is a little better with support till 159. There was no other workaround in my source code to support this character. Let me know if there were any.

I do not think this would break anything in fPDF. So can anyone add it to source?
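
A quick standard-library check of what each codec accepts may help here. Latin-1 maps code points 0 to 255 one-to-one, so it covers more than ASCII; cp1252 differs mainly in assigning printable characters (such as "€" and the en dash) to bytes 0x80-0x9F. The helper name `can_encode` is invented for this sketch.

```python
def can_encode(ch, codec):
    """Return True if the codec can represent the character."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


codecs_to_try = ("ascii", "latin-1", "cp1252")
for ch in ("é", "€", "\u2013", "ō"):
    support = {codec: can_encode(ch, codec) for codec in codecs_to_try}
    print("U+%04X" % ord(ch), support)
```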

keldrom commented 5 years ago

@openskullbox I've tried your solution but I've encountered some issues. Simply substituting windows-1252 for latin1 produces a decoding error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 28: character maps to undefined

Does anyone have a solution to this problem? I'd like to print some Greek characters.
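
A decoding error like this is consistent with how Python's cp1252 codec behaves: a handful of byte values have no character assigned, whereas Latin-1 decodes every byte. The minimal sketch below lists them, and also shows that Greek letters cannot be encoded in cp1252 at all, which suggests a Unicode TrueType font is needed for that use case.

```python
undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(hex(value))
print("bytes cp1252 cannot decode:", undefined)

try:
    "αβγ".encode("cp1252")
    greek_ok = True
except UnicodeEncodeError:
    greek_ok = False
print("Greek encodable in cp1252:", greek_ok)
```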

sajjadafridi commented 5 years ago

call set_font after adding

# Raw string so the backslashes in the Windows path are not treated as escapes
pdf.add_font('kalpurush', '', r"C:\Windows\Fonts\kalpurush.ttf", uni=True)
pdf.set_font('kalpurush', '', 14)

jpenaloza1211 commented 4 years ago

Hello, have the encoding errors been fixed for python 3.7+ ? I am using the most recent, fpdf 2.0.3. At least I think that is the most recent. And I keep getting errors with characters like the unicode dash (\u2013)

alexanderankin commented 4 years ago

So I have not really been able to get any Unicode tests written, because it is unclear and undocumented (aside from official font standards/Adobe PDF documentation) what the cmap is and what it should be assigned to, nor have I been able to find any font libraries for Python which have intuitive docs or address this use case.

ohidurbappy commented 4 years ago

Solved it as shown by @sajjadafridi
Downloaded the Arial Unicode Regular font and declared this:

pdf.add_font('ArialUnicode',fname='Arial-Unicode-Regular.ttf',uni=True)
pdf.set_font('ArialUnicode', '', 11)

Zawszy commented 4 years ago

@ohidurbappy where did you find the file for Arial Unicode Regular? I can only find the font for the MS version and it doesn't seem to work.

ohidurbappy commented 4 years ago

Arial-Unicode-Regular.zip

@Zawszy I can't remember the link. Just attaching the file, in case you need it.

pampam07 commented 3 years ago

Hi! There has been a recent change on using set_doc_option. It was deprecated a few days ago. You can check the release notes here: [https://github.com/PyFPDF/fpdf2/releases]

Now, without the set_doc_option, it says "the FPDF.set_doc_option() method is deprecated in favour of just setting the core_fonts_encoding property on an instance of FPDF."

I'm not sure what it means to an instance of FPDF but when I did the following: pdf = FPDF('P', 'mm', 'Legal').core_fonts_encoding('utf-8')

it produced an error: AttributeError: 'FPDF' object has no attribute 'core_fonts_encoding' Am I doing it wrong? Thanks, appreciate who can help.
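
The deprecation message appears to describe attribute assignment rather than a method call. Below is a minimal sketch using a hypothetical stand-in class, not the real FPDF (whose attribute may not exist in every installed version, which would explain the AttributeError). Note also that a chained call such as `FPDF(...).some_method(...)` would bind `pdf` to the method's return value instead of the instance, so the instance is kept in a variable first.

```python
class Doc:
    """Hypothetical stand-in for a PDF document class; NOT the real FPDF."""

    def __init__(self, orientation="P", unit="mm", format="A4"):
        # A plain attribute holding a codec name, not a method
        self.core_fonts_encoding = "latin-1"


pdf = Doc("P", "mm", "Legal")        # keep a reference to the instance first
pdf.core_fonts_encoding = "cp437"    # then assign the attribute

try:
    Doc().core_fonts_encoding("cp437")   # calling the attribute fails
except TypeError as exc:
    call_error = str(exc)
    print(call_error)
```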

vade commented 3 years ago

Have the same issue

file "pdf_report.py", line 13, in <module>
    pdf = FPDF(orientation = 'L', unit = 'mm', format = 'A4').core_fonts_encoding('utf-8')
AttributeError: 'FPDF' object has no attribute 'core_fonts_encoding'

Lucas-C commented 3 years ago

@pampam07 @vade PyFPDF is not maintained anymore, you may want to check PyFPDF/fpdf2 as a successor, with a 99%-compatible API

vade commented 3 years ago

Ah, good to know. Thanks. FWIW, I realized I wasn't defining / setting a font. Doing that appears to have solved my issue(s).

msredovic319rn commented 2 years ago

Solved it as shown by @sajjadafridi. Downloaded the Arial Unicode Regular font and declared this:

pdf.add_font('ArialUnicode',fname='Arial-Unicode-Regular.ttf',uni=True)
pdf.set_font('ArialUnicode', '', 11)

Thank you

lincolnneu commented 2 years ago

@Lucas-C does fpdf2 solve the encoding issue mentioned in this thread?

Lucas-C commented 2 years ago

I'm pretty sure yes 😊

The following works well with a source code file encoded as utf8:

#!/usr/bin/env python3
import fpdf
pdf = fpdf.FPDF()
pdf.add_page()
pdf.set_font("Helvetica", size=15)
pdf.cell(txt="José")
pdf.output("issue_86.pdf")

Same with the following, put in a source code file encoded as latin-1 ( ISO 8859-1):

#!/usr/bin/env python3
# -*- coding: latin-1 -*-
import fpdf
pdf = fpdf.FPDF()
pdf.add_page()
pdf.set_font("Helvetica", size=15)
pdf.cell(txt="José")
pdf.output("issue_86.pdf")

cseberino commented 2 years ago

Lucas-C's code works great with fpdf2 until you try different Unicode characters. If you replace "José" with "Joséō" you get the error below.

#!/usr/bin/env python3
import fpdf
pdf = fpdf.FPDF()
pdf.add_page()
pdf.set_font("Helvetica", size=15)
pdf.cell(txt="Joséō")
pdf.output("issue_86.pdf")

UnicodeEncodeError: 'latin-1' codec can't encode character '\u014d' in position 4: ordinal not in range(256)

Is there a solution that works for arbitrary Unicode characters?
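
When a Unicode TrueType font is not an option, one possibility is to check text before it reaches `cell`. This is a minimal sketch, assuming the core-font path encodes to Latin-1 as the traceback above suggests; the helper name `unencodable` is invented for this example and is not part of fpdf2.

```python
def unencodable(text, encoding="latin-1"):
    """Return the distinct characters of text that the encoding cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad


print(unencodable("Joséō"))
print(unencodable("2019\u20132020"))

# Lossy fallback: substitute '?' for anything Latin-1 cannot hold
fallback = "Joséō".encode("latin-1", "replace").decode("latin-1")
print(fallback)
```

The check only reports problem characters; deciding whether to replace them, transliterate them, or switch fonts is left to the caller.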

cseberino commented 2 years ago

Adding the Arial Unicode font works in the meantime.

Lucas-C commented 2 years ago

OK, thank you @cseberino. I reported the issue here for fpdf2: https://github.com/PyFPDF/fpdf2/issues/330