bullet in <ul> incompatible with utf-8 encoding

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. With web2py, the setup is similar to issue #65 that I reported earlier.
2. Ensure that the html passed to write_html(html) from web2py's views/generic.pdf 
is a UTF-8 string that includes <UL> tags.

What is the expected output? What do you see instead?
Expected: the pdf view renders. Instead I get a web2py error traceback, which I 
explain further down. Traceback here:

<type 'exceptions.UnicodeDecodeError'> 'utf8' codec can't decode byte 0x95 in position 5: invalid start byte

Traceback (most recent call last):
  File "...\gluon\restricted.py", line 217, in restricted
    exec ccode in environment
  File "...\[app]\views\generic.pdf", line 13, in <module>
    # note the pyfpdf instead of pdf in the function call
  File "...\[app]\modules\util.py", line 498, in pyfpdf_from_html
    pdf.write_html(html, image_map=image_map)
  File "...\gluon\contrib\fpdf\html.py", line 397, in write_html
    h2p.feed(text)
  File "...\python\lib\HTMLParser.py", line 114, in feed
    self.goahead(0)
  File "...\python\lib\HTMLParser.py", line 158, in goahead
    k = self.parse_starttag(i)
  File "...\python\lib\HTMLParser.py", line 324, in parse_starttag
    self.handle_starttag(tag, attrs)
  File "...\gluon\contrib\fpdf\html.py", line 213, in handle_starttag
    self.pdf.write(self.h,'%s%s ' % (' '*5*self.indent, bullet))
  File "...\gluon\contrib\fpdf\fpdf.py", line 837, in write
    txt = self.normalize_text(txt)
  File "...\gluon\contrib\fpdf\fpdf.py", line 1043, in normalize_text
    txt = txt.decode('utf8')
  File "...\python\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x95 in position 5: invalid start byte

What version of the product are you using? On what operating system?
version 1.7.1 bundled with web2py 2.6.4-stable+timestamp.2013.09.27.13.07.13 
(Rocket 1.2.6, Python 2.7.5) running on Windows XP Pro SP3 

Please provide any additional information below.
I identified the root cause in gluon/contrib/fpdf/html.py, line 201:

        if tag=='ul':
            self.indent+=1
            self.bullet.append('\x95')         # line 201

'\x95' triggers the utf8 codec error: 0x95 is a continuation byte in UTF-8, so 
it cannot start a character. As a temporary work-around I have replaced '\x95' 
with '*', and now the pdf view renders without errors.
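
The failure can be reproduced outside web2py and pyfpdf. The following is a minimal sketch, assuming the text that reaches normalize_text() is five spaces of indentation, the bullet byte, and a trailing space, as the format string and "position 5" in the traceback suggest:

```python
# Assumed input, rebuilt from the format string in the traceback:
# '%s%s ' % (' ' * 5 * indent, bullet) with indent == 1 and bullet == '\x95'.
txt = b'     \x95 '

try:
    txt.decode('utf8')
    error = None
except UnicodeDecodeError as exc:
    # 0x95 is 0b10010101, a UTF-8 continuation byte, so it cannot
    # begin a character on its own.
    error = exc

print(error.start, error.reason)

# The same byte is a valid bullet when read as Windows-1252.
as_cp1252 = txt.decode('cp1252')
print(as_cp1252 == u'     \u2022 ')
```

The byte is valid text in a single-byte Windows code page and only becomes a problem once the string is decoded as UTF-8.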

A permanent fix should probably set the bullet to '\x95' when latin-1 encoding 
is in effect, and to a suitable UTF-8 character when UTF-8 encoding is in 
effect.
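
One possible shape for such a fix is sketched below. It is not pyfpdf code: bullet_bytes and BULLET are hypothetical names, and the sketch assumes the active encoding is available as a codec name. It keeps the bullet as Unicode text and encodes it for whichever encoding is in effect, falling back to '*' when that encoding has no bullet.

```python
# Hypothetical helper, not part of pyfpdf: derive the bullet bytes from
# one Unicode character instead of hard-coding '\x95'.
BULLET = u'\u2022'  # U+2022 BULLET


def bullet_bytes(encoding):
    """Return the bullet encoded for `encoding`, or b'*' if it has none."""
    try:
        return BULLET.encode(encoding)
    except UnicodeEncodeError:
        # Same fallback as the temporary work-around described above.
        return b'*'


print(repr(bullet_bytes('cp1252')))
print(repr(bullet_bytes('utf-8')))
print(repr(bullet_bytes('latin-1')))
```

Note that strict ISO-8859-1 (Python's 'latin-1' codec) has no bullet character; 0x95 is a bullet only in Windows-1252, so the sketch falls back to '*' for 'latin-1'.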

Original issue reported on code.google.com by step.l...@gmail.com on 28 Oct 2013 at 6:25

GoogleCodeExporter commented 9 years ago

Please find complete steps to reproduce this issue here: 
https://groups.google.com/d/msg/web2py/2YlRfHhvOvY/j_YWtsMlBtgJ

Original comment by step.l...@gmail.com on 29 Oct 2013 at 5:43

GoogleCodeExporter commented 9 years ago


Thanks, I'll look at this ASAP. What character code do you suggest for UTF-8?

Original comment by reingart@gmail.com on 29 Oct 2013 at 6:04

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

I'd probably go with the standard bullet
u'\u2022'

See also 
http://en.wikipedia.org/wiki/Bullet_%28typography%29?section=3#Computer_encoding_and_keyboard_entry

Original comment by step.l...@gmail.com on 29 Oct 2013 at 6:47

GoogleCodeExporter commented 9 years ago

I just checked the Trebuchet MS font on Windows XP SP3 and it does have the 
Unicode bullet character at U+2022. This is not to say that all fonts that ship 
with Windows do, but at least one very common font does.

Original comment by step.l...@gmail.com on 30 Oct 2013 at 12:52

reingart / pyfpdf_googlecode

bullet in <ul> incompatible with utf-8 encoding #66