Closed (flintforge closed this issue 1 year ago)
Was the input passed to gumbo properly encoded as UTF-8, given that gumbo only supports UTF-8? And was the length calculated in bytes, after encoding as UTF-8?
The problem is when parsing combining diacritics. Otherwise the handling of Unicode is fine.
Did you run a Unicode normalization routine on your string before converting it to UTF-8? I believe HTML5 recommends "precomposed" NFC normalization for all text. If you pass it text in NFD form when it expects NFC, or vice versa, your byte counts will differ after converting to UTF-8.
See python3's unicodedata module and try form NFC and then NFD to see what impact it has on the issue.
unicodedata.normalize(form, unistr)
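To make the suggestion above concrete, here is a minimal sketch that only uses the standard library (no gumbo involved). The helper name utf8_sizes is made up for illustration, and the sample string is my reading of the test input used in this thread, written with escapes so that copy/paste cannot silently renormalize it.

```python
import unicodedata

def utf8_sizes(text, form):
    """Return (code point count, UTF-8 byte count) after normalizing text to form."""
    normalized = unicodedata.normalize(form, text)
    return len(normalized), len(normalized.encode("utf-8"))

# "ABC" + omicron + combining dasia (U+0314) + space + combining acute (U+0301) + "XYZ"
sample = "ABC\u03bf\u0314 \u0301XYZ"

for form in ("NFC", "NFKC", "NFD", "NFKD"):
    print(form, utf8_sizes(sample, form))
```

For this sample, NFC should give 9 code points in 12 bytes and NFD 10 code points in 13 bytes. In both forms the code point count is smaller than the byte count, so normalization changes the numbers but does not make them agree.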
Hi again, I've been looking into that. The bytes given on the standard input are so far correct on the python part.
echo -n "ABCὁ´XYZ" | python3 -c '
import sys, gumbo
from unicodedata import normalize
text = ""
for line in sys.stdin: text += line
print("\n"," ",bytes(text.encode("UTF8")), line, "\n")
for i,form in enumerate(["NFC", "NFKC", "NFD", "NFKD"]):
    nrmtext = normalize(form,text)
    with gumbo.gumboc.parse(nrmtext) as output:
        print(output.contents.root.contents)
    print(i," ",bytes(nrmtext.encode("UTF8")), line, form)
'
outputs:
b'ABC\xce\xbf\xcc\x94 \xcc\x81XYZ' ABCὁ´XYZ
<HTML>
<HEAD>
</HEAD>
<BODY>
Text(b'ABC\xe1\xbd\x81 \xcc\x81')</BODY></HTML>
0 b'ABC\xe1\xbd\x81 \xcc\x81XYZ' ABCὁ´XYZ NFC
<HTML>
<HEAD>
</HEAD>
<BODY>
Text(b'ABC\xe1\xbd\x81 \xcc\x81')</BODY></HTML>
1 b'ABC\xe1\xbd\x81 \xcc\x81XYZ' ABCὁ´XYZ NFKC
<HTML>
<HEAD>
</HEAD>
<BODY>
Text(b'ABC\xce\xbf\xcc\x94 \xcc\x81')</BODY></HTML>
2 b'ABC\xce\xbf\xcc\x94 \xcc\x81XYZ' ABCὁ´XYZ NFD
<HTML>
<HEAD>
</HEAD>
<BODY>
Text(b'ABC\xce\xbf\xcc\x94 \xcc\x81')</BODY></HTML>
3 b'ABC\xce\xbf\xcc\x94 \xcc\x81XYZ' ABCὁ´XYZ NFKD
The only thing happening here before gumbo starts is that the combining diacritics get converted to their corresponding Unicode code points. The text string will display correctly when copy/pasted in a text editor or a shell.
The bug is in gumboc.py here:
https://github.com/google/gumbo-parser/blob/master/python/gumbo/gumboc.py#L388
They pass in len(text) and not len(text.encode('utf-8')).
Try changing that and all should work correctly.
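You can see the effect without gumbo at all. This is only a sketch, assuming the binding hands the C library the UTF-8 buffer together with len(text) as its length, as described above; the variable names are mine and not taken from gumboc.py.

```python
# NFC form of the test string from this thread, written with escapes.
text = "ABC\u1f41 \u0301XYZ"
encoded = text.encode("utf-8")

char_count = len(text)     # number of code points: the length reportedly passed in
byte_count = len(encoded)  # number of UTF-8 bytes: what a parser reading bytes needs

# A parser told to read only char_count bytes stops before the end of the buffer.
seen_by_parser = encoded[:char_count]

print(char_count, byte_count)
print(seen_by_parser)
```

The truncated buffer ends right after the combining acute, which matches the Text(...) node in the NFC output above where XYZ is missing. Slicing with byte_count instead returns the whole buffer.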
Just so you are aware, no one official has responded to anyone for years so this project is effectively abandoned. Most of us have created our own forks for our own projects and have fixed the reported bugs and added features.
My project, Sigil-Ebook, has created a Sigil gumbo fork. Our version of gumboc.py does not have this bug, which is why I could not reproduce it.
You might want to consider either forking this project and maintaining your own version, as we did, or, if you are looking for a supported gumbo-based HTML5 parser for Python, you might want to consider:
https://github.com/kovidgoyal/html5-parser
Good luck!
I met this error with diacritical accents from the Greek alphabet
outputs
Notice that two characters are missing at the end.
Tested with python 3.5 & 3.8, libgumbo1.0.0 (gumbo-0.10.1-py3.8.egg)