google / gumbo-parser

An HTML5 parsing library in pure C99
Apache License 2.0
5.16k stars, 663 forks

Error on diacritical accents #422

Closed flintforge closed 1 year ago

flintforge commented 4 years ago

I ran into this error with diacritical accents from the Greek alphabet:

echo "START῾´COMPLETE" | python3 -c '
import sys, gumbo
file = ""
for line in sys.stdin: file += line
with gumbo.gumboc.parse(file) as output:
    print(output.contents.root.contents)'

outputs

<HTML>
<HEAD>
</HEAD>
<BODY>
Text(b'START\xe1\xbf\xbe\xc2\xb4COMPLE')</BODY></HTML>

Notice that the last two characters ("TE") are missing at the end.

Tested with python 3.5 & 3.8, libgumbo1.0.0 (gumbo-0.10.1-py3.8.egg)

kevinhendricks commented 4 years ago

Was the input passed to gumbo properly encoded as UTF-8? Gumbo only supports UTF-8. Was the length in bytes, after encoding as UTF-8, calculated correctly?
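One quick way to check the second question, independent of gumbo, is to compare the character count with the UTF-8 byte count in plain Python. This is only a minimal sketch; the sample string is the one from the report, written with escapes so the code points are unambiguous.

```python
# Sample text from the report: U+1FFE (Greek dasia) and U+00B4 (acute accent).
text = "START\u1ffe\u00b4COMPLETE"

encoded = text.encode("utf-8")

char_count = len(text)      # number of Unicode code points
byte_count = len(encoded)   # number of bytes a C library has to read

print(char_count, byte_count)
```

If the two numbers differ, any length handed to the C library has to be the byte count.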

flintforge commented 4 years ago

The problem occurs only when parsing combining diacritics. Otherwise the handling of Unicode is fine.

kevinhendricks commented 4 years ago

Did you run a Unicode normalization routine on your string before converting it to UTF-8? I believe HTML5 recommends "precomposed" NFC normalization for all text. If you pass it text in NFD form when it expects NFC, or vice versa, your byte counts will differ after converting to UTF-8.

kevinhendricks commented 4 years ago

See python3's unicodedata module and try form NFC and then NFD to see what impact it has on the issue.

unicodedata.normalize(form, unistr)
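As a rough illustration of what to look for, the sketch below normalizes a decomposed string both ways and reports the character and byte counts. The sample string and the `counts` helper are made up for this example; only `unicodedata.normalize` is a real library call.

```python
from unicodedata import normalize

# Decomposed input: omicron (U+03BF) followed by the combining
# reversed comma above (U+0314), i.e. a rough breathing mark.
decomposed = "ABC\u03bf\u0314XYZ"

def counts(form: str, text: str) -> tuple:
    """Return (code points, UTF-8 bytes) of text after normalization."""
    normalized = normalize(form, text)
    return len(normalized), len(normalized.encode("utf-8"))

for form in ("NFC", "NFD"):
    print(form, counts(form, decomposed))
```

Both forms give different byte counts, and in both forms the byte count is larger than the character count.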

flintforge commented 4 years ago

Hi again, I've been looking into that. The bytes received on standard input are correct on the Python side.

echo -n "ABCὁ´XYZ" | python3 -c '
import sys, gumbo
from unicodedata import normalize
text = ""
for line in sys.stdin: text += line
print("\n","   ",bytes(text.encode("UTF8")), line, "\n")
for i,form in enumerate(["NFC", "NFKC", "NFD", "NFKD"]):
    nrmtext = normalize(form,text)
    with gumbo.gumboc.parse(nrmtext) as output:
        print(output.contents.root.contents)
        print(i,"  ",bytes(nrmtext.encode("UTF8")), line, form)
'

outputs:

     b'ABC\xce\xbf\xcc\x94 \xcc\x81XYZ' ABCὁ´XYZ 

<HTML>
<HEAD>
</HEAD>
<BODY>
Text(b'ABC\xe1\xbd\x81 \xcc\x81')</BODY></HTML>
0    b'ABC\xe1\xbd\x81 \xcc\x81XYZ' ABCὁ´XYZ NFC
<HTML>
<HEAD>
</HEAD>
<BODY>
Text(b'ABC\xe1\xbd\x81 \xcc\x81')</BODY></HTML>
1    b'ABC\xe1\xbd\x81 \xcc\x81XYZ' ABCὁ´XYZ NFKC
<HTML>
<HEAD>
</HEAD>
<BODY>
Text(b'ABC\xce\xbf\xcc\x94 \xcc\x81')</BODY></HTML>
2    b'ABC\xce\xbf\xcc\x94 \xcc\x81XYZ' ABCὁ´XYZ NFD
<HTML>
<HEAD>
</HEAD>
<BODY>
Text(b'ABC\xce\xbf\xcc\x94 \xcc\x81')</BODY></HTML>
3    b'ABC\xce\xbf\xcc\x94 \xcc\x81XYZ' ABCὁ´XYZ NFKD

The only thing that happens before gumbo starts is that normalization converts the combining diacritics to their corresponding Unicode code points. In all four forms the trailing "XYZ" is missing from the parsed text. The text string displays correctly when copied and pasted into a text editor or a shell.

kevinhendricks commented 4 years ago

Bug is in gumboc.py here:

https://github.com/google/gumbo-parser/blob/master/python/gumbo/gumboc.py#L388

They pass in len(text) and not len(text.encode('utf-8'))

Try changing that and all should work correctly.
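The effect of passing a character count where a byte count is expected can be reproduced without gumbo. The sketch below is only a simulation: `truncate_like_bug` and `full_bytes` are hypothetical helpers, not part of gumboc.py, and they assume the C side reads exactly as many bytes as it is told to.

```python
def truncate_like_bug(text: str) -> bytes:
    """Keep only len(text) bytes, as if a character count were used as a byte count."""
    encoded = text.encode("utf-8")
    return encoded[:len(text)]

def full_bytes(text: str) -> bytes:
    """Keep len(encoded) bytes, i.e. the length computed after encoding."""
    encoded = text.encode("utf-8")
    return encoded[:len(encoded)]

# The string from the first report; echo appends the trailing newline.
sample = "START\u1ffe\u00b4COMPLETE\n"

print(truncate_like_bug(sample))
print(full_bytes(sample))
```

The first line printed ends in `COMPLE`, matching the truncated Text node in the original report. The second contains the whole input.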

kevinhendricks commented 4 years ago

Just so you are aware, no one official has responded to anyone for years so this project is effectively abandoned. Most of us have created our own forks for our own projects and have fixed the reported bugs and added features.

My project, Sigil-Ebook, has created a Sigil gumbo fork. Our gumboc.py version does not have this bug, which is why I could not reproduce it.

You might want to either fork this project as we did and maintain your own version or, if you are looking for a supported gumbo-based HTML5 parser for Python, consider:

https://github.com/kovidgoyal/html5-parser

Good luck!