Closed fake-name closed 1 year ago
Ok, at least part of this is https://github.com/google/gumbo-parser/pull/343, and ~from . import gumboc_tags
fixes that (I could swear I tested that before just moving the contents of gumboc_tags).
The fix suggested there does not work:
>>> import gumbo
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.4/dist-packages/gumbo/__init__.py", line 33, in <module>
from gumbo.gumboc import *
File "/usr/local/lib/python3.4/dist-packages/gumbo/gumboc.py", line 29, in <module>
from . import gumboc_tags
ImportError: cannot import name 'gumboc_tags'
Anyways, the library path search is still completely broken. What system did a normally installed gumbo
library ever work on?
After jumping through the asinine build process for the google test framework, the tests are failing:
durr@bigsrv:~/gumbo-parser⟫ make check
make gumbo_test
make[1]: Entering directory `/home/durr/gumbo-parser'
CXX tests/gumbo_test-attribute.o
CXX tests/gumbo_test-char_ref.o
CXX tests/gumbo_test-parser.o
CXX tests/gumbo_test-string_buffer.o
CXX tests/gumbo_test-string_piece.o
CXX tests/gumbo_test-tokenizer.o
CXX tests/gumbo_test-test_utils.o
CXX tests/gumbo_test-utf8.o
CXX tests/gumbo_test-vector.o
CXXLD gumbo_test
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../lib/libgtest.a(gtest-all.cc.o): In function `testing::internal::UnitTestImpl::UnitTestImpl(testing::UnitTest*)':
gtest-all.cc:(.text+0xfc5f): undefined reference to `pthread_key_create'
gtest-all.cc:(.text+0xfe59): undefined reference to `pthread_key_create'
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../lib/libgtest.a(gtest-all.cc.o): In function `testing::internal::ThreadLocal<testing::TestPartResultReporterInterface*>::~ThreadLocal()':
gtest-all.cc:(.text._ZN7testing8internal11ThreadLocalIPNS_31TestPartResultReporterInterfaceEED2Ev[_ZN7testing8internal11ThreadLocalIPNS_31TestPartResultReporterInterfaceEED5Ev]+0xb): undefined reference to `pthread_getspecific'
gtest-all.cc:(.text._ZN7testing8internal11ThreadLocalIPNS_31TestPartResultReporterInterfaceEED2Ev[_ZN7testing8internal11ThreadLocalIPNS_31TestPartResultReporterInterfaceEED5Ev]+0x20): undefined reference to `pthread_key_delete'
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../lib/libgtest.a(gtest-all.cc.o): In function `testing::internal::ThreadLocal<std::vector<testing::internal::TraceInfo, std::allocator<testing::internal::TraceInfo> > >::~ThreadLocal()':
gtest-all.cc:(.text._ZN7testing8internal11ThreadLocalISt6vectorINS0_9TraceInfoESaIS3_EEED2Ev[_ZN7testing8internal11ThreadLocalISt6vectorINS0_9TraceInfoESaIS3_EEED5Ev]+0x12): undefined reference to `pthread_getspecific'
gtest-all.cc:(.text._ZN7testing8internal11ThreadLocalISt6vectorINS0_9TraceInfoESaIS3_EEED2Ev[_ZN7testing8internal11ThreadLocalISt6vectorINS0_9TraceInfoESaIS3_EEED5Ev]+0x28): undefined reference to `pthread_key_delete'
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../lib/libgtest.a(gtest-all.cc.o): In function `testing::internal::ThreadLocal<std::vector<testing::internal::TraceInfo, std::allocator<testing::internal::TraceInfo> > >::GetOrCreateValue() const':
gtest-all.cc:(.text._ZNK7testing8internal11ThreadLocalISt6vectorINS0_9TraceInfoESaIS3_EEE16GetOrCreateValueEv[_ZNK7testing8internal11ThreadLocalISt6vectorINS0_9TraceInfoESaIS3_EEE16GetOrCreateValueEv]+0x16): undefined reference to `pthread_getspecific'
gtest-all.cc:(.text._ZNK7testing8internal11ThreadLocalISt6vectorINS0_9TraceInfoESaIS3_EEE16GetOrCreateValueEv[_ZNK7testing8internal11ThreadLocalISt6vectorINS0_9TraceInfoESaIS3_EEE16GetOrCreateValueEv]+0x171): undefined reference to `pthread_setspecific'
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../lib/libgtest.a(gtest-all.cc.o): In function `testing::internal::ThreadLocal<testing::TestPartResultReporterInterface*>::GetOrCreateValue() const':
gtest-all.cc:(.text._ZNK7testing8internal11ThreadLocalIPNS_31TestPartResultReporterInterfaceEE16GetOrCreateValueEv[_ZNK7testing8internal11ThreadLocalIPNS_31TestPartResultReporterInterfaceEE16GetOrCreateValueEv]+0xe): undefined reference to `pthread_getspecific'
gtest-all.cc:(.text._ZNK7testing8internal11ThreadLocalIPNS_31TestPartResultReporterInterfaceEE16GetOrCreateValueEv[_ZNK7testing8internal11ThreadLocalIPNS_31TestPartResultReporterInterfaceEE16GetOrCreateValueEv]+0xbb): undefined reference to `pthread_setspecific'
collect2: error: ld returned 1 exit status
make[1]: *** [gumbo_test] Error 1
make[1]: Leaving directory `/home/durr/gumbo-parser'
make: *** [check-am] Error 2
It looks like something, somewhere, is failing to link libpthread.
I have no idea where to look with the build process, though. I'm only vaguely familiar with make, and have no idea about autoconf.
Yes, the Python part doesn't work under py3.4: multiple import errors, poor formatting style, wrong C library loading code, and multiple errors when trying to parse something using the adapters. These errors seem to affect every Python version, not just py3.
Sigil has a fixed version for python 3.4 along with a new beautiful soup 4 adapter that works with the version of the gumbo parser that has been specially modified for use inside Sigil. I am sure it could be easily fixed/adapted for the official gumbo parser. Let me know if you need or want a copy and I'll take a shot at adapting what we have to work here.
@kevinhendricks
Sigil
Always wondered how can I find these specific about-demonic pictograms.
Are you talking about this code: https://github.com/Sigil-Ebook/sigil-gumbo/tree/master/python/gumbo ? Personally I'm interested in having a working html5lib adapter, so thanks for pointing to a possibly working version. A quick inspection didn't make me happy though: it still has bad imports that require changes to PYTHONPATH to work. Since it has __init__.py, it tries to be a Python package, so it must use relative imports to behave well.
That version is for specific use inside of Sigil's plugin python 2.7 and python3.4 environment. It is set up to work with BS4 (also used internally by Sigil) not html5lib but the bulk of it should be adaptable.
On Mar 23, 2016, at 1:57 PM, Vitalik Verhovodov notifications@github.com wrote:
@kevinhendricks
Sigil
Always wondered how can I find these specific about-demonic pictograms. Are you talking about this code: https://github.com/Sigil-Ebook/sigil-gumbo/tree/master/python/gumbo ? Personally I'm interested to have working html5lib adapter, thanks for pointing onto possible working version. Fast inspection didn't make me happy though, it still has bad importing requiring adding changes in PYTHONPATH to work. Since it has init.py it tries to be a python package and it must use relative imports to behave well.
— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub
If it helps the sigil-python code we actually deploy to interface to gumbo is here:
https://github.com/Sigil-Ebook/Sigil/tree/master/src/Resource_Files/plugin_launchers/python
See:
sigil_gumboc.py sigil_gumboc_tags.py sigil_gumbo_bs4_adapter.py
I'm rewriting the binding here: https://github.com/neumond/scutigera/tree/master/scutigera/gumbo I decided to throw away the overly complex Enum class, leaving gumboc as an almost pure C interface. I wiped the gumboc_tags file; there is no need to keep it, since its contents can be obtained dynamically from the dll. Now I'm trying to write an adapter that conforms to the html5lib test suite. It seems gumbo fails several tests, and I need some time to investigate these issues better. Most of the tests pass, though.
Is it worth using CFFI or Cython instead of ctypes? PyPy recommends CFFI (it looks like they bet on the JIT to optimize the interaction), while Cython is advertised in multiple articles across the internet as the fastest possible solution after a pure CPython extension.
Hmm. Now the question is what encoding gumbo uses internally. It accepts a buffer of bytes. Some of the tests fail when decoding the output as utf-8. Even if it treats the input as ascii, which is enough for HTML parsing, it still needs to choose an encoding to decode html entities like 䔡
gumbo only works with properly utf-8 encoded html files. If an html file has any other encoding, it must be converted to utf-8 before being parsed by gumbo. See the readme on this site for details. Also, the source being parsed must continue to exist in memory, as pointers into the original source exist in the parsed tree.
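A minimal sketch of that conversion step, assuming the source encoding is already known. The helper name to_utf8 and the Latin-1 sample input are made up for illustration and are not part of gumbo or its bindings.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode raw HTML bytes as UTF-8 (hypothetical helper, not a gumbo API)."""
    # Decode strictly so that a wrong source_encoding fails loudly
    # instead of silently producing mojibake.
    text = raw.decode(source_encoding, errors="strict")
    return text.encode("utf-8")


# Example: a Latin-1 encoded fragment containing "é" (byte 0xE9).
latin1_html = b"<p>caf\xe9</p>"
utf8_html = to_utf8(latin1_html, "latin-1")
print(utf8_html)  # b'<p>caf\xc3\xa9</p>'
```

Whatever binding is used, keep a reference to the converted buffer (utf8_html here) for as long as the parse tree is alive, since the tree points into the original source.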
it must be converted to utf-8 before being parsed by gumbo
Exactly as I supposed it to be. Well, for the test FOO�, gumbo over the ctypes binding outputs a text node with b'FOO\xc7'.
If I hack the prettyprint example to output char codes:
// line 195
for (int j = 0; j < 8; j++) {
  std::cout << ((int) child->v.text.text[j]) << std::endl;
}
It's b'FOO\xc7\n' (70,79,79,-57,10,0). Using the html5lib parser I get Unicode code point 65533, or b'FOO\xef\xbf\xbd' in utf-8.
Magic.
By the way, how can I check html5lib test suite, some tests look unreasonable for me. Id est gumbo works properly and html5lib test expects wrong, e.g. for noscript tag test.
UPD. It is an interesting character (65533) http://www.fileformat.info/info/unicode/char/0fffd/index.htm
REPLACEMENT CHARACTER used to replace an incoming character whose value is unknown or unrepresentable in Unicode
UPD2. Very interesting :)
b'FOO\xc7'.decode('utf-8', errors='replace').encode('utf-8')
b'FOO\xef\xbf\xbd'
UPD3. Now it's better.
168 failed, 15944 passed, 3056 skipped
124 failed, 15988 passed, 3056 skipped
b'FOO\xc7'.decode('utf-8', errors='replace').encode('utf-8')
As far as I'm aware, Gumbo's output shouldn't ever be invalid UTF-8. Certainly, per spec, Gumbo should be outputting U+FFFD for that, and definitely shouldn't be outputting something broken!
By the way, how can I check html5lib test suite, some tests look unreasonable for me. Id est gumbo works properly and html5lib test expects wrong, e.g. for noscript tag test.
A good starting point nowadays is to look at what your favourite browser does in the Live DOM Viewer, though that doesn't work in the case of noscript tests with #script-off. Otherwise, uh, the best things are likely either asking in #whatwg on freenode or filing a bug on html5lib-tests. Or just ping me here.
Gumbo should be outputting U+FFFD for that
Ok, I think it's time to dig into the gumbo code to repair this. Who knows, maybe in some cases gumbo will output valid utf8 where it must output replacements.
Regarding noscript, that's one of the obscure things I didn't know about. Consider this example:
<p id="status"><noscript><strong>A</strong></noscript><span>B</span></p>
If I inspect the DOM in Firefox with javascript turned on, I have a <noscript> node with inner text "<strong>A</strong>". But if I disable javascript, I have a separate <strong> node with text "A". Looks like gumbo behaves as if I disable javascript, but html5lib test expects it to be as if I enable javascript. There should be a parameter for parse, I guess.
So if you use a numeric entity like & # 111111111111 ; (which takes a minimum of 5 bytes to even represent as hex) or any other illegal Unicode code point, the spec says to output U+FFFD? Is that right?
I know gumbo does output proper utf-8 encoded values for legal numeric entities.
For example: & # x F F F D ; results in the proper utf-8 byte string of 0xEF 0xBF 0xBD in the serialized output.
The problem is overflow of an int type in src/char_ref.c:
int codepoint = 0;
bool status = true;
do {
  codepoint = (codepoint * (is_hex ? 16 : 10)) + digit;
  utf8iterator_next(input);
  digit = parse_digit(utf8iterator_current(input), is_hex);
} while (digit != -1);
if (utf8iterator_current(input) != ';') {
  add_codepoint_error(
      parser, input, GUMBO_ERR_NUMERIC_CHAR_REF_WITHOUT_SEMICOLON, codepoint);
  status = false;
} else {
  utf8iterator_next(input);
}
int replacement = maybe_replace_codepoint(codepoint);
if (replacement != -1) {
  add_codepoint_error(
      parser, input, GUMBO_ERR_NUMERIC_CHAR_REF_INVALID, codepoint);
  *output = replacement;
  return false;
}
if ((codepoint >= 0xd800 && codepoint <= 0xdfff) || codepoint > 0x10ffff) {
  add_codepoint_error(
      parser, input, GUMBO_ERR_NUMERIC_CHAR_REF_INVALID, codepoint);
  *output = 0xfffd;
  return false;
}
Before each iteration for adding the next char digit it needs to check and prevent overflow of the codepoint value (anything greater than 0x10ffff) while still continuing to consume the bad numeric entity until it gets to a non-digit.
If you look at 111111111111 as hex (0x19debd01c7) it overflows an int type and the last byte value is the one you are seeing in the output (0xc7).
The overflow prevents this snippet of code from working:
if ((codepoint >= 0xd800 && codepoint <= 0xdfff) || codepoint > 0x10ffff) {
  add_codepoint_error(
      parser, input, GUMBO_ERR_NUMERIC_CHAR_REF_INVALID, codepoint);
  *output = 0xfffd;
  return false;
}
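To make the arithmetic concrete, here is a rough Python simulation of the unpatched accumulation loop. It assumes a 32-bit int that wraps around in two's complement (signed overflow is formally undefined in C, so this is only what typically happens), and it leaves out the maybe_replace_codepoint step. The function names are made up for the sketch.

```python
def wrap_int32(value: int) -> int:
    """Reduce value to a signed 32-bit integer (assumed two's-complement wraparound)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def accumulate_unpatched(digits: str, is_hex: bool = False) -> int:
    """Mimic the original loop: multiply and add with no range check."""
    base = 16 if is_hex else 10
    codepoint = 0
    for ch in digits:
        codepoint = wrap_int32(codepoint * base + int(ch, base))
    return codepoint


codepoint = accumulate_unpatched("111111111111")

# The range check from char_ref.c, applied to the wrapped value.
is_flagged = (0xD800 <= codepoint <= 0xDFFF) or codepoint > 0x10FFFF

print(hex(codepoint & 0xFFFFFFFF))  # 0xdebd01c7: low 32 bits of 0x19debd01c7
print(hex(codepoint & 0xFF))        # 0xc7: the stray byte seen in the output
print(is_flagged)                   # False: the wrapped value is negative
```

Under these assumptions the wrapped value is negative, so neither half of the range check fires and U+FFFD is never substituted.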
One complication: char_ref.c is generated from char_ref.rl for speed, so once a proper fix is made, it will have to be applied to char_ref.rl and char_ref.c regenerated.
Hope this helps.
FWIW, since 0x10ffff * 16 easily fits inside an int, we do not need to catch int overflow; we just need to catch the first time the value exceeds 0x10ffff, but keep parsing until a non-digit. The final snippet (see above) will take care of the rest.
So this patch in char_ref.c did the trick for me:
--- char_ref.c.keep 2016-04-20 10:42:38.000000000 -0400
+++ char_ref.c 2016-04-20 10:48:08.000000000 -0400
@@ -166,8 +166,10 @@
int codepoint = 0;
bool status = true;
+ bool bad_value = false;
do {
- codepoint = (codepoint * (is_hex ? 16 : 10)) + digit;
+ if (!bad_value) codepoint = (codepoint * (is_hex ? 16 : 10)) + digit;
+ bad_value = codepoint > 0x10ffff;
utf8iterator_next(input);
digit = parse_digit(utf8iterator_current(input), is_hex);
} while (digit != -1);
So if you use a numeric entity like & # 111111111111 ; (which takes a minimum of 5 bytes to even represent as hex) or any other illegal Unicode code point, the spec says to output U+FFFD? Is that right?
That's right.
FWIW, that bug looks almost identical to the Gecko bug that led to those tests being written; the fix LGTM.
Ok, I think it's time to dig into the gumbo code to repair this. Who knows, maybe in some cases gumbo will output valid utf8 where it must output replacements.
I'd strongly encourage you to use .decode("utf8", "strict"), catch a UnicodeDecodeError, and treat it as a test failure, because any case where we have invalid UTF-8 is a bug in Gumbo.
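A small sketch of what that could look like in an adapter's test helper. The function name decode_gumbo_text is hypothetical; only bytes.decode and UnicodeDecodeError are real Python here.

```python
def decode_gumbo_text(raw: bytes) -> str:
    """Decode text bytes coming back from gumbo, failing the test on invalid UTF-8."""
    try:
        return raw.decode("utf8", "strict")
    except UnicodeDecodeError as exc:
        # Invalid UTF-8 from the parser is treated as a parser bug,
        # not something to paper over with errors="replace".
        raise AssertionError(f"gumbo produced invalid UTF-8: {raw!r}") from exc
```

With this in place, decode_gumbo_text(b'FOO\xc7') raises AssertionError, while b'FOO\xef\xbf\xbd' decodes to 'FOO' followed by U+FFFD.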
Looks like gumbo behaves as if I disable javascript, but html5lib test expects it to be as if I enable javascript.
If there's a test that expects the script enabled parsing, it should have the #script-on flag (see https://github.com/html5lib/html5lib-tests/blob/master/tree-construction/README.md for documentation). If there's one that doesn't have it, please send a PR adding it! (The testsuite is predominantly run with scripting enabled, so I wouldn't be surprised if some tests were missing the needed flags.)
A simpler patch might be to remove bad_value and simply test if codepoint <= 0x10ffff before scaling the codepoint and adding the digit.
Either way once it exceeds 0x10ffff it will stop updating and prevent any overflow.
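For comparison, both variants of the fix can be simulated in Python. This is only a model of the loop logic, not the C code itself, and the function names are made up for the sketch.

```python
MAX_CODEPOINT = 0x10FFFF


def accumulate_with_flag(digits: str, is_hex: bool = False) -> int:
    """Model of the patch above: stop updating once bad_value is set."""
    base = 16 if is_hex else 10
    codepoint = 0
    bad_value = False
    for ch in digits:
        if not bad_value:
            codepoint = codepoint * base + int(ch, base)
        bad_value = codepoint > MAX_CODEPOINT
    return codepoint


def accumulate_with_guard(digits: str, is_hex: bool = False) -> int:
    """Model of the simpler variant: test the range before scaling."""
    base = 16 if is_hex else 10
    codepoint = 0
    for ch in digits:
        if codepoint <= MAX_CODEPOINT:
            codepoint = codepoint * base + int(ch, base)
    return codepoint


for digits in ("111111111111", "65533"):
    print(digits, accumulate_with_flag(digits), accumulate_with_guard(digits))
```

In this model the two variants return the same value. The value never grows past 0x10ffff * 16 + 15, which fits in a 32-bit int, and anything above 0x10ffff is still caught by the existing range check that outputs 0xfffd.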
Trying to add script parameter into html5lib.
[SOLVED. Found .pytest.expect file, https://github.com/gsnedders/pytest-expect] Somehow all tests with script-off are masked with xfail. I've commented out DataLossWarning try-catcher, multiple tests started to fail, but script-off ones are still xfail-masked. Even grepping 'xfail' didn't help, there's no such text in whole project.
Command to run tests ignoring expect plugin: python -m pytest -s -p no:expect -m "ElementTree and namespaced" html5lib/tests/
23593 passed, 3064 skipped, 573 xfailed, 48 xpassed
Looks like 48 tests are working correctly now. How far did I go in just using gumbo in my project..
Regarding pullrequest for gumbo-parser. I don't know. I have version that works well with py3 and html5lib only. At least now I can import and use it, and it can use system-wide gumbo installation. I guess @kevinhendricks has good implementation for beautiful soup, not sure whether it's py2 or py3. It has many changes including testing through drop-in replacement of native html5lib parser and removal of gumboc_tags.py.
@nostrademons : what do you require for such PR? Do you require py2 and existing importing scheme?
beautiful soup 4 adapter is python3.
FWIW, to fix the issue of numeric overflow preventing invalid numeric entities (such as & # 111111111111 ;) from being successfully detected as an error: Sigil's gumbo has included the following changes in our older version of these files.
https://github.com/Sigil-Ebook/Sigil/commit/4a7afd4d216b14faea2135bb1ac105175bc8ae05
I fixed this problem by creating a link from libgumbo.so.1.0.0 to gumbo/libgumbo.so. By default, libgumbo.so.1.0.0 can be found at /usr/local/lib/, so:
ln -s /usr/local/lib/libgumbo.so.1.0.0 /usr/local/lib/python2.7/dist-packages/gumbo-0.10.1-py2.7.egg/gumbo/libgumbo.so
Nick-Alam's solution worked for me on Ubuntu 14.04. pydoc3 gumbo lists the files in the gumbo package as: bs4_adapter, bs4_adapter_test, gumboc, gumboc_tags, gumboc_test, html5lib_adapter, html5lib_adapter_test, libgumbo, soup_adapter, soup_adapter_test. All of these are installed by the python3 setup.py install script except libgumbo.so.
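As an alternative to the symlink, the loading code could try several locations in turn. The following is only a sketch of that idea, not what gumboc.py actually does: the function names are hypothetical, and the hardcoded fallback is simply the path mentioned in this thread.

```python
import ctypes
import ctypes.util
import os


def candidate_library_paths(package_dir: str) -> list:
    """Build an ordered list of places to look for libgumbo (illustrative only)."""
    # 1. A copy or symlink next to the Python package, as in the workaround above.
    candidates = [os.path.join(package_dir, "libgumbo.so")]
    # 2. Whatever the system linker configuration knows about, if anything.
    found = ctypes.util.find_library("gumbo")
    if found:
        candidates.append(found)
    # 3. Assumption: the default install location mentioned in this thread.
    candidates.append("/usr/local/lib/libgumbo.so.1.0.0")
    return candidates


def load_first(candidates, loader=ctypes.CDLL):
    """Return the first library that loads, or raise OSError listing every attempt."""
    errors = []
    for path in candidates:
        try:
            return loader(path)
        except OSError as exc:
            errors.append(f"{path}: {exc}")
    raise OSError("could not load libgumbo; tried:\n" + "\n".join(errors))
```

A binding could then call load_first(candidate_library_paths(os.path.dirname(__file__))) at import time, so a missing library produces one error message listing every path that was tried.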
I'm trying to get gumbo to work with python on ubuntu 14.04, and not having much luck.
I built and installed gumbo by cloning the master branch:
And then the python extensions:
On python 3:
I patched the import of gumboc_tags by just copying the contents of that file (it's just a single big array) into gumboc.py, then fixed the library search path issue (I just hardcoded the library path to "/usr/local/lib/libgumbo.so.1.0.0"), and it then imports, but gumbo.soup_parse (which is what I want) doesn't seem to be present:
I also attempted to see if the version in PyPI would work, and it's non-functional after install for python3 (my app is python 3, I tested python 2 just to be thorough).