google / gumbo-parser

An HTML5 parsing library in pure C99
Apache License 2.0

Python interface seems to not work. #356

Closed fake-name closed 1 year ago

fake-name commented 8 years ago

I'm trying to get gumbo to work with Python on Ubuntu 14.04, and not having much luck.

I built and installed gumbo by cloning the master branch:

durr@bigsrv:~/gumbo-parser⟫ ./autogen.sh
+ libtoolize
libtoolize: putting auxiliary files in `.'.
libtoolize: linking file `./ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIR, `m4'.
libtoolize: linking file `m4/libtool.m4'
libtoolize: linking file `m4/ltoptions.m4'
libtoolize: linking file `m4/ltsugar.m4'
libtoolize: linking file `m4/ltversion.m4'
libtoolize: linking file `m4/lt~obsolete.m4'
+ aclocal -I m4
+ autoconf
+ automake --add-missing
configure.ac:13: installing './compile'
configure.ac:33: installing './config.guess'
configure.ac:33: installing './config.sub'
configure.ac:31: installing './install-sh'
configure.ac:31: installing './missing'
Makefile.am: installing './depcomp'
parallel-tests: installing './test-driver'
durr@bigsrv:~/gumbo-parser⟫ ./configure
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking for gcc... gcc
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether gcc understands -c and -o together... yes
checking for gcc option to accept ISO C99... -std=gnu99
checking how to run the C preprocessor... gcc -std=gnu99 -E
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking stddef.h usability... yes
checking stddef.h presence... yes
checking for stddef.h... yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking for strings.h... (cached) yes
checking for inline... inline
checking for size_t... yes
checking for main in -lgtest_main... no
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
/bin/bash: /home/durr/missing: No such file or directory
configure: WARNING: 'missing' script is too old or missing
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking for style of include used by make... GNU
checking whether make supports nested variables... yes
checking dependency style of gcc -std=gnu99... gcc3
checking dependency style of g++... gcc3
checking whether make supports nested variables... (cached) yes
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking how to print strings... printf
checking for a sed that does not truncate output... /bin/sed
checking for fgrep... /bin/grep -F
checking for ld used by gcc -std=gnu99... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 1572864
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands "+="... yes
checking how to convert x86_64-unknown-linux-gnu file names to x86_64-unknown-linux-gnu format... func_convert_file_noop
checking how to convert x86_64-unknown-linux-gnu file names to toolchain format... func_convert_file_noop
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for ar... ar
checking for archiver @FILE support... @
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from gcc -std=gnu99 object... ok
checking for sysroot... no
checking for mt... mt
checking if mt is a manifest tool... no
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc -std=gnu99 supports -fno-rtti -fno-exceptions... no
checking for gcc -std=gnu99 option to produce PIC... -fPIC -DPIC
checking if gcc -std=gnu99 PIC flag -fPIC -DPIC works... yes
checking if gcc -std=gnu99 static flag -static works... yes
checking if gcc -std=gnu99 supports -c -o file.o... yes
checking if gcc -std=gnu99 supports -c -o file.o... (cached) yes
checking whether the gcc -std=gnu99 linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking how to run the C++ preprocessor... g++ -E
checking for ld used by g++... /usr/bin/ld -m elf_x86_64
checking if the linker (/usr/bin/ld -m elf_x86_64) is GNU ld... yes
checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking for g++ option to produce PIC... -fPIC -DPIC
checking if g++ PIC flag -fPIC -DPIC works... yes
checking if g++ static flag -static works... yes
checking if g++ supports -c -o file.o... yes
checking if g++ supports -c -o file.o... (cached) yes
checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking dynamic linker characteristics... (cached) GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: creating gumbo.pc
config.status: executing depfiles commands
config.status: executing libtool commands
durr@bigsrv:~/gumbo-parser⟫ make
  CC       src/libgumbo_la-attribute.lo
  CC       src/libgumbo_la-char_ref.lo
  CC       src/libgumbo_la-error.lo
  CC       src/libgumbo_la-parser.lo
  CC       src/libgumbo_la-string_buffer.lo
  CC       src/libgumbo_la-string_piece.lo
  CC       src/libgumbo_la-tag.lo
  CC       src/libgumbo_la-tokenizer.lo
  CC       src/libgumbo_la-utf8.lo
  CC       src/libgumbo_la-util.lo
  CC       src/libgumbo_la-vector.lo
  CCLD     libgumbo.la
  CXX      examples/clean_text.o
  CXXLD    clean_text
  CXX      examples/find_links.o
  CXXLD    find_links
  CC       examples/get_title.o
  CCLD     get_title
  CXX      examples/positions_of_class.o
  CXXLD    positions_of_class
  CXX      benchmarks/benchmark.o
  CXXLD    benchmark
  CXX      examples/serialize.o
  CXXLD    serialize
  CXX      examples/prettyprint.o
  CXXLD    prettyprint
durr@bigsrv:~/gumbo-parser⟫ sudo make install
[sudo] password for durr:
make[1]: Entering directory `/home/durr/gumbo-parser'
 /bin/mkdir -p '/usr/local/lib'
 /bin/bash ./libtool   --mode=install /usr/bin/install -c   libgumbo.la '/usr/local/lib'
libtool: install: /usr/bin/install -c .libs/libgumbo.so.1.0.0 /usr/local/lib/libgumbo.so.1.0.0
libtool: install: (cd /usr/local/lib && { ln -s -f libgumbo.so.1.0.0 libgumbo.so.1 || { rm -f libgumbo.so.1 && ln -s libgumbo.so.1.0.0 libgumbo.so.1; }; })
libtool: install: (cd /usr/local/lib && { ln -s -f libgumbo.so.1.0.0 libgumbo.so || { rm -f libgumbo.so && ln -s libgumbo.so.1.0.0 libgumbo.so; }; })
libtool: install: /usr/bin/install -c .libs/libgumbo.lai /usr/local/lib/libgumbo.la
libtool: install: /usr/bin/install -c .libs/libgumbo.a /usr/local/lib/libgumbo.a
libtool: install: chmod 644 /usr/local/lib/libgumbo.a
libtool: install: ranlib /usr/local/lib/libgumbo.a
libtool: finish: PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/sbin" ldconfig -n /usr/local/lib
----------------------------------------------------------------------
Libraries have been installed in:
   /usr/local/lib

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the `LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the `LD_RUN_PATH' environment variable
     during linking
   - use the `-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to `/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
 /bin/mkdir -p '/usr/local/include'
 /usr/bin/install -c -m 644 src/gumbo.h src/tag_enum.h '/usr/local/include'
 /bin/mkdir -p '/usr/local/lib/pkgconfig'
 /usr/bin/install -c -m 644 gumbo.pc '/usr/local/lib/pkgconfig'
make[1]: Leaving directory `/home/durr/gumbo-parser'

And then the python extensions:

durr@bigsrv:~/gumbo-parser⟫ sudo python setup.py install
running install
running bdist_egg
running egg_info
writing python/gumbo.egg-info/PKG-INFO
writing top-level names to python/gumbo.egg-info/top_level.txt
writing dependency_links to python/gumbo.egg-info/dependency_links.txt
writing pbr to python/gumbo.egg-info/pbr.json
reading manifest file 'python/gumbo.egg-info/SOURCES.txt'
writing manifest file 'python/gumbo.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build/lib.linux-x86_64-2.7
creating build/lib.linux-x86_64-2.7/gumbo
copying python/gumbo/html5lib_adapter.py -> build/lib.linux-x86_64-2.7/gumbo
copying python/gumbo/gumboc.py -> build/lib.linux-x86_64-2.7/gumbo
copying python/gumbo/soup_adapter.py -> build/lib.linux-x86_64-2.7/gumbo
copying python/gumbo/__init__.py -> build/lib.linux-x86_64-2.7/gumbo
copying python/gumbo/html5lib_adapter_test.py -> build/lib.linux-x86_64-2.7/gumbo
copying python/gumbo/gumboc_tags.py -> build/lib.linux-x86_64-2.7/gumbo
copying python/gumbo/soup_adapter_test.py -> build/lib.linux-x86_64-2.7/gumbo
copying python/gumbo/gumboc_test.py -> build/lib.linux-x86_64-2.7/gumbo
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/gumbo
copying build/lib.linux-x86_64-2.7/gumbo/html5lib_adapter.py -> build/bdist.linux-x86_64/egg/gumbo
copying build/lib.linux-x86_64-2.7/gumbo/gumboc.py -> build/bdist.linux-x86_64/egg/gumbo
copying build/lib.linux-x86_64-2.7/gumbo/soup_adapter.py -> build/bdist.linux-x86_64/egg/gumbo
copying build/lib.linux-x86_64-2.7/gumbo/__init__.py -> build/bdist.linux-x86_64/egg/gumbo
copying build/lib.linux-x86_64-2.7/gumbo/html5lib_adapter_test.py -> build/bdist.linux-x86_64/egg/gumbo
copying build/lib.linux-x86_64-2.7/gumbo/gumboc_tags.py -> build/bdist.linux-x86_64/egg/gumbo
copying build/lib.linux-x86_64-2.7/gumbo/soup_adapter_test.py -> build/bdist.linux-x86_64/egg/gumbo
copying build/lib.linux-x86_64-2.7/gumbo/gumboc_test.py -> build/bdist.linux-x86_64/egg/gumbo
byte-compiling build/bdist.linux-x86_64/egg/gumbo/html5lib_adapter.py to html5lib_adapter.pyc
byte-compiling build/bdist.linux-x86_64/egg/gumbo/gumboc.py to gumboc.pyc
byte-compiling build/bdist.linux-x86_64/egg/gumbo/soup_adapter.py to soup_adapter.pyc
byte-compiling build/bdist.linux-x86_64/egg/gumbo/__init__.py to __init__.pyc
byte-compiling build/bdist.linux-x86_64/egg/gumbo/html5lib_adapter_test.py to html5lib_adapter_test.pyc
byte-compiling build/bdist.linux-x86_64/egg/gumbo/gumboc_tags.py to gumboc_tags.pyc
byte-compiling build/bdist.linux-x86_64/egg/gumbo/soup_adapter_test.py to soup_adapter_test.pyc
byte-compiling build/bdist.linux-x86_64/egg/gumbo/gumboc_test.py to gumboc_test.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying python/gumbo.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying python/gumbo.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying python/gumbo.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying python/gumbo.egg-info/not-zip-safe -> build/bdist.linux-x86_64/egg/EGG-INFO
copying python/gumbo.egg-info/pbr.json -> build/bdist.linux-x86_64/egg/EGG-INFO
copying python/gumbo.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
creating 'dist/gumbo-0.10.1-py2.7.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing gumbo-0.10.1-py2.7.egg
creating /usr/local/lib/python2.7/dist-packages/gumbo-0.10.1-py2.7.egg
Extracting gumbo-0.10.1-py2.7.egg to /usr/local/lib/python2.7/dist-packages
Adding gumbo 0.10.1 to easy-install.pth file

Installed /usr/local/lib/python2.7/dist-packages/gumbo-0.10.1-py2.7.egg
Processing dependencies for gumbo==0.10.1
Finished processing dependencies for gumbo==0.10.1
durr@bigsrv:~/gumbo-parser⟫ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gumbo
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gumbo-0.10.1-py2.7.egg/gumbo/__init__.py", line 33, in <module>
    from gumbo.gumboc import *
  File "/usr/local/lib/python2.7/dist-packages/gumbo-0.10.1-py2.7.egg/gumbo/gumboc.py", line 44, in <module>
    os.path.dirname(__file__), _name_of_lib))
  File "/usr/lib/python2.7/ctypes/__init__.py", line 443, in LoadLibrary
    return self._dlltype(name)
  File "/usr/lib/python2.7/ctypes/__init__.py", line 365, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/local/lib/python2.7/dist-packages/gumbo-0.10.1-py2.7.egg/gumbo/libgumbo.so: cannot open shared object file: No such file or directory
>>>

On python 3:

durr@bigsrv:~/gumbo-parser⟫ sudo python3 setup.py install
running install
Checking .pth file support in /usr/local/lib/python3.4/dist-packages/
/usr/bin/python3 -E -c pass
TEST PASSED: /usr/local/lib/python3.4/dist-packages/ appears to support .pth files
running bdist_egg
running egg_info
writing dependency_links to python/gumbo.egg-info/dependency_links.txt
writing python/gumbo.egg-info/PKG-INFO
writing top-level names to python/gumbo.egg-info/top_level.txt
reading manifest file 'python/gumbo.egg-info/SOURCES.txt'
writing manifest file 'python/gumbo.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/gumbo
copying build/lib/gumbo/html5lib_adapter.py -> build/bdist.linux-x86_64/egg/gumbo
copying build/lib/gumbo/gumboc.py -> build/bdist.linux-x86_64/egg/gumbo
copying build/lib/gumbo/soup_adapter.py -> build/bdist.linux-x86_64/egg/gumbo
copying build/lib/gumbo/__init__.py -> build/bdist.linux-x86_64/egg/gumbo
copying build/lib/gumbo/html5lib_adapter_test.py -> build/bdist.linux-x86_64/egg/gumbo
copying build/lib/gumbo/gumboc_tags.py -> build/bdist.linux-x86_64/egg/gumbo
copying build/lib/gumbo/soup_adapter_test.py -> build/bdist.linux-x86_64/egg/gumbo
copying build/lib/gumbo/gumboc_test.py -> build/bdist.linux-x86_64/egg/gumbo
byte-compiling build/bdist.linux-x86_64/egg/gumbo/html5lib_adapter.py to html5lib_adapter.cpython-34.pyc
byte-compiling build/bdist.linux-x86_64/egg/gumbo/gumboc.py to gumboc.cpython-34.pyc
byte-compiling build/bdist.linux-x86_64/egg/gumbo/soup_adapter.py to soup_adapter.cpython-34.pyc
byte-compiling build/bdist.linux-x86_64/egg/gumbo/__init__.py to __init__.cpython-34.pyc
byte-compiling build/bdist.linux-x86_64/egg/gumbo/html5lib_adapter_test.py to html5lib_adapter_test.cpython-34.pyc
byte-compiling build/bdist.linux-x86_64/egg/gumbo/gumboc_tags.py to gumboc_tags.cpython-34.pyc
byte-compiling build/bdist.linux-x86_64/egg/gumbo/soup_adapter_test.py to soup_adapter_test.cpython-34.pyc
byte-compiling build/bdist.linux-x86_64/egg/gumbo/gumboc_test.py to gumboc_test.cpython-34.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying python/gumbo.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying python/gumbo.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying python/gumbo.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying python/gumbo.egg-info/not-zip-safe -> build/bdist.linux-x86_64/egg/EGG-INFO
copying python/gumbo.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
creating 'dist/gumbo-0.10.1-py3.4.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing gumbo-0.10.1-py3.4.egg
creating /usr/local/lib/python3.4/dist-packages/gumbo-0.10.1-py3.4.egg
Extracting gumbo-0.10.1-py3.4.egg to /usr/local/lib/python3.4/dist-packages
Adding gumbo 0.10.1 to easy-install.pth file

Installed /usr/local/lib/python3.4/dist-packages/gumbo-0.10.1-py3.4.egg
Processing dependencies for gumbo==0.10.1
Finished processing dependencies for gumbo==0.10.1

durr@bigsrv:~⟫ python3
Python 3.4.3 (default, Oct 14 2015, 20:28:29)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gumbo
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.4/dist-packages/gumbo-0.10.1-py3.4.egg/gumbo/__init__.py", line 33, in <module>
    from gumbo.gumboc import *
  File "/usr/local/lib/python3.4/dist-packages/gumbo-0.10.1-py3.4.egg/gumbo/gumboc.py", line 29, in <module>
    import gumboc_tags
ImportError: No module named 'gumboc_tags'
>>>

{{{ moved array to fix that import issue }}}

>>> import gumbo
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/gumbo-0.10.1-py3.4.egg/gumbo/gumboc.py", line 198, in <module>
    os.path.dirname(__file__), '..', '..', '.libs', _name_of_lib))
  File "/usr/lib/python3.4/ctypes/__init__.py", line 429, in LoadLibrary
    return self._dlltype(name)
  File "/usr/lib/python3.4/ctypes/__init__.py", line 351, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/local/lib/python3.4/dist-packages/gumbo-0.10.1-py3.4.egg/gumbo/../../.libs/libgumbo.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.4/dist-packages/gumbo-0.10.1-py3.4.egg/gumbo/__init__.py", line 33, in <module>
    from gumbo.gumboc import *
  File "/usr/local/lib/python3.4/dist-packages/gumbo-0.10.1-py3.4.egg/gumbo/gumboc.py", line 202, in <module>
    os.path.dirname(__file__), _name_of_lib))
  File "/usr/lib/python3.4/ctypes/__init__.py", line 429, in LoadLibrary
    return self._dlltype(name)
  File "/usr/lib/python3.4/ctypes/__init__.py", line 351, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/local/lib/python3.4/dist-packages/gumbo-0.10.1-py3.4.egg/gumbo/libgumbo.so: cannot open shared object file: No such file or directory

I patched the import to gumboc_tags by just copying the contents of that file (it's just a single big array) into gumboc.py, then fixed the library search path issue (I just hardcoded the library path to "/usr/local/lib/libgumbo.so.1.0.0"), and it then imports, but gumbo.soup_parse (which is what I want) doesn't seem to be present:

>>> import gumbo
>>> gumbo
<module 'gumbo' from '/usr/local/lib/python3.4/dist-packages/gumbo-0.10.1-py3.4.egg/gumbo/__init__.py'>
>>> gumbo.soup_parse
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'soup_parse'
>>>

I also tried the version on PyPI, and it's non-functional after install on Python 3 (my app is Python 3; I tested Python 2 just to be thorough).

fake-name commented 8 years ago

Ok, at least part of this is https://github.com/google/gumbo-parser/pull/343, and from . import gumboc_tags fixes that (I could swear I tested that before just moving the contents of gumboc_tags).

The fix suggested there does not work:

>>> import gumbo
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.4/dist-packages/gumbo/__init__.py", line 33, in <module>
    from gumbo.gumboc import *
  File "/usr/local/lib/python3.4/dist-packages/gumbo/gumboc.py", line 29, in <module>
    from . import gumboc_tags
ImportError: cannot import name 'gumboc_tags'

Anyways, the library path search is still completely broken. What system did a normally installed gumbo library ever work on?
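
For reference, the tracebacks above show the binding only trying paths relative to the installed package. A minimal sketch of a more forgiving lookup follows. It assumes libgumbo was installed where the system loader can find it (for example /usr/local/lib with the loader cache refreshed), and `candidate_paths` and `load_shared_library` are hypothetical helper names, not part of gumbo's Python package.

```python
import ctypes
import ctypes.util
import os


def candidate_paths(file_name, search_dirs):
    """Return the paths for file_name that actually exist in search_dirs."""
    paths = [os.path.join(d, file_name) for d in search_dirs]
    return [p for p in paths if os.path.exists(p)]


def load_shared_library(short_name, search_dirs=()):
    """Load lib<short_name>: explicit directories first, then the system loader.

    short_name is the library name without prefix or suffix, e.g. "gumbo".
    """
    file_name = "lib%s.so" % short_name  # Linux naming; an assumption
    tried = []
    for path in candidate_paths(file_name, search_dirs):
        tried.append(path)
        try:
            return ctypes.CDLL(path)
        except OSError:
            continue  # present but not loadable, try the next candidate
    # find_library uses the platform's usual lookup (ldconfig on Linux).
    found = ctypes.util.find_library(short_name)
    if found is not None:
        return ctypes.CDLL(found)
    raise OSError(
        "could not locate lib%s (tried: %s)" % (short_name, tried or "no local paths")
    )
```

With the install shown earlier, the call would look like `load_shared_library("gumbo", ["/usr/local/lib"])`. Falling back to the system loader avoids hardcoding a versioned file name such as libgumbo.so.1.0.0.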

fake-name commented 8 years ago

After jumping through the asinine build process for the google test framework, the tests are failing:

durr@bigsrv:~/gumbo-parser⟫ make check
make  gumbo_test
make[1]: Entering directory `/home/durr/gumbo-parser'
  CXX      tests/gumbo_test-attribute.o
  CXX      tests/gumbo_test-char_ref.o
  CXX      tests/gumbo_test-parser.o
  CXX      tests/gumbo_test-string_buffer.o
  CXX      tests/gumbo_test-string_piece.o
  CXX      tests/gumbo_test-tokenizer.o
  CXX      tests/gumbo_test-test_utils.o
  CXX      tests/gumbo_test-utf8.o
  CXX      tests/gumbo_test-vector.o
  CXXLD    gumbo_test
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../lib/libgtest.a(gtest-all.cc.o): In function `testing::internal::UnitTestImpl::UnitTestImpl(testing::UnitTest*)':
gtest-all.cc:(.text+0xfc5f): undefined reference to `pthread_key_create'
gtest-all.cc:(.text+0xfe59): undefined reference to `pthread_key_create'
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../lib/libgtest.a(gtest-all.cc.o): In function `testing::internal::ThreadLocal<testing::TestPartResultReporterInterface*>::~ThreadLocal()':
gtest-all.cc:(.text._ZN7testing8internal11ThreadLocalIPNS_31TestPartResultReporterInterfaceEED2Ev[_ZN7testing8internal11ThreadLocalIPNS_31TestPartResultReporterInterfaceEED5Ev]+0xb): undefined reference to `pthread_getspecific'
gtest-all.cc:(.text._ZN7testing8internal11ThreadLocalIPNS_31TestPartResultReporterInterfaceEED2Ev[_ZN7testing8internal11ThreadLocalIPNS_31TestPartResultReporterInterfaceEED5Ev]+0x20): undefined reference to `pthread_key_delete'
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../lib/libgtest.a(gtest-all.cc.o): In function `testing::internal::ThreadLocal<std::vector<testing::internal::TraceInfo, std::allocator<testing::internal::TraceInfo> > >::~ThreadLocal()':
gtest-all.cc:(.text._ZN7testing8internal11ThreadLocalISt6vectorINS0_9TraceInfoESaIS3_EEED2Ev[_ZN7testing8internal11ThreadLocalISt6vectorINS0_9TraceInfoESaIS3_EEED5Ev]+0x12): undefined reference to `pthread_getspecific'
gtest-all.cc:(.text._ZN7testing8internal11ThreadLocalISt6vectorINS0_9TraceInfoESaIS3_EEED2Ev[_ZN7testing8internal11ThreadLocalISt6vectorINS0_9TraceInfoESaIS3_EEED5Ev]+0x28): undefined reference to `pthread_key_delete'
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../lib/libgtest.a(gtest-all.cc.o): In function `testing::internal::ThreadLocal<std::vector<testing::internal::TraceInfo, std::allocator<testing::internal::TraceInfo> > >::GetOrCreateValue() const':
gtest-all.cc:(.text._ZNK7testing8internal11ThreadLocalISt6vectorINS0_9TraceInfoESaIS3_EEE16GetOrCreateValueEv[_ZNK7testing8internal11ThreadLocalISt6vectorINS0_9TraceInfoESaIS3_EEE16GetOrCreateValueEv]+0x16): undefined reference to `pthread_getspecific'
gtest-all.cc:(.text._ZNK7testing8internal11ThreadLocalISt6vectorINS0_9TraceInfoESaIS3_EEE16GetOrCreateValueEv[_ZNK7testing8internal11ThreadLocalISt6vectorINS0_9TraceInfoESaIS3_EEE16GetOrCreateValueEv]+0x171): undefined reference to `pthread_setspecific'
/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../lib/libgtest.a(gtest-all.cc.o): In function `testing::internal::ThreadLocal<testing::TestPartResultReporterInterface*>::GetOrCreateValue() const':
gtest-all.cc:(.text._ZNK7testing8internal11ThreadLocalIPNS_31TestPartResultReporterInterfaceEE16GetOrCreateValueEv[_ZNK7testing8internal11ThreadLocalIPNS_31TestPartResultReporterInterfaceEE16GetOrCreateValueEv]+0xe): undefined reference to `pthread_getspecific'
gtest-all.cc:(.text._ZNK7testing8internal11ThreadLocalIPNS_31TestPartResultReporterInterfaceEE16GetOrCreateValueEv[_ZNK7testing8internal11ThreadLocalIPNS_31TestPartResultReporterInterfaceEE16GetOrCreateValueEv]+0xbb): undefined reference to `pthread_setspecific'
collect2: error: ld returned 1 exit status
make[1]: *** [gumbo_test] Error 1
make[1]: Leaving directory `/home/durr/gumbo-parser'
make: *** [check-am] Error 2

It looks like something, somewhere, is failing to link libpthread.

I have no idea where to look with the build process, though. I'm only vaguely familiar with make, and have no idea about autoconf.

neumond commented 8 years ago

Yes, the Python part doesn't work under py3.4. Multiple import errors, shitty formatting style, wrong C library loading code, and multiple errors when trying to parse something using the adapters. And these errors seem to be version-independent, not specific to py3.

kevinhendricks commented 8 years ago

Sigil has a fixed version for python 3.4 along with a new beautiful soup 4 adapter that works with the version of the gumbo parser that has been specially modified for use inside Sigil. I am sure it could be easily fixed/adapted for the official gumbo parser. Let me know if you need or want a copy and I'll take a shot at adapting what we have to work here.

neumond commented 8 years ago

@kevinhendricks

Sigil

Always wondered how can I find these specific about-demonic pictograms. Are you talking about this code: https://github.com/Sigil-Ebook/sigil-gumbo/tree/master/python/gumbo ? Personally I'm interested in having a working html5lib adapter, so thanks for pointing to a possibly working version. A quick inspection didn't make me happy, though: it still has broken imports that require PYTHONPATH changes to work. Since it has __init__.py, it is trying to be a Python package, so it must use relative imports to behave well.
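
To illustrate the relative-import point: on Python 3, a bare `import gumboc_tags` inside a package is resolved as a top-level module, so it fails unless the package directory itself is on the path. The sketch below builds a throwaway package in a temporary directory and imports it using an explicit relative import. The package and module names (`demo_binding`, `core`, `tags`) are made up for the demonstration and are not gumbo's.

```python
import importlib
import os
import sys
import tempfile

PACKAGE = "demo_binding"  # hypothetical package name

# Three tiny modules: the package init, a data module, and a module that
# reaches its sibling through an explicit relative import.
FILES = {
    "__init__.py": "from demo_binding.core import *\n",
    "tags.py": "TAG_NAMES = ['html', 'head', 'body']\n",
    "core.py": (
        "from . import tags  # explicit relative import\n"
        "__all__ = ['first_tag']\n"
        "def first_tag():\n"
        "    return tags.TAG_NAMES[0]\n"
    ),
}


def import_demo_package():
    """Write the package to a temp dir, import it, and return the module."""
    with tempfile.TemporaryDirectory() as root:
        pkg_dir = os.path.join(root, PACKAGE)
        os.makedirs(pkg_dir)
        for name, body in FILES.items():
            with open(os.path.join(pkg_dir, name), "w") as fh:
                fh.write(body)
        # Only the parent directory goes on sys.path, never the package dir.
        sys.path.insert(0, root)
        try:
            importlib.invalidate_caches()
            return importlib.import_module(PACKAGE)
        finally:
            sys.path.remove(root)
```

Calling `import_demo_package().first_tag()` exercises the sibling import without any PYTHONPATH changes.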

kevinhendricks commented 8 years ago

That version is for specific use inside of Sigil's plugin python 2.7 and python3.4 environment. It is set up to work with BS4 (also used internally by Sigil) not html5lib but the bulk of it should be adaptable.

kevinhendricks commented 8 years ago

If it helps the sigil-python code we actually deploy to interface to gumbo is here:

https://github.com/Sigil-Ebook/Sigil/tree/master/src/Resource_Files/plugin_launchers/python

See:

sigil_gumboc.py
sigil_gumboc_tags.py
sigil_gumbo_bs4_adapter.py

neumond commented 8 years ago

I'm rewriting the binding at https://github.com/neumond/scutigera/tree/master/scutigera/gumbo and decided to throw away the overly complex Enum class, leaving an almost pure C interface in gumboc. I also wiped the gumboc_tags file; there's no need to keep it, since the tag names can be obtained dynamically from the dll. Now I'm trying to write an adapter that conforms to the html5lib test suite. It seems gumbo fails several tests, and I need some time to investigate these issues better. Most of the tests pass, though.

Is it worth using CFFI or Cython instead of ctypes? PyPy recommends CFFI (it looks like they bet on the JIT to optimize the interaction), while Cython is advertised in multiple articles across the internet as the fastest possible solution after a pure CPython extension.

neumond commented 8 years ago

Hmm. Now the question is what encoding gumbo uses internally. It accepts a buffer of bytes. Some of the tests fail when decoding the output as utf-8. Even if it treats input as ascii, which is enough for HTML parsing, it still needs some encoding choice to decipher html entities like &#x04521;

kevinhendricks commented 8 years ago

gumbo only works with properly utf-8 encoded html files. If an html file has any other encoding, it must be converted to utf-8 before being parsed by gumbo. See the readme on this site for details. Also, the source being parsed must stay in memory, because the parsed tree holds pointers into the original source.
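
A minimal sketch of those two requirements on the Python side, assuming the caller already knows the source encoding. `to_utf8` and `ParseInput` are hypothetical names, and no gumbo function is called here; the class only shows one way to keep the UTF-8 buffer referenced for as long as a parse tree would need it.

```python
import ctypes


def to_utf8(raw, source_encoding):
    """Re-encode bytes from source_encoding to UTF-8 (strict: bad input raises)."""
    return raw.decode(source_encoding).encode("utf-8")


class ParseInput(object):
    """Owns the UTF-8 buffer that a C parser would point into.

    Keep this object alive for as long as any tree built from it is in use;
    if the buffer were garbage collected, pointers into it would dangle.
    """

    def __init__(self, raw, source_encoding="utf-8"):
        data = to_utf8(raw, source_encoding)
        self.length = len(data)
        # create_string_buffer copies the bytes into memory owned by ctypes
        # and appends a trailing NUL byte.
        self.buffer = ctypes.create_string_buffer(data)
```

For example, `ParseInput(b"caf\xe9", "latin-1")` holds the five UTF-8 bytes of "café" plus the terminator.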

neumond commented 8 years ago

it must be converted to utf-8 before being parsed by gumbo

Exactly as I supposed. Well, for the test FOO&#111111111111 gumbo over the ctypes binding outputs a text node with b'FOO\xc7'. If I hack the prettyprint example to output char codes:

      // line 195
      for (int j = 0; j < 8; j++){
          std::cout << ((int) child->v.text.text[j]) << std::endl;
      }

It's b'FOO\xc7\n' (70,79,79,-57,10,0). Using html5lib parser I get unicode point 65533 or b'FOO\xef\xbf\xbd' in utf-8.

Magic.

By the way, how can I check html5lib test suite, some tests look unreasonable for me. Id est gumbo works properly and html5lib test expects wrong, e.g. for noscript tag test.

UPD. It is an interesting character (65533) http://www.fileformat.info/info/unicode/char/0fffd/index.htm

REPLACEMENT CHARACTER used to replace an incoming character whose value is unknown or unrepresentable in Unicode

UPD2. Very interesting :)

b'FOO\xc7'.decode('utf-8', errors='replace').encode('utf-8')
b'FOO\xef\xbf\xbd'
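
The pieces above can be checked in a few lines of Python. This sketch rebuilds the bytes from the signed char codes printed by the hacked prettyprint example, confirms they are not valid UTF-8, and shows that lenient decoding substitutes U+FFFD. The helper name `is_valid_utf8` is only for illustration.

```python
def is_valid_utf8(data):
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Signed char -57 is the same byte as unsigned 0xC7 (256 - 57 = 199).
raw = bytes([70, 79, 79, -57 & 0xFF])

# 0xC7 announces a two-byte sequence, but no continuation byte follows,
# so strict decoding fails and errors="replace" inserts U+FFFD instead.
repaired = raw.decode("utf-8", errors="replace")
```

Here `ord(repaired[-1])` is 65533, which matches the code point html5lib reports.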

UPD3. Now it's better.

168 failed, 15944 passed, 3056 skipped
124 failed, 15988 passed, 3056 skipped

gsnedders commented 8 years ago

b'FOO\xc7'.decode('utf-8', errors='replace').encode('utf-8')

As far as I'm aware, Gumbo's output shouldn't ever be invalid UTF-8. Certainly, per spec, Gumbo should be outputting U+FFFD for that, and definitely shouldn't be outputting something broken!

By the way, how can I check html5lib test suite, some tests look unreasonable for me. Id est gumbo works properly and html5lib test expects wrong, e.g. for noscript tag test.

A good starting point nowadays is to look at what your favourite browser does in the Live DOM Viewer, though that doesn't work in the case of noscript tests with #script-off. Otherwise, uh, the best options are likely either asking in #whatwg on freenode or filing a bug on html5lib-tests. Or just ping me here.

neumond commented 8 years ago

Gumbo should be outputting U+FFFD for that

Ok, I think its time to dig gumbo code to repair this. Who knows maybe in some cases gumbo will output valid utf8 where it must output replacements.

Regarding noscript, that's one of the obscure things I didn't know about. Consider this example:

<p id="status"><noscript><strong>A</strong></noscript><span>B</span></p>

If I inspect the DOM in Firefox with JavaScript turned on, I get a <noscript> node with inner text "<strong>A</strong>". But if I disable JavaScript, I get a separate <strong> node with text "A". It looks like gumbo behaves as if JavaScript were disabled, but the html5lib test expects the result with JavaScript enabled. There should be a parameter for the parser, I guess.

kevinhendricks commented 8 years ago

So if you use a numeric entity like & # 111111111111 ; (which takes a minimum of 5 bytes to even represent as hex) or any other illegal Unicode code point, the spec says to output U+FFFD? Is that right?

I know gumbo does output proper utf-8 encoded values for legal numeric entities.

For example: & # x F F F D ; results in the proper utf-8 byte string of 0xEF 0xBF 0xBD in the serialized output.

kevinhendricks commented 8 years ago

The problem is overflow of an int type in src/char_ref.c

  int codepoint = 0;
  bool status = true;
  do {
    codepoint = (codepoint * (is_hex ? 16 : 10)) + digit;
    utf8iterator_next(input);
    digit = parse_digit(utf8iterator_current(input), is_hex);
  } while (digit != -1);

  if (utf8iterator_current(input) != ';') {
    add_codepoint_error(
        parser, input, GUMBO_ERR_NUMERIC_CHAR_REF_WITHOUT_SEMICOLON, codepoint);
    status = false;
  } else {
    utf8iterator_next(input);
  }

  int replacement = maybe_replace_codepoint(codepoint);
  if (replacement != -1) {
    add_codepoint_error(
        parser, input, GUMBO_ERR_NUMERIC_CHAR_REF_INVALID, codepoint);
    *output = replacement;
    return false;
  }

  if ((codepoint >= 0xd800 && codepoint <= 0xdfff) || codepoint > 0x10ffff) {
    add_codepoint_error(
        parser, input, GUMBO_ERR_NUMERIC_CHAR_REF_INVALID, codepoint);
    *output = 0xfffd;
    return false;
  }

Before each iteration adds the next digit, the loop needs to check for and prevent overflow of the codepoint value (anything greater than 0x10ffff), while still consuming the rest of the bad numeric entity until it reaches a non-digit.

If you look at 111111111111 as hex (0x19debd01c7), it needs 5 bytes and so overflows an int type; the lowest byte (0xc7) is the one you are seeing in the output.
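The arithmetic can be reproduced outside gumbo. What follows is only a sketch in Python, assuming a 32-bit two's-complement int that wraps on overflow (signed overflow is formally undefined in C, so this models what typically happens, not what is guaranteed). The helper names `wrap_int32` and `accumulate_decimal` are made up for illustration and are not part of gumbo.

```python
def wrap_int32(value):
    """Reduce value to the signed 32-bit range, mimicking typical int wraparound."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def accumulate_decimal(digits):
    """Mirror the unguarded loop: codepoint = codepoint * 10 + digit, wrapping at 32 bits."""
    codepoint = 0
    for ch in digits:
        codepoint = wrap_int32(codepoint * 10 + int(ch))
    return codepoint


full_value = int("111111111111")
wrapped = accumulate_decimal("111111111111")

print(hex(full_value))      # 0x19debd01c7
print(wrapped)              # -558038585
print(hex(wrapped & 0xFF))  # 0xc7
```

Under that assumption the wrapped value is negative, so it is neither in the surrogate range nor greater than 0x10ffff, and the range check is skipped.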

The overflow prevents this snippet of code from working:

  if ((codepoint >= 0xd800 && codepoint <= 0xdfff) || codepoint > 0x10ffff) {
    add_codepoint_error(
        parser, input, GUMBO_ERR_NUMERIC_CHAR_REF_INVALID, codepoint);
    *output = 0xfffd;
    return false;
  }

The problem is that char_ref.c is generated from char_ref.rl (a Ragel source, used for speed), so once a proper fix is made, it will have to be applied to char_ref.rl and char_ref.c regenerated.

Hope this helps.

kevinhendricks commented 8 years ago

FWIW, since 0x10ffff * 16 easily fits inside an int, we do not need to catch int overflow as such; we just need to stop accumulating the first time the value exceeds 0x10ffff, but keep parsing until a non-digit. The final snippet (see above) will take care of the rest.

So this patch in char_ref.c did the trick for me:

--- char_ref.c.keep 2016-04-20 10:42:38.000000000 -0400
+++ char_ref.c  2016-04-20 10:48:08.000000000 -0400
@@ -166,8 +166,10 @@

   int codepoint = 0;
   bool status = true;
+  bool bad_value = false;
   do {
-    codepoint = (codepoint * (is_hex ? 16 : 10)) + digit;
+    if (!bad_value) codepoint = (codepoint * (is_hex ? 16 : 10)) + digit;
+    bad_value = codepoint > 0x10ffff;
     utf8iterator_next(input);
     digit = parse_digit(utf8iterator_current(input), is_hex);
   } while (digit != -1);

gsnedders commented 8 years ago

So if you use a numeric entity like & # 111111111111 ; (which takes a minimum of 5 bytes to even represent as hex) or any other illegal Unicode code point, the spec says to output U+FFFD? Is that right?

That's right.

FWIW, that bug looks almost identical to the Gecko bug that led to those tests being written; the fix LGTM.

OK, I think it's time to dig into the gumbo code to repair this. Who knows, maybe in some cases gumbo outputs valid UTF-8 where it must output replacements.

I'd strongly encourage you to use .decode("utf8", "strict"), catch UnicodeDecodeError, and treat it as a test failure, because any case where we have invalid UTF-8 is a bug in Gumbo.
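A minimal sketch of that suggestion, assuming the test harness has the parser's text available as raw bytes. The function name `decode_parser_output` is hypothetical and not part of gumbo or html5lib.

```python
def decode_parser_output(raw):
    """Decode bytes returned by the parser, failing loudly on invalid UTF-8."""
    try:
        return raw.decode("utf-8", "strict")
    except UnicodeDecodeError as exc:
        # Invalid UTF-8 is treated as a test failure, not silently replaced.
        raise AssertionError("parser produced invalid UTF-8: %r" % (raw,)) from exc
```

With this, b'FOO\xef\xbf\xbd' decodes to 'FOO\ufffd', while b'FOO\xc7' raises AssertionError instead of being papered over by errors='replace'.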

It looks like gumbo behaves as if JavaScript were disabled, but the html5lib test expects the result with JavaScript enabled.

If there's a test that expects the script enabled parsing, it should have the #script-on flag (see https://github.com/html5lib/html5lib-tests/blob/master/tree-construction/README.md for documentation). If there's one that doesn't have it, please send a PR adding it! (The testsuite is predominantly run with scripting enabled, so I wouldn't be surprised if some tests were missing the needed flags.)

kevinhendricks commented 8 years ago

A simpler patch might be to remove bad_value and simply test if codepoint <= 0x10ffff before scaling the codepoint and adding the digit.

Either way, once the value exceeds 0x10ffff it stops updating, which prevents any overflow.
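The simpler variant can be sketched in Python to check the logic. This is an illustration only, not the actual char_ref.c code: the function name `parse_numeric_reference` and the constants are made up here, and the `maybe_replace_codepoint` step from the original is left out.

```python
MAX_CODEPOINT = 0x10FFFF
REPLACEMENT = 0xFFFD


def parse_numeric_reference(digits, is_hex=False):
    """Accumulate digits, but stop updating once the value exceeds 0x10ffff."""
    base = 16 if is_hex else 10
    codepoint = 0
    for ch in digits:  # every digit is still consumed
        if codepoint <= MAX_CODEPOINT:
            codepoint = codepoint * base + int(ch, base)
    # Same final range check as the snippet quoted from char_ref.c.
    if 0xD800 <= codepoint <= 0xDFFF or codepoint > MAX_CODEPOINT:
        return REPLACEMENT
    return codepoint
```

The largest value the loop can ever hold is 0x10ffff * 16 + 15, which fits comfortably in a 32-bit int, so the final range check always sees a meaningful number.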

neumond commented 8 years ago

Trying to add a script parameter to html5lib.

[SOLVED. Found the .pytest.expect file, see https://github.com/gsnedders/pytest-expect] Somehow all tests with script-off are masked as xfail. I commented out the DataLossWarning try/except handler and multiple tests started to fail, but the script-off ones are still xfail-masked. Even grepping for 'xfail' didn't help; there's no such text in the whole project.

Command to run tests ignoring expect plugin: python -m pytest -s -p no:expect -m "ElementTree and namespaced" html5lib/tests/

neumond commented 8 years ago

23593 passed, 3064 skipped, 573 xfailed, 48 xpassed

Looks like 48 tests are working correctly now. How far I have gone just to use gumbo in my project...

neumond commented 8 years ago

Regarding a pull request for gumbo-parser: I don't know. I have a version that works well with py3 and html5lib only. At least now I can import and use it, and it can use a system-wide gumbo installation. I guess @kevinhendricks has a good implementation for Beautiful Soup, though I'm not sure whether it's py2 or py3. It has many changes, including testing through a drop-in replacement of the native html5lib parser and removal of gumboc_tags.py.

@nostrademons: what do you require for such a PR? Do you require py2 and the existing importing scheme?

kevinhendricks commented 8 years ago

The Beautiful Soup 4 adapter is Python 3.

kevinhendricks commented 8 years ago

FWIW, to fix the issue of numeric overflow preventing invalid numeric entities (such as & # 111111111111 ;) from being detected as errors, Sigil's gumbo includes the following changes in our older version of these files:

https://github.com/Sigil-Ebook/Sigil/commit/4a7afd4d216b14faea2135bb1ac105175bc8ae05

Nick-Alam commented 7 years ago

I fixed this problem by creating a symlink named gumbo/libgumbo.so that points to libgumbo.so.1.0.0. By default, libgumbo.so.1.0.0 can be found in /usr/local/lib/, so:

ln -s /usr/local/lib/libgumbo.so.1.0.0 /usr/local/lib/python2.7/dist-packages/gumbo-0.10.1-py2.7.egg/gumbo/libgumbo.so

rob70 commented 6 years ago

Nick-Alam's solution worked for me on Ubuntu 14.04. pydoc3 gumbo lists the files in the gumbo package as: bs4_adapter, bs4_adapter_test, gumboc, gumboc_tags, gumboc_test, html5lib_adapter, html5lib_adapter_test, libgumbo, soup_adapter, soup_adapter_test. All of these are installed by the python3 setup.py install script except libgumbo.so.
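For anyone who would rather script the symlink workaround than type the ln -s command by hand, here is a rough Python equivalent. It is only a sketch: the helper name `link_libgumbo` is made up, and the library and package paths are assumptions that vary by system and Python version.

```python
import os


def link_libgumbo(library_path, package_dir, link_name="libgumbo.so"):
    """Create package_dir/link_name as a symlink to library_path, if missing."""
    link_path = os.path.join(package_dir, link_name)
    if not os.path.lexists(link_path):
        os.symlink(library_path, link_path)
    return link_path


# Example call with the paths from the comment above (usually needs root):
# link_libgumbo(
#     "/usr/local/lib/libgumbo.so.1.0.0",
#     "/usr/local/lib/python2.7/dist-packages/gumbo-0.10.1-py2.7.egg/gumbo",
# )
```

The lexists check makes the call safe to repeat; it leaves an existing file or link untouched, so a stale link has to be removed by hand.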