Open divergentdave opened 6 years ago
More info: I can reproduce the issue when using lxml directly, without BeautifulSoup (script below). Thus far I haven't been able to reproduce the issue when building from source, only when using the released wheels.
#!/usr/bin/env python
import sys

from lxml import etree

# Parse the saved sample with lxml's HTML parser, bypassing BeautifulSoup.
with open("tigta_utf8.txt", "r", encoding="utf-8") as f:
    parser = etree.HTMLParser()
    parser.feed(f.read())
    doc = parser.close()

# Serialize the tree again and check whether the output has ballooned.
s = etree.tostring(doc)
print("round-tripped length:", len(s))
print("found spaces", b"b o d y" in s)
if len(s) > 100000:
    print("bad, returning error exit code")
    sys.exit(1)
else:
    print("good, returning success exit code")
By building with make wheel_manylinux64, I was able to get the above test case to succeed and fail on different versions. I bisected it, and the first bad commit is https://github.com/lxml/lxml/commit/9366980b16de135ebb213bc8cf3c5e499968b622, "Use latest libxml2 version in binary wheels of next release." This bumps the variable MANYLINUX_LIBXML2_VERSION from 2.9.7 to 2.9.8. It seems the issue is a layer deeper, in libxml2 itself, and will require another round of bisecting.
This issue looks a lot like this Chromium bug https://bugs.chromium.org/p/chromium/issues/detail?id=820163#c39, which has been addressed in the development branch of libxml2. For now, I'll use lxml 4.1.1 as a workaround, and we can update once the various projects cut new releases.
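Since the bisect points at the libxml2 bump rather than at lxml's own code, it may help to check which libxml2 a given lxml install actually carries. The following is a minimal sketch, not a definitive test: the helper name is_suspect_libxml2 is made up for illustration, and it flags only 2.9.8, the one version identified above. Whether other libxml2 releases are affected has not been established here.

```python
def is_suspect_libxml2(version):
    """Return True for libxml2 2.9.8, the only version the bisect implicates.

    `version` is a tuple such as (2, 9, 8). This is a hypothetical helper
    for illustration; it says nothing about other libxml2 releases.
    """
    return tuple(version[:3]) == (2, 9, 8)


if __name__ == "__main__":
    try:
        from lxml import etree
    except ImportError:
        print("lxml is not installed")
    else:
        # LIBXML_VERSION is the library loaded at runtime;
        # LIBXML_COMPILED_VERSION is the one lxml was built against.
        for label, version in (
            ("runtime", etree.LIBXML_VERSION),
            ("compiled", etree.LIBXML_COMPILED_VERSION),
        ):
            note = "suspect" if is_suspect_libxml2(version) else "not flagged"
            print(label, "libxml2", ".".join(map(str, version)), note)
```

Running this under both the released wheel and a source build should show whether the two differ in their bundled libxml2.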
Great detective work on this. That sounds like the right approach to me.
I ran into a weird error with the TIGTA scraper, and I was only able to reproduce it locally after upgrading lxml from 4.0.0 to 4.2.1. The page https://www.treasury.gov/tigta/publications_congress.shtml contains ISO-8859-1/windows-1252 text, and we're already decoding it correctly. When the decoded text goes through BeautifulSoup/lxml, though, it comes back out corrupted. If I run the snippet below and then print out text, it looks okay, but if I print out doc, the beginning and end look reasonable while a large section of the middle has three spaces inserted between each character. I suspect an internal UTF-32 representation is getting mishandled somewhere along the line. I've saved a sample locally in case we need to reproduce this later. I'll poke around the lxml and libxml2 issue trackers, or maybe bisect library versions, and see if I can get to the bottom of this.
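To show why three inserted spaces per character points toward UTF-32, here is a small sketch using only the standard library. It does not reproduce the bug; it only illustrates that ASCII text encoded as UTF-32 carries three padding bytes per character, which would look like the observed pattern if those bytes were turned into spaces. The helper looks_spaced_out and its threshold of eight characters are assumptions made for this illustration.

```python
import re

# Each ASCII character becomes one byte followed by three zero bytes
# in little-endian UTF-32.
raw = "body".encode("utf-32-le")
assert len(raw) == 16

# If the padding bytes were rendered as spaces, the text would look like this.
spaced = raw.replace(b"\x00", b" ")
print(spaced)


def looks_spaced_out(data, min_run=8):
    """Heuristic: is there a run of at least `min_run` non-space bytes,
    each followed by exactly three spaces? (Hypothetical helper.)"""
    pattern = rb"(?:[^ ] {3}(?! )){%d,}" % min_run
    return re.search(pattern, data) is not None
```

A check like looks_spaced_out could stand in for the length threshold in a reproduction script, though the eight-character cutoff is arbitrary.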