unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

[tigta] Parsing issue with lxml=4.2.1 #308

Open divergentdave opened 6 years ago

divergentdave commented 6 years ago

I ran into a weird error with the TIGTA scraper, and I was only able to reproduce it locally after upgrading lxml from 4.0.0 to 4.2.1. The page https://www.treasury.gov/tigta/publications_congress.shtml contains ISO-8859-1/windows-1252 text, and we're already decoding it correctly. When the decoded text goes through BeautifulSoup/lxml, though, it comes back out corrupted. If I run the snippet below and then print out text, it looks fine, but if I print out doc, the beginning and end look reasonable while a large section of the middle has three spaces inserted between each character. I suspect an internal UTF-32 representation is getting mishandled somewhere along the line. I've saved a sample locally in case we need to reproduce this later. I'll poke around the lxml and libxml2 issue trackers, or maybe bisect library versions, and see if I can get to the bottom of this.

>>> from bs4 import BeautifulSoup
>>> URL = "https://www.treasury.gov/tigta/publications_congress.shtml"
>>> import requests
>>> resp = requests.get(URL)
>>> resp.encoding = "iso-8859-1"
>>> text = resp.text
>>> doc = BeautifulSoup(text, "lxml")
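As a sanity check independent of lxml, the same decoded text can also be run through Python's stdlib html.parser, which is pure Python and doesn't share lxml's C-level encoding path. A minimal sketch (TextExtractor and extract_text are names made up for this example):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

# Comparing this output against the text lxml extracts from the same
# input would make the inserted-spaces corruption obvious.
print(extract_text("<html><body>hello world</body></html>"))
```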
divergentdave commented 6 years ago

More info:

divergentdave commented 6 years ago

I can reproduce the issue when using lxml directly, without BeautifulSoup (script below). Thus far I haven't been able to reproduce the issue when building from source, only when using the released wheels.

#!/usr/bin/env python
import sys

from lxml import etree

# Feed the already-decoded page text to lxml's HTML feed parser.
with open("tigta_utf8.txt", "r", encoding="utf-8") as f:
    parser = etree.HTMLParser()
    parser.feed(f.read())
    doc = parser.close()

# Serializing back to bytes exposes the corruption: on affected versions,
# a large run of the document comes back with three spaces inserted
# between characters, which also inflates the output length.
s = etree.tostring(doc)
print("round-tripped length:", len(s))
print("found spaces:", b"b   o   d   y" in s)
if len(s) > 100000:
    print("bad, returning error exit code")
    sys.exit(1)
else:
    print("good, returning success exit code")
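The b"b   o   d   y" probe above is specific to one word; a more general check for the three-spaces-between-characters pattern could look like this (a sketch, not part of the original script; has_spaced_run is a hypothetical helper name):

```python
import re

def has_spaced_run(s: bytes, min_chars: int = 4) -> bool:
    # Look for runs like b"b   o   d   y": at least `min_chars`
    # non-space characters, each separated by exactly three spaces.
    pattern = rb"(?:\S   ){%d}\S" % (min_chars - 1)
    return re.search(pattern, s) is not None

print(has_spaced_run(b"<html><b   o   d   y>"))
print(has_spaced_run(b"<html><body>normal text</body></html>"))
```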
divergentdave commented 6 years ago

By building with make wheel_manylinux64, I was able to get the above test case to succeed and fail on different versions. I bisected it, and the first bad commit is https://github.com/lxml/lxml/commit/9366980b16de135ebb213bc8cf3c5e499968b622, "Use latest libxml2 version in binary wheels of next release." This bumps the variable MANYLINUX_LIBXML2_VERSION from 2.9.7 to 2.9.8. It seems the issue is a layer deeper, in libxml2 itself, and will require another round of bisecting.

divergentdave commented 6 years ago

This issue looks a lot like this Chromium bug https://bugs.chromium.org/p/chromium/issues/detail?id=820163#c39, which has been addressed in the development branch of libxml2. For now, I'll use lxml 4.1.1 as a workaround, and we can update once the various projects cut new releases.
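For the record, the pin can live in the scraper's Python requirements; a sketch, assuming a pip-style requirements.txt:

```
lxml==4.1.1
```

Per the bisect above, the regression arrived with the libxml2 2.9.8 bump in lxml's binary wheels, so this pin simply keeps us on a wheel built against the older libxml2.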

konklone commented 6 years ago

Great detective work on this. That sounds like the right approach to me.