str/bytes from requests response interaction with lxml

gugek commented 7 years ago

Is there a reason that the utf-8 (str/unicode) text is being used in the XML property/method in response.py?

I'm finding lxml is having problems with unicode strings (bytes in py3) that include external entities or non-ascii/latin characters.

Here is an example record: there are right single quotes \u2019 embedded in there.

<record>
    <header>
        <identifier>oai:scholarship.law.duke.edu:dlj-3910</identifier>
        <datestamp>2017-10-04T15:03:25Z</datestamp>
        <setSpec>publication:journals</setSpec>
        <setSpec>publication:dlj</setSpec>
    </header>
    <metadata>
        <oai_dc:dc
            xmlns:oai_dc="https://www.openarchives.org/OAI/2.0/oai_dc/"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="https://www.openarchives.org/OAI/2.0/oai_dc/ https://www.openarchives.org/OAI/2.0/oai_dc.xsd">
            <dc:title>Rule 24 Notwithstanding: Why Article III Should Not Limit Intervention of Right</dc:title>
            <dc:creator>Ferguson, Zachary N.</dc:creator>
            <dc:description>The Supreme Court recently decided in Town of Chester v. Laroe Estates, Inc. that intervenors of right under Federal Rule of Civil Procedure 24(a)(2) must demonstrate independent Article III standing when they pursue relief different from that requested by an original plaintiff. This decision resolved, in part, a decades-long controversy among the Courts of Appeals over the proper relationship between Rule 24 intervention and Article III standing that the Court first acknowledged in Diamond v. Charles. But the Court’s narrow decision in Town of Chester hardly disposed of the controversy, and Courts of Appeals are still free to require standing of defendant-intervenors and, it stands to reason, plaintiff-intervenors even if they do not pursue different relief. With this debate yet unresolved, this Note takes a less conventional approach. In addition to arguing that the Supreme Court’s precedents implicitly resolved this question before Town of Chester, this Note argues that the nature of judicial decisions raises two concerns that a liberal application of Rule 24(a)(2) would mitigate. First, this Note argues that stare decisis limits the right of litigants to be heard on the merits of their claims and defenses in a way that undermines the principles of due process. Second, this Note argues that the process of judicial decisionmaking is fraught with potential epistemic problems that can produce suboptimal legal rules. After considering these two concerns, this Note argues that Rule 24(a)(2) is a better and more practical way to mitigate these problems than are Rule 24(a)(2)’s alternatives.</dc:description>
            <dc:date>2017-10-04T07:00:00Z</dc:date>
            <dc:type>text</dc:type>
            <dc:format>application/pdf</dc:format>
            <dc:identifier >http://scholarship.law.duke.edu/dlj/vol67/iss1/4</dc:identifier>
            <dc:identifier>http://scholarship.law.duke.edu/cgi/viewcontent.cgi?article=3910&#38;amp;context=dlj</dc:identifier>
            <dc:source>Duke Law Journal</dc:source>
            <dc:publisher>Duke University School of Law</dc:publisher>
            <dc:subject>Law</dc:subject>

        </oai_dc:dc>
    </metadata>
</record>

lxml does some optimizations in py2 where it sometimes will output a unicode object and other times a text one. In py3 it will always output unicode. But for whatever reason when you get the text from the requests response, encode it back to str/bytes and then parse it with lxml.etree some unicode (and maybe external entities) are getting processed incorrectly.

in: sickle/sickle/response.py

    @property
    def xml(self):
        """The server's response as parsed XML."""
        return etree.XML(self.http_response.text.encode("utf8"),
                         parser=XMLParser)

I think your tests are passing because you are reading from a file object which is directly providing string/byte, though you may not have coverage of anything past the ASCII character space.

I think the simple fix is in sickle/sickle/response.py to just parse the response content rather than encoding the text which is already being processed by requests and which needs to be encoded to be handled by lxml.etree

    @property
    def xml(self):
        """The server's response as parsed XML."""
        return etree.XML(self.http_response.content,
                         parser=XMLParser)

sourcefilter commented 6 years ago

I'm having the same problem (in py3) – put in unicode strings with non-ascii characters, get back unicode strings with weird escape sequences.

We're all a little confused about str / bytes, and supporting unicode on python 2 and 3 must be a nightmare, so I just wanted to offer my own explanation of why @gugek's fix works:

Py3 str is a unicode string, independent of encoding.
Py3 bytes are binary data, which could be a particular encoding of a unicode string, or anything else.
Best practice is to decode bytes to str ASAP after loading from disk/network, only encode back to bytest at the last possible moment before saving/sending.
Since requests (I think) handles the decoding, Sickle should be able to work exclusively in strings in py3, and never use .encode() or .decode().

My understanding largely comes from Ch. 4 of this book: http://shop.oreilly.com/product/0636920032519.do

And please correct me if I'm wrong! I get as confused by unicode as everyone else!

sourcefilter commented 6 years ago

Okay, it looks like sickle isn't at fault – at least in my case, requests is wrongly detecting ISO-8859-1 instead of UTF-8.

sourcefilter commented 6 years ago

Also, I didn't realize that lxml requires bytes – so that explains why Sickle encodes its input. So sorry for getting all pedantic about that.

mloesch commented 6 years ago

Fixed in the latest release 0.6.3

mloesch / sickle

str/bytes from requests response interaction with lxml #20