nmdias / FeedKit

An RSS, Atom and JSON Feed parser written in Swift
MIT License

Avoid reencoding all incoming data to UTF-8 #83

Closed · GarthSnyder closed 5 years ago

GarthSnyder commented 5 years ago

Hi! This is a followup to #51 and #43.

I believe @CD1212 and I both encountered the same underlying issue: feed type detection would fail when the document encoding was not UTF-8, and we both submitted patches to address it. Chris's patch predated mine by about six months, but the two were merged at around the same time, and because of the way I described my patch, it probably wasn't apparent that we were addressing the same problem.

So, both patches are now in master and there's some code that could be pruned.

Chris's patch converts all documents to UTF-8 on entry to the parser and, with respect, I'd argue that this is not correct.

JSON documents are required by RFC 8259 to use UTF-8 encoding already. The previous JSON standard, RFC 7159, also allowed UTF-16 and UTF-32, but strongly suggested UTF-8. As a practical matter, I have not encountered any non-UTF-8 JSON documents, although I'm sure a few exist somewhere.

XML allows a wide variety of encodings, including many that are not Unicode, and the encoding is specified in the XML declaration, e.g.,

<?xml version="1.0" encoding="ISO-8859-1"?>

If you reencode the document to UTF-8, you now have an XML document that declares itself to be in one encoding while actually using another. I do not know to what extent Apple's XMLParser actually relies on the declared encoding, but it doesn't really matter: if it obeys the declared encoding, then preconverting to UTF-8 is potentially problematic; if it doesn't, then XMLParser is doing its own encoding detection and conversion, and preconversion is unnecessary. (XMLParser transmits the document to its delegate in the form of strings, which have a Unicode API. There's no risk of receiving a string that's "really" in, e.g., undecoded Shift-JIS; encodings are normalized at the point where data enters the String domain.)
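
To make the mismatch concrete, here is a contrived illustration (just a sketch, not anything from FeedKit): the bytes get reencoded, but the declaration is left behind.

import Foundation

// Contrived example: the bytes are reencoded to UTF-8, but the XML declaration
// still claims ISO-8859-1.
let latin1XML = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><title>Café</title>"
let latin1Data = latin1XML.data(using: .isoLatin1)!            // bytes match the declaration

let reencoded = String(data: latin1Data, encoding: .isoLatin1)!
    .data(using: .utf8)!                                       // bytes are now UTF-8...

// ...but the prolog still says ISO-8859-1, so a parser that honors the declaration
// would read the two UTF-8 bytes of "é" as two separate Latin-1 characters ("Ã©").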

I believe the problem that formerly caused parsing to fail on some XML documents was that document type detection would assume UTF-8 encoding and abort if the document didn't decode properly that way. But this is fixed in the current version, which uses lenient UTF-8 decoding. Unconvertible characters don't matter because the dispositive characters are in the plain ASCII range and so they are coded identically in all the common schemes.
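
Put differently, all the detection step needs is something along these lines (a sketch of the idea, not FeedKit's actual code):

import Foundation

// Lenient detection sketch: decode with replacement characters allowed, then
// look at the first meaningful character. "<" and "{" are plain ASCII, so they
// survive a lenient UTF-8 decode of Latin-1, Windows-1252, Shift-JIS, etc.
func sniffFeedType(_ data: Data) -> String? {
    let text = String(decoding: data, as: UTF8.self)    // lenient; never fails
    guard let first = text.first(where: { !$0.isWhitespace }) else { return nil }
    switch first {
    case "<": return "xml"      // RSS or Atom; the root element decides which
    case "{": return "json"
    default:  return nil
    }
}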

nmdias commented 5 years ago

Hi, @GarthSnyder

I'm inclined to agree that forcefully converting to UTF-8 is not the sanest approach.

There was actually an issue, #62, where XMLParser was fed an ISO-8859-1 encoded file and kept breaking outside of my control. The documentation for XMLParser also doesn't seem to say how it handles encoding, or what it expects, except for CDATA blocks, where it states that the data is UTF-8 encoded, but that doesn't really clarify anything...
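
If anyone wants to probe that behavior, the kind of experiment I have in mind looks roughly like this (a sketch with a hypothetical file path, not code from FeedKit): feed XMLParser the untouched bytes and see whether it honors the declared encoding.

import Foundation

// Hand XMLParser the raw bytes of an ISO-8859-1 feed (path is hypothetical)
// with no pre-conversion to UTF-8, and check whether parsing succeeds.
let url = URL(fileURLWithPath: "latin1-feed.xml")
if let data = try? Data(contentsOf: url) {
    let parser = XMLParser(data: data)
    if parser.parse() {
        print("XMLParser handled the declared encoding")
    } else {
        print("Failed:", parser.parserError?.localizedDescription ?? "unknown error")
    }
}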

I might one day consider moving from XMLParser to libxml2, but for the time being I don't see any immediate benefit.

I'm going to merge this now.

Thank you for the awesome contribution, both code and reasoning. 🙏

Cheers

GarthSnyder commented 5 years ago

Thanks for that pointer - I hadn't seen the issue you mention and it looks like it might be directly relevant. It sounds like you already traced the problem down into the bowels of XMLParser, but let me take a look at it as well just as a second pair of eyes.

I added a test case for that feed, and currently it is still failing.

GarthSnyder commented 5 years ago

Ha! I see what's going on here: this feed contains unescaped ampersands, which are not legal in XML.
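
You can reproduce it with any bare ampersand, e.g. in a URL query string; XMLParser treats it as the start of an entity reference and gives up when the terminating ";" never shows up:

import Foundation

// A single unescaped "&" in element content makes parsing fail outright.
let bad = "<rss><channel><link>http://example.com/?a=1&b=2</link></channel></rss>"
let parser = XMLParser(data: Data(bad.utf8))
print(parser.parse())                      // false
print(parser.parserError ?? "no error")    // reports the malformed entity reference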

I have seen this in a lot of feeds and have a workaround in my own app that tries to reparse after escaping the ampersands. But as you can see, it's really just a giant hack that assumes, among other things, UTF-8 encoding 😳:

// Some feeds have contraband '&' characters. If parsing fails, we want
// to retry after escaping those ampersands. However, if that parse attempt
// also fails, we want to return the original error, not the error associated
// with the parsing of the ampersand-escaped version of the data.

private func parseFeedStream(_ stream: DataStream) throws -> GenericFeed {

    let parser = FeedParser(xmlStream: stream)
    do {
        return try parseAndCheck(parser, stream: stream)
    } catch let outerError {
        // Groom the XML and reattempt
        print("Reattempting...")
        do {
            guard let data = stream.getAllData() else { throw FeedReaderError.couldNotObtainAllData }
            let groomedData = try data.groomXML()
            let backupParser = FeedParser(data: groomedData)
            return try parseAndCheck(backupParser)
        } catch {
            throw outerError
        }
    }

}

extension Data {

    func groomXML() throws -> Data {

        // Escape ampersands
        let dataString = String(decoding: self, as: UTF8.self)
    let ampersands = #"&(?![\w#\d]{2,8};)"# // Ampersands not already part of an entity reference
        let groomedString = dataString.replacingOccurrences(of: ampersands, with: "&amp;", options: .regularExpression)

        guard let groomed = groomedString.data(using: .utf8, allowLossyConversion: true) else {
            throw DataError.couldNotGroom
        }
        return groomed
    }

}

I didn't submit anything back to FeedKit because I don't really know the proper solution. The regex thing is super-sketchy. Arguably, the feeds are overtly noncompliant and should fail parsing. But I'd imagine nobody really wants that.

nmdias commented 5 years ago

That was quick! I don't think I would have easily caught that. Yup, there are a lot of URLs with unescaped ampersands in that feed. Thank you for sharing that!

btw, the naming of that function, groomXML, is killing me 🤣