kuchiki-rs / kuchiki

(朽木) HTML/XML tree manipulation library for Rust

Make kuchiki more forgiving about invalid encodings #18

Closed: untitaker closed this issue 8 years ago

untitaker commented 8 years ago

At the moment I can't parse https://google.com (after resolving redirects), because some byte sequence in the response apparently isn't valid UTF-8.

SimonSapin commented 8 years ago

Could you give more details? Do you get a Result::Error return value? A panic? What does it say exactly? What are steps to reproduce?

untitaker commented 8 years ago

I get what seems to be an io::Error. From the server logs:

Io(Error { repr: Custom(Custom { kind: Other, error: StringError("Invalid UTF-8.") })

And indeed the received document contains very few invalid bytes. I'd like a mode that just skips them (since they are unlikely to matter for the kind of scraping I'll be doing), or replaces them with a placeholder character (similar to how Python's decoding error handlers work).

I suppose the issue can be triggered with any Read that yields invalid UTF-8, but in this particular case I was fetching the document from http://google.com (honoring redirects).

SimonSapin commented 8 years ago

Alright, I see what’s going on.

Kuchiki currently uses "strict" decoding, but it should be using "lossy" decoding that decodes invalid byte sequences as a U+FFFD replacement character. https://github.com/servo/tendril/pull/23 should help with that.
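
For reference, the strict vs. lossy distinction can be illustrated with the standard library alone (this is just the concept; the tendril change linked above works at a different level):

// Strict vs. lossy UTF-8 decoding with std only; 0xFF is never valid UTF-8.
fn main() {
    let bytes = b"goo\xFFgle";
    // Strict decoding rejects the input at the first invalid byte.
    assert!(std::str::from_utf8(bytes).is_err());
    // Lossy decoding substitutes U+FFFD and keeps the rest of the text.
    assert_eq!(String::from_utf8_lossy(bytes), "goo\u{FFFD}gle");
}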

Another issue is that it doesn’t support character encodings other than UTF-8. The from_http method could look at the Content-Type header, and there could be another method to override the encoding. Beyond that, more detection should be done on the content itself; that’s https://github.com/servo/html5ever/issues/18
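
To illustrate the Content-Type part, a naive sketch of pulling out the charset parameter could look like this (charset_from_content_type is a hypothetical helper, not anything in kuchiki; real code should use a proper MIME parser and handle case-insensitivity):

// Naive sketch: extract the charset parameter from a Content-Type value.
fn charset_from_content_type(value: &str) -> Option<&str> {
    value
        .split(';')
        .skip(1)                 // skip the media type itself
        .map(str::trim)
        .find_map(|param| param.strip_prefix("charset="))
        .map(|cs| cs.trim_matches('"'))
}

fn main() {
    assert_eq!(
        charset_from_content_type("text/html; charset=windows-1251"),
        Some("windows-1251")
    );
    assert_eq!(charset_from_content_type("text/html"), None);
}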

untitaker commented 8 years ago

Exactly, excellent, thank you!

SimonSapin commented 8 years ago

https://github.com/servo/html5ever/pull/188 is waiting for review. Once that lands, I can finish https://github.com/SimonSapin/kuchiki/pull/19 and clean up the API a bit so that you can use e.g.:

use kuchiki::{parse_html, ParseOpts};
let document = try!(parse_html(ParseOpts::default()).from_http("https://google.com"));

This will pick a character encoding based on, in order: an explicit override if you set one, the charset from the HTTP Content-Type header, and detection on the content itself.

More work is needed in html5ever to detect <meta charset=…> elements etc. But that should be doable without changing the API further.

SimonSapin commented 8 years ago

Alright, I published kuchiki 0.3. The usage is something like:

use kuchiki::traits::*;
use kuchiki::parse_html;
let document = try!(parse_html().from_http("https://google.com"));

Let me know if this solves your issue.

untitaker commented 8 years ago

Sorry for the late response -- from_http doesn't seem to be a thing.

untitaker commented 8 years ago

Never mind, I forgot the feature flag. Let me check again...

untitaker commented 8 years ago

Okay, this broke a site I'm trying to parse (which worked fine before, IIRC). Concretely, https://taskrs.5apps.com/ fails. Here's a rudimentary patch against the test suite: https://gist.github.com/untitaker/e9a4426a742f0c4a6bfe

The last assertion always fails: the document is certainly not empty, but it should contain more than a barebones document. I've updated the gist with a better assertion.

I'm still trying to figure out what's wrong.

untitaker commented 8 years ago

I've found nothing. The document is plain ASCII, and otherwise I don't see anything wrong with it either. What am I missing?

untitaker commented 8 years ago

Updated the gist again with a simplified test case. It seems to me that parsing from bytes is deeply broken.
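
For reference, a byte-parsing test in this spirit might look roughly like the following (a sketch against kuchiki's TendrilSink-based API, not the actual gist contents):

use kuchiki::traits::*; // brings the from_utf8/read_from sink methods into scope

fn main() {
    // Plain ASCII input, so there is nothing encoding-related to trip over.
    let mut input: &[u8] = b"<!DOCTYPE html><html><body><p>hello</p></body></html>";

    // Parse from a byte stream rather than from a &str.
    let document = kuchiki::parse_html()
        .from_utf8()
        .read_from(&mut input)
        .unwrap();

    // The parsed tree should contain the paragraph, not just a bare skeleton.
    assert!(document.select_first("p").is_ok());
}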

SimonSapin commented 8 years ago

Oops! This one is my fault: https://github.com/servo/html5ever/pull/190

SimonSapin commented 8 years ago

Alright, a new version with that fix is published; cargo update -p html5ever should fix this.

untitaker commented 8 years ago

Excellent, that fixes it, and as far as I can tell the sites I care about work. Thank you.