Closed. untitaker closed this issue 8 years ago.
Could you give more details? Do you get a `Result::Err` return value? A panic? What does it say exactly? What are the steps to reproduce?
I get what seems to be an `io::Error`. From the server logs:

```
Io(Error { repr: Custom(Custom { kind: Other, error: StringError("Invalid UTF-8.") })
```
And indeed the received document contains very few invalid bytes. I'd like a mode that just skips them (since they are unlikely to be important for the kind of scraping I'll be doing), or replaces them with a default character (similar to how Python's decoding modes work).
I suppose the issue can be triggered with any `Read` that yields invalid UTF-8, but in this particular case I was fetching the document from http://google.com (honoring redirects).
Alright, I see what’s going on.
Kuchiki currently uses "strict" decoding, but it should be using "lossy" decoding that decodes invalid byte sequences as a U+FFFD replacement character. https://github.com/servo/tendril/pull/23 should help with that.
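For reference, the difference between strict and lossy decoding can be seen with the standard library alone (a minimal illustration, not kuchiki's or tendril's internals):

```rust
fn main() {
    // 0xFF can never appear in valid UTF-8.
    let bytes = b"hello \xff world";

    // Strict decoding rejects the input outright:
    assert!(std::str::from_utf8(bytes).is_err());

    // Lossy decoding substitutes U+FFFD for each invalid sequence:
    let lossy = String::from_utf8_lossy(bytes);
    assert_eq!(lossy, "hello \u{fffd} world");
    println!("{}", lossy);
}
```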
Another issue is that it doesn't support character encodings other than UTF-8. The `from_http` method could look at the `Content-Type` header, and there could be another method to override the encoding. Then there's more detection that should be done on the content; that's tracked at https://github.com/servo/html5ever/issues/18
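A sketch of what such `Content-Type` inspection could look like (the helper name and parsing are hypothetical and deliberately simplified, not kuchiki's actual API; real header parsing has more edge cases):

```rust
// Hypothetical helper: pull the `charset` parameter out of a
// Content-Type header value, e.g. "text/html; charset=ISO-8859-1".
fn charset_from_content_type(value: &str) -> Option<String> {
    value
        .split(';')
        .skip(1) // skip the media type itself, e.g. "text/html"
        .find_map(|param| {
            let mut parts = param.splitn(2, '=');
            let name = parts.next()?.trim();
            let val = parts.next()?.trim().trim_matches('"');
            if name.eq_ignore_ascii_case("charset") && !val.is_empty() {
                Some(val.to_ascii_lowercase())
            } else {
                None
            }
        })
}

fn main() {
    assert_eq!(
        charset_from_content_type("text/html; charset=ISO-8859-1"),
        Some("iso-8859-1".to_string())
    );
    assert_eq!(charset_from_content_type("text/html"), None);
}
```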
Exactly, excellent, thank you!
https://github.com/servo/html5ever/pull/188 is waiting for review. Once that lands, I can finish https://github.com/SimonSapin/kuchiki/pull/19 and clean up the API a bit so that you can use e.g.:
```rust
use kuchiki::{parse_html, ParseOpts};

let document = try!(parse_html(ParseOpts::default()).from_http("https://google.com"));
```
This will pick a character encoding based on, in order:

- the `charset` parameter of a `Content-Type` response header

More work is needed in html5ever to detect `<meta charset=…>` elements etc. But that should be doable without changing the API further.
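For illustration, a greatly simplified version of that kind of sniffing: scan a bounded prefix of the bytes for a `charset=` token. The real encoding-sniffing algorithm in the HTML spec (and in html5ever) is considerably more involved; the function below and its name are hypothetical.

```rust
// Hypothetical sketch of <meta charset=…> sniffing: look for a `charset=`
// token in the first 1024 bytes. This would also match "charset=" in plain
// text content; the real algorithm tokenizes properly.
fn sniff_meta_charset(bytes: &[u8]) -> Option<String> {
    let prefix = &bytes[..bytes.len().min(1024)];
    let text = String::from_utf8_lossy(prefix).to_ascii_lowercase();
    let start = text.find("charset=")? + "charset=".len();
    let rest = text[start..].trim_start_matches(|c| c == '"' || c == '\'');
    let end = rest
        .find(|c: char| c == '"' || c == '\'' || c == '>' || c.is_whitespace())
        .unwrap_or(rest.len());
    let value = &rest[..end];
    if value.is_empty() {
        None
    } else {
        Some(value.to_string())
    }
}

fn main() {
    assert_eq!(
        sniff_meta_charset(b"<!doctype html><meta charset=\"utf-8\">"),
        Some("utf-8".to_string())
    );
    assert_eq!(sniff_meta_charset(b"<html><body>no meta</body></html>"), None);
}
```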
Alright, I published kuchiki 0.3. The usage is something like:
```rust
use kuchiki::traits::*;
use kuchiki::parse_html;

let document = try!(parse_html().from_http("https://google.com"));
```
Let me know if this solves your issue.
Sorry for the late response -- `from_http` doesn't seem to be a thing.
Nevermind, forgot the feature flag. Lemme check again...
Okay, this broke a site I am trying to parse (that worked fine before, IIRC). Concretely, https://taskrs.5apps.com/ fails. Here's a rudimentary patch against the test suite: https://gist.github.com/untitaker/e9a4426a742f0c4a6bfe
The last assertion always fails. The parsed document is certainly not empty, but it should contain more than a barebones document. Updated with a better assertion.
I'm still trying to figure out what's wrong.
I've found nothing. The document is just ASCII and otherwise I also don't understand what's wrong with it. What am I missing?
Updated the gist again with a simplified testcase. It seems to me that parsing from bytes is deeply broken.
Oops! This one is my fault: https://github.com/servo/html5ever/pull/190
Alright, a new version with that fix is published; `cargo update -p html5ever` should fix this.
Excellent, that fixes it, and as far as I can tell the sites I care about work. Thank you.
At the moment I can't parse https://google.com (after resolving redirects), because some character there is invalid UTF-8, apparently.