rusterlium / html5ever_elixir

NIF wrapper of html5ever using Rustler
https://hexdocs.pm/html5ever
Apache License 2.0

Parsing non-UTF-8 pages #6

Open edevil opened 7 years ago

edevil commented 7 years ago

Parsing pages not written in UTF-8 currently produces errors:

> %HTTPoison.Response{body: body} = HTTPoison.get!("http://manybooks.net/index.xml")
> Html5ever.parse(body)

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 4070 }', src/libcore/result.rs:859
note: Run with `RUST_BACKTRACE=1` for a backtrace.
{:error, "called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 4070 }"}

In this case, the XML feed declares its encoding in the XML preamble:

<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
...

Can I get around this problem or can the library be fixed to handle this situation?

mischov commented 7 years ago

I'll leave the broader question of "can the library be fixed to handle this situation?" to Hans, but-

"Can I get around this problem"

Yeah, for some definition of "get around".

body
|> Codepagex.to_string!(:iso_8859_1)
|> Html5ever.parse()
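
If pulling in Codepagex is not an option, the same conversion can probably be done with the `:unicode` module that ships with Erlang/OTP. This is only a minimal sketch, assuming the body really is ISO-8859-1; the `Latin1` module name is made up for illustration and is not part of html5ever_elixir or Codepagex.

```elixir
defmodule Latin1 do
  # Hypothetical helper, not part of any library mentioned in this thread.
  # Re-encodes an ISO-8859-1 binary as UTF-8 using OTP's :unicode module.
  def to_utf8(body) when is_binary(body) do
    :unicode.characters_to_binary(body, :latin1, :utf8)
  end
end

# 0xE9 is "é" in ISO-8859-1; on its own it is not valid UTF-8.
utf8 = Latin1.to_utf8(<<"caf", 0xE9>>)
true = String.valid?(utf8)
```

The converted binary can then be piped into `Html5ever.parse()` the same way as in the snippet above.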

edevil commented 7 years ago

Thanks, @mischov!

hansihe commented 7 years ago

Going to keep this open, I would still like to find a proper solution for this.

As far as I can tell, html5ever does not support detecting encoding yet. See this issue.
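
Until detection exists upstream, a caller could sniff the declared encoding before handing the body to the parser. The following is a rough sketch under several assumptions: it only looks at an XML declaration like the one in the feed above, it only knows how to convert ISO-8859-1, and `EncodingSniffer` with its functions is a hypothetical name, not an API of html5ever or html5ever_elixir. It ignores HTTP `Content-Type` headers and HTML `<meta charset>` tags entirely.

```elixir
defmodule EncodingSniffer do
  # Hypothetical helper for illustration only.

  # Returns {:ok, utf8_binary} or {:error, :unknown_encoding}.
  def to_utf8(body) when is_binary(body) do
    case declared_encoding(body) do
      enc when enc in ["iso-8859-1", "latin1"] ->
        {:ok, :unicode.characters_to_binary(body, :latin1, :utf8)}

      _other ->
        # No declaration, or a charset this sketch does not handle:
        # accept the body only if it is already valid UTF-8.
        if String.valid?(body), do: {:ok, body}, else: {:error, :unknown_encoding}
    end
  end

  # Extracts the lowercased encoding name from a leading XML declaration,
  # e.g. <?xml version="1.0" encoding="iso-8859-1"?>, or returns nil.
  def declared_encoding(body) when is_binary(body) do
    # No `u` modifier, so the regex matches raw bytes and tolerates non-UTF-8 input.
    regex = ~r/\A<\?xml[^>]*?encoding=["']([A-Za-z0-9._-]+)["']/

    case Regex.run(regex, body, capture: :all_but_first) do
      [enc] -> String.downcase(enc)
      nil -> nil
    end
  end
end
```

With something like this in place, a `{:ok, utf8}` result would be passed to `Html5ever.parse(utf8)`, while `{:error, :unknown_encoding}` gives the caller a chance to handle the page explicitly.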