Closed by bitboxer 5 years ago
Hi @bitboxer! Thanks for opening the issue.
The problem is that Elixir assumes everything is UTF-8, so it cannot treat the contents of files in another encoding as strings. This page uses ISO-8859-1, which Elixir does not fully understand, so any bytes that are not valid UTF-8 show up as a raw binary.
This is not an easy task to fix on Floki's side, and it's something that I'm postponing in the new version of the parser because detecting and converting file encodings is too complicated.
Instead of fixing it in Floki, I recommend that you use an external tool like iconv. You can convert your file before using it in Elixir:
$ iconv -f ISO-8859-1 -t UTF-8 original-page.html > page-in-utf8.html
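To check that the conversion does what you expect, here is a minimal sketch that can be run on a throwaway file. The file names `sample-latin1.html` and `sample-utf8.html` and the sample text are made up for illustration; only the `iconv` flags come from the command above.

```shell
# Hypothetical sample: write "café" encoded as ISO-8859-1.
# In that encoding "é" is the single byte 0xE9 (octal 351).
printf 'caf\351\n' > sample-latin1.html

# Convert it to UTF-8 with the same flags as the command above.
iconv -f ISO-8859-1 -t UTF-8 sample-latin1.html > sample-utf8.html

# Dump the bytes of the result: "é" should now be the
# two-byte UTF-8 sequence c3 a9 instead of the single byte e9.
od -An -tx1 sample-utf8.html
```

If the output still contains a lone `e9`, the file was probably not ISO-8859-1 to begin with, and the `-f` value would need to match the page's real encoding.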
Alternatively, you can try to convert the page using the iconv hex package, which is a binding to the iconv tool. But be aware that this is a C extension and can cause problems in your application.
Ah, awesome. Thanks! Totally understandable. I will close this for now and use iconv in my parser project 🙇
I am writing a little crawler to fetch data from other pages and have a weird problem with this page:
This part:
Is represented as a binary instead of a string inside of Floki:
My current guess is that this is an encoding issue. Is there a way to fix this inside of Floki, or should I try to get them to fix it?