philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License

Some attributes are binary #221

Closed: bitboxer closed this issue 5 years ago

bitboxer commented 5 years ago

I am writing a little crawler to fetch data from other pages and have a weird problem with this page:

This part:

<meta itemprop="name" content="Kästner für Erwachsene / 4 Bände: ..." />

Is represented as binary instead of a string inside of floki:

{"meta",
 [
   {"itemprop", "name"},
   {"content",
    <<75, 228, 115, 116, 110, 101, 114, 32, ...>>}
 ], []},

My current guess is that this is an encoding issue. Is there a way to fix this inside of Floki, or should I try to get them to fix it?

philss commented 5 years ago

Hi @bitboxer! Thanks for opening the issue.

The problem is that Elixir assumes every string is encoded in UTF-8, so it cannot treat content in another encoding as a string. This page uses ISO-8859-1, and bytes such as 228 ("ä" in ISO-8859-1) are not valid UTF-8 on their own, which is why the attribute value shows up as a raw binary.
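To make the point concrete, here is a minimal sketch using the first seven bytes from the report (the elided tail is left out). It only uses the standard library; the variable name `bytes` is just for illustration.

```elixir
# First bytes of the attribute value as reported in the issue.
bytes = <<75, 228, 115, 116, 110, 101, 114>>

# Byte 228 (0xE4) would start a multi-byte sequence in UTF-8, but the next
# byte is not a continuation byte, so the binary is not a valid string.
IO.inspect(String.valid?(bytes))

# The same word written in a UTF-8 source file is valid, and one byte longer,
# because "ä" takes two bytes in UTF-8.
IO.inspect(String.valid?("Kästner"))
IO.inspect(byte_size("Kästner"))

# Since the bytes are not valid UTF-8, inspect falls back to the <<...>> form.
IO.inspect(bytes)
```

This is why the value appears as `<<75, 228, ...>>` in the parsed tree: nothing is lost, the bytes are simply not printable as a UTF-8 string.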

This is not an easy task to fix on Floki's side, and it's something that I'm postponing in the new version of the parser, because detecting a file's encoding and converting it is too complicated.

Instead of fixing it in Floki, I recommend that you use some external tool like iconv. You can convert your file before using it in Elixir:

$ iconv -f ISO-8859-1 -t UTF-8 original-page.html > page-in-utf8.html

Alternatively, you can try to convert the page using the iconv hex package, which is a binding to the iconv tool. But be aware that this is a C extension and can bring some problems to your application.
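For the specific case of ISO-8859-1, a conversion can also be sketched without any C extension, using Erlang's built-in `:unicode` module. This is not something proposed in the thread: the module name `EncodingSketch` and the function `to_utf8/1` are hypothetical, and the code assumes that any input that is not valid UTF-8 is ISO-8859-1, which will not hold for every page.

```elixir
defmodule EncodingSketch do
  # Hypothetical helper, not part of Floki.
  # Assumption: input that is not valid UTF-8 is encoded in ISO-8859-1.
  def to_utf8(body) when is_binary(body) do
    if String.valid?(body) do
      # Already UTF-8, return it unchanged.
      body
    else
      # Treat every byte as one ISO-8859-1 code point and re-encode as UTF-8.
      :unicode.characters_to_binary(body, :latin1, :utf8)
    end
  end
end

# Usage with the bytes from the report (the elided tail is left out).
IO.inspect(EncodingSketch.to_utf8(<<75, 228, 115, 116, 110, 101, 114>>))
```

The converted body can then be handed to Floki for parsing. For other encodings, or when the encoding has to be detected first, iconv remains the more general option.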

bitboxer commented 5 years ago

Ah, awesome. Thanks! Totally understandable. Will close this for now then and will use iconv in my parser project 🙇