philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License

Encoding is not taken into account when parsing file #116

Closed: edevil closed this issue 7 years ago

edevil commented 7 years ago

If we're parsing an XML file with an encoding:

<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
...

Example:

> Enum.at(Floki.find(body, "description"), 7)
{"description", [],
 [<<60, 112, 62, 65, 32, 66, 97, 114, 98, 97, 114, 97, 32, 77, 97, 114, 114, 32,
    77, 117, 114, 100, 101, 114, 32, 77, 121, 115, 116, 101, 114, 121, 44, 32,
    66, 111, 111, 107, 32, 49, 32, 60, 47, 112, 62, 60, 112, 62, ...>>]}

This example was taken from: "http://manybooks.net/index.xml"

philss commented 7 years ago

Hi @edevil! Sorry for the delay. Today Floki uses Mochiweb as the HTML parser. As mentioned in the Mochiweb project, it does not support encodings other than UTF-8.

Please try @mischov's suggestion to convert your document to UTF-8: https://github.com/hansihe/html5ever_elixir/issues/6#issuecomment-305258875

Thanks!

edevil commented 7 years ago

Ok, thanks!

nuno84 commented 2 years ago

Hi, a few years have passed since this thread was posted... Is there any update on this issue? I also need to parse a charset=windows-1252 page and am not sure how to do it. That reply URL is not very clear either, but is this the way to go? Thank you

philss commented 2 years ago

Hey @nuno84 :wave: This is still a problem, since Floki does not implement the algorithm for detecting the encoding of the page.

But what @mischov suggested there is that you could use the Codepagex Hex package to convert from your encoding to UTF-8. I think you can achieve the same result without that package, by using the :unicode module from Erlang:

html = :unicode.characters_to_binary(your_html, :latin1)

Floki.parse_document!(html)

Since latin1 (or ISO 8859-1) is a superset of windows-1252, this should work.
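
For example, a minimal sketch of that flow, assuming body holds the raw ISO 8859-1 (latin1) bytes of the fetched feed:

# `body` is assumed to be the raw ISO 8859-1 (latin1) binary of the fetched page
utf8_body = :unicode.characters_to_binary(body, :latin1)

{:ok, document} = Floki.parse_document(utf8_body)
Floki.find(document, "description")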

nuno84 commented 2 years ago

Hi again Filipe, I tested and both solutions solve the issue. I will keep yours, as it is one less dependency 👍 One last question: is there any way to detect whether I need to decode the HTML? Do I need to do some regex on the HTML head to look for the encoding property, or something similar? Any thoughts on that? Thank you very much.

nuno84 commented 2 years ago

I was thinking about this. Isn't the encoding in the meta tag in the head? It doesn't seem that difficult to do a regex and apply that transform for a list of encodings, or am I oversimplifying it? I could work on it... maybe through an apply_auto_encode option to make it optional? What are your thoughts?
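
Something naive like this sketch is what I have in mind (the regex and the utf-8 fallback here are just my assumptions, not the spec's full detection algorithm):

# naive sketch: look for a charset declaration near the top of the document
detect_charset = fn html ->
  head = binary_part(html, 0, min(byte_size(html), 1024))

  case Regex.run(~r/charset\s*=\s*["']?([\w-]+)/i, head) do
    [_, charset] -> String.downcase(charset)
    nil -> "utf-8"
  end
end

# e.g. detect_charset.(body) would hopefully return "windows-1252" for my page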

nuno84 commented 2 years ago

Ok, for future reference, I found that the conversion is not complete. The € symbol doesn't work with iso-8859-1, so neither solution worked 100%. I installed this package:

{:tds_encoding, "~> 1.1"}

Tds.Encoding.decode(body, encoding)

And now it works. But this pulled in a lot of stuff, and I am not that happy with such a big dependency:

==> toml
Compiling 10 files (.ex)
Generated toml app
==> rustler
Compiling 7 files (.ex)
Generated rustler app
==> tds_encoding
Compiling 1 file (.ex)
    Updating crates.io index
Compiling lib/tds_encoding.ex (it's taking more than 10s)
  Downloaded quote v1.0.9
  Downloaded void v1.0.2
  Downloaded unicode-xid v0.2.2
  Downloaded lazy_static v1.4.0
  Downloaded encoding-index-simpchinese v1.20141219.5
  Downloaded unicode-segmentation v1.6.0
  Downloaded encoding-index-singlebyte v1.20141219.5
  Downloaded rustler_sys v2.1.1
  Downloaded heck v0.3.1
  Downloaded rustler v0.22.0
  Downloaded rustler_codegen v0.22.0
  Downloaded encoding v0.2.33
  Downloaded proc-macro2 v1.0.29
  Downloaded syn v1.0.77
  Downloaded encoding_index_tests v0.1.4
  Downloaded unreachable v1.0.0
  Downloaded encoding-index-korean v1.20141219.5
  Downloaded encoding-index-japanese v1.20141219.5
  Downloaded encoding-index-tradchinese v1.20141219.5
  Downloaded 19 crates (1.1 MB) in 1.53s
Compiling crate tds_encoding in release mode (native/tds_encoding)
   Compiling encoding_index_tests v0.1.4
   Compiling proc-macro2 v1.0.29
   Compiling unicode-xid v0.2.2
   Compiling syn v1.0.77
   Compiling unicode-segmentation v1.6.0
   Compiling rustler_sys v2.1.1
   Compiling void v1.0.2
   Compiling rustler v0.22.0
   Compiling lazy_static v1.4.0
   Compiling encoding-index-tradchinese v1.20141219.5
   Compiling encoding-index-simpchinese v1.20141219.5
   Compiling encoding-index-korean v1.20141219.5
   Compiling encoding-index-japanese v1.20141219.5
   Compiling encoding-index-singlebyte v1.20141219.5
   Compiling unreachable v1.0.0
   Compiling heck v0.3.1
   Compiling encoding v0.2.33
   Compiling quote v1.0.9
   Compiling rustler_codegen v0.22.0
   Compiling tds_encoding v0.2.0 (C:\Users\Asus\Documents\Business\Phoenix\Projects\stageagenda_umbrella\deps\tds_encoding\native\tds_encoding)
    Finished release [optimized] target(s) in 23.12s
Generated tds_encoding app

Any suggestions? At least it seems to be working now. Thank you

philss commented 2 years ago

Ok, for future reference, I found that the conversion is not complete. The € symbol doesn't work with iso-8859-1, so neither solution worked 100%.

Sorry, I got that backwards: actually, windows-1252 is a superset of ISO 8859-1.
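
To illustrate with a minimal sketch: the Euro sign is byte 0x80 in windows-1252, but in latin1 that byte is a C1 control character, so the :unicode conversion cannot recover "€":

# 0x80 is "€" in windows-1252, but U+0080 (a control character) in latin1,
# so the latin1 conversion yields <<194, 128>> instead of "€" (<<226, 130, 172>>)
:unicode.characters_to_binary(<<0x80>>, :latin1)
#=> <<194, 128>>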

Isn't the encoding in the meta tag in the head? It doesn't seem that difficult to do a regex and apply that transform for a list of encodings, or am I oversimplifying it?

No, unfortunately it is not that simple. See the algorithm description here: https://html.spec.whatwg.org/#determining-the-character-encoding

I installed this package: {:tds_encoding, "~> 1.1"} Tds.Encoding.decode(body, encoding) And now it works. But this pulled in a lot of stuff, and I am not that happy with such a big dependency.

I see. This is because that dependency uses Rustler, but without precompilation. I think a solution would be to propose using Rustler Precompiled there. I can help with that if you want :) But it should be really straightforward if you follow the examples.

I'm also planning to create another package for that, but I haven't been able to focus on that.

But I have one question: are you trying to parse random pages from the internet? Or do you have some specific target that uses this specific encoding (windows-1252)?

nuno84 commented 2 years ago

I also thought that if it were simple it would have been done a long time ago. I am parsing random pages, which is why some will eventually have "weird" encodings, but I can specify each encoding by hand; no problem with that, since each page is processed individually. Ok, I can try the precompiled lib you did, but I don't understand it. I will add this to deps:

...
      {:rustler_precompiled, "~> 0.5"},
      {:rustler, "~> 0.23.0", optional: true},
      {:tds_encoding, "~> 1.1"}
...

mix deps throws an error:

Failed to use "rustler" because
  apps/my_app/mix.exs requires ~> 0.22.0
  rustler_precompiled (version 0.5.1) requires ~> 0.23
  mix.lock specifies 0.22.2

I added this module:

defmodule MyApp.RustlerNative do
  version = Mix.Project.config()[:version]

  use RustlerPrecompiled,
    otp_app: :my_app,
    base_url:
      "https://github.com/philss/rustler_precompilation_example/releases/download/v#{version}",
    force_build: System.get_env("RUSTLER_PRECOMPILATION_EXAMPLE_BUILD") in ["1", "true"],
    version: version

  # When your NIF is loaded, it will override this function.
  def add(_a, _b), do: :erlang.nif_error(:nif_not_loaded)
end

Now can I call the function as usual? Tds.Encoding.decode(body, encoding)

Is this the process, or am I missing something? I read the blog post and the example you did. The deps are failing, but I don't know if I should try a lower version of rustler_precompiled? Is the work of using precompilation worth it? Is it because CI tests start everything from the ground up on every single pass? Am I understanding it right? Thank you once again Philip.

nuno84 commented 2 years ago

For reference, this is solved by using the lib {:excoding, "~> 0.1.2"} and calling Excoding.decode(body, encoding). More info: https://github.com/philss/floki/issues/116#issuecomment-1205577086 Thank you once again.
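
Putting it together with Floki, a minimal sketch (assuming Excoding.decode/2 returns the converted UTF-8 binary directly, and encoding is the page's charset name, e.g. "windows-1252"):

# assumption: Excoding.decode/2 returns the re-encoded UTF-8 binary
utf8_body = Excoding.decode(body, encoding)

document = Floki.parse_document!(utf8_body)
Floki.find(document, "description")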