Closed edevil closed 7 years ago
Hi @edevil! Sorry for the delay. Floki currently uses Mochiweb as its HTML parser. As mentioned in the Mochiweb project, it does not support encodings other than UTF-8.
Please try @mischov's suggestion to convert your document to UTF-8: https://github.com/hansihe/html5ever_elixir/issues/6#issuecomment-305258875
Thanks!
Ok, thanks!
Hi, a few years have passed since this thread was posted... Is there any update on this issue? I also need to parse a charset=windows-1252 page and am not sure how to do it. The linked reply is not very clear either, but is that the way to go? Thank you
Hey @nuno84 :wave: This is still a problem, since Floki does not implement the algorithm for detecting the encoding of the page.
But what @mischov suggested there is that you could use the Codepagex Hex package to convert from your encoding to UTF-8.
I think you can achieve the same result without that package, by using the :unicode module from Erlang:
html = :unicode.characters_to_binary(your_html, :latin1)
Floki.parse_document!(html)
Since latin1 (or ISO 8859-1) is a superset of windows-1252, this should work.
Hi again Filipe, I tested and both solutions solve the issue. I will keep yours as it is one less dependency 👍 One last question: Is there any way to detect if I need to decode the HTML? Do I need to do some regex on the HTML head to look for encoding property or something similar? Any thoughts on that? Thank you very much.
I was thinking about this.
Isn't the encoding on the meta in the head?
It doesn't seem that difficult to do a regex and apply that transform for a list of encodings, or am I oversimplifying it?
I could work on it... maybe through an apply_auto_encode option to make it optional?
What are your thoughts?
Ok, for future reference, I found that the conversion is not complete.
The € symbol doesn't work with iso-8859-1, so neither solution worked 100%.
I installed the package: {:tds_encoding, "~> 1.1"}
Tds.Encoding.decode(body, encoding)
And now it worked.
But this installed a lot of stuff and I am not that happy with such a big dependency:
==> toml
Compiling 10 files (.ex)
Generated toml app
==> rustler
Compiling 7 files (.ex)
Generated rustler app
==> tds_encoding
Compiling 1 file (.ex)
Updating crates.io index
Compiling lib/tds_encoding.ex (it's taking more than 10s)
Downloaded quote v1.0.9
Downloaded void v1.0.2
Downloaded unicode-xid v0.2.2
Downloaded lazy_static v1.4.0
Downloaded encoding-index-simpchinese v1.20141219.5
Downloaded unicode-segmentation v1.6.0
Downloaded encoding-index-singlebyte v1.20141219.5
Downloaded rustler_sys v2.1.1
Downloaded heck v0.3.1
Downloaded rustler v0.22.0
Downloaded rustler_codegen v0.22.0
Downloaded encoding v0.2.33
Downloaded proc-macro2 v1.0.29
Downloaded syn v1.0.77
Downloaded encoding_index_tests v0.1.4
Downloaded unreachable v1.0.0
Downloaded encoding-index-korean v1.20141219.5
Downloaded encoding-index-japanese v1.20141219.5
Downloaded encoding-index-tradchinese v1.20141219.5
Downloaded 19 crates (1.1 MB) in 1.53s
Compiling crate tds_encoding in release mode (native/tds_encoding)
Compiling encoding_index_tests v0.1.4
Compiling proc-macro2 v1.0.29
Compiling unicode-xid v0.2.2
Compiling syn v1.0.77
Compiling unicode-segmentation v1.6.0
Compiling rustler_sys v2.1.1
Compiling void v1.0.2
Compiling rustler v0.22.0
Compiling lazy_static v1.4.0
Compiling encoding-index-tradchinese v1.20141219.5
Compiling encoding-index-simpchinese v1.20141219.5
Compiling encoding-index-korean v1.20141219.5
Compiling encoding-index-japanese v1.20141219.5
Compiling encoding-index-singlebyte v1.20141219.5
Compiling unreachable v1.0.0
Compiling heck v0.3.1
Compiling encoding v0.2.33
Compiling quote v1.0.9
Compiling rustler_codegen v0.22.0
Compiling tds_encoding v0.2.0 (C:\Users\Asus\Documents\Business\Phoenix\Projects\stageagenda_umbrella\deps\tds_encoding\native\tds_encoding)
Finished release [optimized] target(s) in 23.12s
Generated tds_encoding app
Any suggestion? At least it seems to be working now. Thank you
> Ok, for future reference, I found that the conversion is not complete. The € symbol doesn't work with iso-8859-1, so neither solution worked 100%.
Sorry, I mixed those up. Actually, windows-1252 is a superset of ISO 8859-1.
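To make the difference concrete, here is a minimal sketch of why the € symbol gets lost with :latin1 and how a dependency-free windows-1252 decoder could look. The module name Win1252Sketch and its decode/1 function are hypothetical (they are not part of Floki or any package mentioned here), and the table only covers the 0x80..0x9F range, which is where windows-1252 differs from ISO 8859-1.

```elixir
defmodule Win1252Sketch do
  # Hypothetical helper, not part of Floki. Bytes 0x80..0x9F are where
  # windows-1252 assigns printable characters that ISO 8859-1 lacks.
  @high %{
    0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
    0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
    0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
    0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
    0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
    0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
    0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178
  }

  # Decodes a windows-1252 binary into a UTF-8 binary, one byte at a time.
  # Bytes without an entry in @high keep their own value as the code point,
  # which is exactly what a latin1 conversion would do for them.
  def decode(binary) when is_binary(binary) do
    for <<byte <- binary>>, into: "" do
      <<Map.get(@high, byte, byte)::utf8>>
    end
  end
end

# In windows-1252 the byte 0x80 is the euro sign.
# :latin1 turns it into the control character U+0080 instead:
latin1_result = :unicode.characters_to_binary(<<0x80>>, :latin1)
# The sketch maps it to U+20AC:
win1252_result = Win1252Sketch.decode(<<0x80>>)
```

For bytes outside 0x80..0x9F both conversions agree, so the :unicode approach is only wrong for pages that use characters such as €, curly quotes or dashes.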
> Isn't the encoding on the meta in the head? It doesn't seem that difficult to do a regex and apply that transform for a list of encodings, or am I oversimplifying it?
No, unfortunately it is not that simple. See the algorithm description here: https://html.spec.whatwg.org/#determining-the-character-encoding
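For a rough idea of what a partial check could look like, here is a minimal sketch that only looks for a byte order mark and for a charset declared in a meta tag within the first 1024 bytes. It is not the algorithm from the spec: it ignores the HTTP Content-Type header, comments, attribute parsing rules and the other fallbacks. The module name CharsetSniffSketch and the return values are assumptions made up for this example.

```elixir
defmodule CharsetSniffSketch do
  # Hypothetical helper, not part of Floki, and NOT the full WHATWG algorithm.
  # The spec's prescan also limits itself to the first 1024 bytes.
  @prescan_bytes 1024

  # A byte order mark is checked before any declaration in the markup.
  def sniff(<<0xEF, 0xBB, 0xBF, _::binary>>), do: {:bom, "utf-8"}
  def sniff(<<0xFE, 0xFF, _::binary>>), do: {:bom, "utf-16be"}
  def sniff(<<0xFF, 0xFE, _::binary>>), do: {:bom, "utf-16le"}

  def sniff(html) when is_binary(html) do
    head = binary_part(html, 0, min(byte_size(html), @prescan_bytes))

    # No "u" modifier: the input may not be valid UTF-8, so match raw bytes.
    # Matches both <meta charset="..."> and
    # <meta http-equiv="Content-Type" content="text/html; charset=...">.
    case Regex.run(~r/<meta[^>]*charset\s*=\s*["']?([a-zA-Z0-9_\-]+)/i, head) do
      [_, label] -> {:meta, String.downcase(label)}
      nil -> :unknown
    end
  end
end

short_form = CharsetSniffSketch.sniff(~s(<html><head><meta charset="Windows-1252"></head></html>))

long_form =
  CharsetSniffSketch.sniff(
    ~s(<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">)
  )

no_declaration = CharsetSniffSketch.sniff("<p>hi</p>")
```

When this returns :unknown, String.valid?/1 can at least tell you whether the body is already valid UTF-8 and needs no conversion.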
> I installed the package: {:tds_encoding, "~> 1.1"} Tds.Encoding.decode(body, encoding) And now it worked. But this installed a lot of stuff and I am not that happy with such a big dependency:
I see. This is because that dependency is using Rustler, but without precompilation. I think a solution would be to propose the usage of Rustler Precompiled there. I can help with that if you want :) But it should be really straightforward if you follow the examples.
I'm also planning to create another package for that, but I haven't been able to focus on that.
But I have one question: are you trying to parse random pages from the internet? Or do you have some specific target that uses this specific encoding (windows-1252)?
I also thought that if it were simple it would have been done a long time ago. I am parsing random pages, which is why some will eventually have "weird" encodings, but I can specify each encoding by hand. No problem with that, as the process will always be done individually. Ok, I can try the precompiled lib you did, but I don't understand it. I will add to deps:
...
{:rustler_precompiled, "~> 0.5"},
{:rustler, "~> 0.23.0", optional: true},
{:tds_encoding, "~> 1.1"}
...
mix deps throws an error:
Failed to use "rustler" because
apps/my_app/mix.exs requires ~> 0.22.0
rustler_precompiled (version 0.5.1) requires ~> 0.23
mix.lock specifies 0.22.2
Added module:
defmodule MyApp.RustlerNative do
  version = Mix.Project.config()[:version]

  use RustlerPrecompiled,
    otp_app: :my_app,
    base_url:
      "https://github.com/philss/rustler_precompilation_example/releases/download/v#{version}",
    force_build: System.get_env("RUSTLER_PRECOMPILATION_EXAMPLE_BUILD") in ["1", "true"],
    version: version

  # When your NIF is loaded, it will override this function.
  def add(_a, _b), do: :erlang.nif_error(:nif_not_loaded)
end
Now I can call the function as usual?
Tds.Encoding.decode(body, encoding)
Is this the process or am I missing something? I read the blog post and the example you did. The deps are failing, but I don't know if I should try a lower version of rustler_precompiled. Is the work of using precompiled worth it? Is it because CI tests start everything from the ground up on every single pass? Am I understanding it right? Thank you once again Philip.
For reference, this is solved by using the lib {:excoding, "~> 0.1.2"}: Excoding.decode(body, encoding). More info: https://github.com/philss/floki/issues/116#issuecomment-1205577086 Thank you once again.
If we're parsing an XML file with an encoding:
Example:
This example was taken from: "http://manybooks.net/index.xml"