martinsvalin / html_entities

Elixir module for decoding HTML entities.
MIT License

Percent html entity is not decoded #25

Open paveltyk opened 4 years ago

paveltyk commented 4 years ago

Expected: HtmlEntities.decode("100&percnt;") #=> "100%"
Actual: HtmlEntities.decode("100&percnt;") #=> "100&percnt;"

paveltyk commented 4 years ago

This seems to be a comprehensive list of HTML entities: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

I can build a mapping file, if you are willing to use it in your project.

martinsvalin commented 4 years ago

Hi, thank you for pointing this out. The list you referenced is actually what I used to generate this file, which is then used as a source for all the function clauses to cover these named entities.

The wikipedia page has since been updated to include entities defined in HTML 5.0, growing the list from a few hundred to a few thousand entities.

It's a reasonable addition, but I'll think about whether this can be done in a nice way, so that users who only need to decode old documents from back when entities were more commonplace can keep a slimmer, more performant dependency. Functionally it's a backwards-compatible change, but there will be some cost in performance and compiled file size; at the very least I need to measure the impact on both.

Where did you find a document in the wild with HTML 5.0 entities in it? I'm a little bit surprised as I don't see good reasons to encode characters beyond the ones needed to produce html-safe text these days.

paveltyk commented 4 years ago

We do a lot of web scraping, and there are many weird things in the wild :)

Please note there are quite a few entities with multiple codepoints. Also, I've noticed &amp and &amp; are both valid entities, so I had to sort entities in Util.HtmlCharref.Util.load_entities by their length, longest first. Otherwise "Tom &amp; Jerry" could be decoded to "Tom &; Jerry", with the shorter &amp matching first and leaving the stray semicolon behind.

My quick solution to this (excerpt from your codebase):

defmodule Util.HtmlCharref do
  def decode(text) when is_binary(text), do: decode(text, [])
  def decode(text), do: text

  # https://html.spec.whatwg.org/entities.json
  @charref_filename "./lib/util/html_charref/entities.txt"
  codes = Util.HtmlCharref.Util.load_entities(@charref_filename)

  # One function clause per entity; entities were sorted longest-first on
  # load, so "&amp;" is tried before the semicolon-less "&amp".
  for {name, codepoints} <- codes do
    defp decode(<<unquote(name), rest::binary>>, acc) do
      # acc is built in reverse, so reverse multi-codepoint entities
      # before prepending to preserve their internal order
      decode(rest, Enum.reverse(unquote(codepoints)) ++ acc)
    end
  end

  defp decode(<<head::utf8, rest::binary>>, acc), do: decode(rest, [head | acc])

  defp decode(<<>>, acc), do: acc |> Enum.reverse() |> List.to_string()
end

P.S. Thank you for a great lib.

martinsvalin commented 4 years ago

Right, I noticed the footnote about which entities allow dropping the semi-colon now that I've read the wiki entry more carefully. Let's open a separate issue for this. I'm currently working on a mix task to make it easy to generate my source file from a copy of the wikitable, and I've started adding support for the [a] footnote, marking the entities in my list that allow no semi-colon.
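One way this could be handled (a hypothetical sketch, not the actual mix task: the module name, data shape, and the semicolon_optional? flag are all made up for illustration) is to expand each semicolon-optional entity into both spellings before generating clauses, then sort longest-first so the full spelling always wins:

```elixir
defmodule SemicolonSketch do
  # Hypothetical data shape: {entity, codepoints, semicolon_optional?}.
  # &amp and &lt really are valid without a semicolon in HTML.
  @entities [
    {"&amp;", [?&], true},
    {"&lt;", [?<], true}
  ]

  # Expand semicolon-optional entities into both forms, longest first,
  # so pattern-match clauses generated from this list try "&amp;" before "&amp".
  def names do
    @entities
    |> Enum.flat_map(fn
      {name, cps, true} -> [{name, cps}, {String.trim_trailing(name, ";"), cps}]
      {name, cps, false} -> [{name, cps}]
    end)
    |> Enum.sort_by(fn {name, _cps} -> -byte_size(name) end)
  end
end
```

The resulting list can feed the same `for {name, codepoints} <- codes` clause generation shown above.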

martinsvalin commented 4 years ago

As for entities that can decode to multiple codepoints, that should be tackled in this issue, or the html 5 entities won't decode properly. Seems simple enough: we'll turn the codepoint part into a list and replace the entity with all of them.
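A minimal sketch of that idea, with the entity map hard-coded here for illustration (the real data comes from the generated source file; &NotEqualTilde; is one of the HTML 5 entities that maps to two codepoints, U+2242 U+0338):

```elixir
defmodule MultiCodepointSketch do
  # Illustrative subset only: each entity maps to a *list* of codepoints.
  @entities %{
    "&amp;" => [?&],
    "&NotEqualTilde;" => [0x2242, 0x0338]
  }

  # Replace every occurrence of each entity with all of its codepoints.
  def decode(text) do
    Enum.reduce(@entities, text, fn {name, codepoints}, acc ->
      String.replace(acc, name, List.to_string(codepoints))
    end)
  end
end
```

For example, MultiCodepointSketch.decode("x &amp; y") gives "x & y", while the two codepoints of &NotEqualTilde; come out as a single combined glyph.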

paveltyk commented 4 years ago

Take a look at this file: https://html.spec.whatwg.org/entities.json Might be worth using it instead of the wiki table.