Open paveltyk opened 4 years ago
This seems to be a comprehensive list of HTML entities: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
I can build a mapping file, if you are willing to use it in your project.
Hi, thank you for pointing this out. The list you referenced is actually what I used to generate this file, which is then used as a source for all the function clauses to cover these named entities.
The Wikipedia page has since been updated to include entities defined in HTML 5.0, growing the list from a few hundred to a few thousand entities.
It's a reasonable addition, but I'll think about whether this can be done in a nice way, so that users who only need to decode old documents from back when entities were more commonplace can have a slimmer, more performant dependency. Functionally it's a backwards-compatible change, but there will be some cost in performance and compiled file size. At the very least, I need to check what the impact is on both.
Where did you find a document in the wild with HTML 5.0 entities in it? I'm a little bit surprised as I don't see good reasons to encode characters beyond the ones needed to produce html-safe text these days.
We do a lot of web scraping, and there are many weird things in the wild :)
Please note there are quite a few entities with multiple codepoints. Also, I've noticed &amp and &amp; are both valid entities, so I had to sort entities in Util.HtmlCharref.Util.load_entities by their length, longest first. Otherwise "Tom &amp; Jerry" could be decoded to "Tom &; Jerry".
My quick solution to this (excerpt from your codebase):
defmodule Util.HtmlCharref do
  def decode(text) when is_binary(text), do: decode(text, [])
  def decode(text), do: text

  # https://html.spec.whatwg.org/entities.json
  @charref_filename "./lib/util/html_charref/entities.txt"
  # Names must arrive longest first so that "&amp;" is tried before "&amp".
  codes = Util.HtmlCharref.Util.load_entities(@charref_filename)

  for {name, codepoints} <- codes do
    defp decode(<<unquote(name), rest::binary>>, acc) do
      # acc is built in reverse; this assumes load_entities returns each
      # entity's codepoints in document order, so they are reversed here.
      decode(rest, unquote(Enum.reverse(codepoints)) ++ acc)
    end
  end

  defp decode(<<head::utf8, rest::binary>>, acc), do: decode(rest, [head | acc])
  defp decode(<<>>, acc), do: acc |> Enum.reverse() |> List.to_string()
end
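To illustrate why the order of the generated clauses matters, here is a minimal sketch that tries entities in list order at every position, which mimics the first-match-wins behavior of function clauses. The module OrderSketch and its functions are hypothetical names made up for this illustration, not part of the library or of the excerpt above.

```elixir
defmodule OrderSketch do
  # Decodes `text` by trying `entities` in the given order at each position.
  def decode(text, entities), do: decode(text, entities, [])

  defp decode(<<>>, _entities, acc), do: acc |> Enum.reverse() |> List.to_string()

  defp decode(text, entities, acc) do
    case Enum.find(entities, fn {name, _} -> String.starts_with?(text, name) end) do
      {name, codepoints} ->
        rest = binary_part(text, byte_size(name), byte_size(text) - byte_size(name))
        # acc is reversed, so the entity's codepoints go in reversed as well
        decode(rest, entities, Enum.reverse(codepoints) ++ acc)

      nil ->
        <<head::utf8, rest::binary>> = text
        decode(rest, entities, [head | acc])
    end
  end

  # Sorts entity names by byte length, longest first.
  def longest_first(entities) do
    Enum.sort_by(entities, fn {name, _} -> byte_size(name) end, :desc)
  end
end

entities = [{"&amp", [?&]}, {"&amp;", [?&]}]

OrderSketch.decode("Tom &amp; Jerry", entities)
#=> "Tom &; Jerry"
OrderSketch.decode("Tom &amp; Jerry", OrderSketch.longest_first(entities))
#=> "Tom & Jerry"
```

With the shorter name first, "&amp" matches and the semicolon is left behind; sorting longest first avoids that.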
P.S. Thank you for a great lib.
Right, I noticed the footnote about which entities allow dropping the semicolon now that I've read the wiki entry more carefully. Let's open a separate issue for this. I'm currently working on a mix task to make it easy to generate my source file from a copy of the wikitable, and I've started adding support for the [a] footnote, marking the entities in my list that allow omitting the semicolon.
As for entities that can decode to multiple codepoints, that should be tackled in this issue, or the HTML 5 entities won't decode properly. It seems simple enough: we'll turn the codepoint part into a list, and replace the entity with all of them.
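A rough sketch of that idea, assuming a small inline table instead of the real generated source file. The module name MultiCodepointSketch and the two-entry table are made up for illustration; &fjlig; is one of the HTML 5 entities that expands to two codepoints (f followed by j).

```elixir
defmodule MultiCodepointSketch do
  # Hypothetical table: every name maps to a LIST of codepoints.
  @entities [
    {"&fjlig;", [?f, ?j]},
    {"&amp;", [?&]}
  ]

  def decode(text) when is_binary(text), do: decode(text, [])

  for {name, codepoints} <- @entities do
    # The accumulator is built in reverse, so the entity's codepoints are
    # reversed at compile time before being prepended to it.
    defp decode(<<unquote(name), rest::binary>>, acc) do
      decode(rest, unquote(Enum.reverse(codepoints)) ++ acc)
    end
  end

  defp decode(<<head::utf8, rest::binary>>, acc), do: decode(rest, [head | acc])
  defp decode(<<>>, acc), do: acc |> Enum.reverse() |> List.to_string()
end

MultiCodepointSketch.decode("&fjlig;ord &amp; co")
#=> "fjord & co"
```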
Take a look at this file: https://html.spec.whatwg.org/entities.json It might be worth using it instead of the wiki table.
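For reference, that file maps each entity name to an object with "codepoints" and "characters" keys. A minimal sketch of turning it into {name, codepoints} tuples follows; it assumes Elixir 1.18 or later for the built-in JSON module (on older versions a JSON library would be needed), uses a three-entry inline sample rather than the real file, and the module name EntitiesJsonSketch is hypothetical.

```elixir
defmodule EntitiesJsonSketch do
  # Turns the JSON text into a list of {name, codepoints} tuples.
  def parse(json) do
    json
    |> JSON.decode!()
    |> Enum.map(fn {name, %{"codepoints" => codepoints}} -> {name, codepoints} end)
  end
end

# Inline sample in the same shape as entities.json
sample = ~S"""
{
  "&amp": {"codepoints": [38], "characters": "&"},
  "&amp;": {"codepoints": [38], "characters": "&"},
  "&fjlig;": {"codepoints": [102, 106], "characters": "fj"}
}
"""

sample |> EntitiesJsonSketch.parse() |> Enum.sort()
#=> [{"&amp", [38]}, {"&amp;", [38]}, {"&fjlig;", [102, 106]}]
# (IEx may print the integer lists as charlists)
```

Note that the file lists the semicolon-less forms such as "&amp" as separate keys, so no footnote handling is needed when it is used as the source.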
Expected:
HtmlEntities.decode("100&percnt;") #=> "100%"
Actual:
HtmlEntities.decode("100&percnt;") #=> "100&percnt;"