martinsvalin / html_entities

Elixir module for decoding HTML entities.
MIT License

Decoding malformed entities #21

Closed barthez closed 4 years ago

barthez commented 4 years ago

Hello,

Recently I stumbled upon a strange bug while parsing HTML with Floki. It causes an argument error in HtmlEntities.decode_entity/1:

iex(2)> Floki.parse_document("<title>&#55357;&#56470; San Francisco hotel deals designed for loving.</title>")
** (ArgumentError) argument error
    (html_entities) lib/html_entities.ex:56: HtmlEntities.decode_entity/1
    (html_entities) lib/html_entities.ex:33: HtmlEntities.decode/2
    (floki) src/floki_mochi_html.erl:701: :floki_mochi_html.tokenize_charref_raw/3
    (floki) src/floki_mochi_html.erl:651: :floki_mochi_html.tokenize_charref/2
    (floki) src/floki_mochi_html.erl:306: :floki_mochi_html.tokens/3
    (floki) src/floki_mochi_html.erl:83: :floki_mochi_html.parse/1
    (floki) lib/floki/html_parser/mochiweb.ex:10: Floki.HTMLParser.Mochiweb.parse_document/1

I believe the issue originates from an incorrectly encoded emoji (💖): it should have been encoded as &#128150;, but was instead encoded as the UTF-16 surrogate pair &#55357;&#56470;. This comes from the HTML body of an email, and I'm not sure who is to blame: the sender's email client or Gmail (I was fetching messages via the Gmail API).
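To make the arithmetic concrete, here is a minimal sketch of how a UTF-16 surrogate pair maps back to a single code point. The module and function names (SurrogateSketch, combine/2) are hypothetical and are not part of html_entities or Floki; this only illustrates why 55357 followed by 56470 corresponds to 128150.

```elixir
defmodule SurrogateSketch do
  # Hypothetical helper (not part of html_entities): combine a UTF-16
  # high surrogate (0xD800..0xDBFF) and low surrogate (0xDC00..0xDFFF)
  # into the code point they jointly represent.
  def combine(high, low)
      when high in 0xD800..0xDBFF and low in 0xDC00..0xDFFF do
    # Each surrogate carries 10 bits of the offset above 0x10000.
    0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
  end
end

# 55357 is 0xD83D and 56470 is 0xDC96.
IO.inspect(SurrogateSketch.combine(55357, 56470))
# prints 128150
```

The result, 128150 (U+1F496), is the code point that a correctly encoded entity would have carried directly.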

Ultimately, I believe HtmlEntities.decode_entity/1 should return :error in such a case. The argument error is raised from this line: https://github.com/martinsvalin/html_entities/blob/e9d55f1da3f14813fc6fff804453d77c8547dd91/lib/html_entities.ex#L56

<<55357::utf8>> is not valid, because 55357 (0xD83D) is a UTF-16 high surrogate rather than a Unicode scalar value.
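One way the suggested behaviour could look is sketched below: check that the number is a Unicode scalar value before building the binary, and return :error otherwise. This is an assumption about a possible fix, not the library's actual code or what was shipped; SafeCodepointSketch and to_utf8/1 are hypothetical names.

```elixir
defmodule SafeCodepointSketch do
  # Hypothetical helper (not the html_entities implementation).
  # Valid Unicode scalar values are 0..0x10FFFF, excluding the
  # surrogate range 0xD800..0xDFFF.
  def to_utf8(codepoint)
      when codepoint in 0..0xD7FF or codepoint in 0xE000..0x10FFFF do
    {:ok, <<codepoint::utf8>>}
  end

  # Surrogates, negative numbers, and values above 0x10FFFF end up here
  # instead of raising ArgumentError.
  def to_utf8(_codepoint), do: :error
end

IO.inspect(SafeCodepointSketch.to_utf8(55357))
# prints :error
IO.inspect(SafeCodepointSketch.to_utf8(65))
# prints {:ok, "A"}
```

With a guard like this, a caller could leave the malformed entity text untouched when :error comes back, rather than crashing the whole parse.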

It should be an easy fix and a good candidate for a first PR.

Best!

martinsvalin commented 4 years ago

Hi! Thank you for reporting this issue. Investigating it led me to several other issues. I've pushed a new version; can you try 0.5.1 and verify that it works for you?

barthez commented 4 years ago

Yes! It works! Thank you very much for the quick response 🙂