philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License

Issue with HTML entity in link URL query parameter in floki 0.23.0 #233

Closed. cheerfulstoic closed this issue 4 years ago

cheerfulstoic commented 4 years ago

Description

As far as I can tell, an HTML entity in the query parameter value of a link's URL crashes the parser starting with floki 0.23.0. It might also happen in other circumstances. The markup may not be valid HTML, but I ran into it in the wild, and the parser probably shouldn't crash on it, especially since the same input parsed fine previously.

To Reproduce

Steps to reproduce the behavior:

Using:

Erlang/OTP 21 [erts-10.3.5] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [hipe] [dtrace]

Elixir 1.8.1 (compiled with Erlang/OTP 20)

When I try to parse this HTML:

<a href="http://foo.com/blah?hi=blah&foo=&#43;Park" class="foo">test</a>

... in floki 0.23.0 I get this error:

** (MatchError) no match of right hand side value: "&foo=&#43;"
    src/floki_mochi_html.erl:705: :floki_mochi_html.tokenize_charref_raw/3
    src/floki_mochi_html.erl:651: :floki_mochi_html.tokenize_charref/2
    src/floki_mochi_html.erl:546: :floki_mochi_html.tokenize_quoted_attr_value/4
    src/floki_mochi_html.erl:515: :floki_mochi_html.tokenize_attributes/3
    src/floki_mochi_html.erl:362: :floki_mochi_html.tokenize/2
    src/floki_mochi_html.erl:306: :floki_mochi_html.tokens/3
    src/floki_mochi_html.erl:83: :floki_mochi_html.parse/1
    lib/floki/html_parser/mochiweb.ex:7: Floki.HTMLParser.Mochiweb.parse/1

Expected behavior

With 0.22.0 I get the parsed result as expected:

{"a", [{"href", "http://foo.com/blah?hi=blah&foo=+Park"}, {"class", "foo"}],
 ["test"]}
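For anyone who wants to try this locally, the steps above can be sketched roughly as follows. This assumes floki is already a dependency of the project and that the snippet is run inside `iex -S mix`. It also assumes `Floki.parse/1` as the entry point, since the stack trace only shows the internal `Floki.HTMLParser.Mochiweb.parse/1`. The `try`/`rescue` wrapper is only there so that both outcomes can be inspected side by side.

```elixir
# Sketch only: assumes floki is in deps and that Floki.parse/1 is the
# public function being called (the report does not show the exact call).
html = ~s(<a href="http://foo.com/blah?hi=blah&foo=&#43;Park" class="foo">test</a>)

result =
  try do
    # On 0.22.0 the report says this returns the parsed {"a", attrs, children} tuple.
    {:ok, Floki.parse(html)}
  rescue
    # On 0.23.0 the report shows a MatchError raised from floki_mochi_html.
    error in MatchError -> {:error, error}
  end

IO.inspect(result)
```

Which branch is taken depends on the floki version that is installed.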

Standard Thanks

Thanks so much for floki!

philss commented 4 years ago

@cheerfulstoic Thank you for opening the issue! I'm going to take a look. It seems to be related to the replacement of mochiweb's charref code here: https://github.com/philss/floki/commit/1198b834f13426b3d8d91f76c5dfbadcc2d94f9d#diff-351de3e8b879ad69ad9c165ca6d8ae9dR705

It is also related to #235

philss commented 4 years ago

@cheerfulstoic It was fixed in version 0.23.1. Can you try again with that version?
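
To pick up the fixed release, the dependency entry in `mix.exs` could be bumped along these lines. This is a sketch: the surrounding `deps/0` function follows the usual Mix project layout and is not taken from this thread, and the `~> 0.23.1` requirement is just one way to ask for 0.23.1 or a later 0.23.x patch release.

```elixir
# mix.exs (fragment): hypothetical deps/0, shown only to illustrate the version bump
defp deps do
  [
    # "~> 0.23.1" allows >= 0.23.1 and < 0.24.0
    {:floki, "~> 0.23.1"}
  ]
end
```

After editing the file, `mix deps.update floki` fetches the newer version.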