philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License
2.07k stars 156 forks source link

Floki fails to parse Some HTML #235

Closed Adzz closed 4 years ago

Adzz commented 5 years ago

Description

If I try to run Floki.parse on this html it fails:

<!DOCTYPE html>
<html lang="en-GB" class="no-js ">
  <head> </head>
  <body>
    <br />*T&C&#39;s Apply.<br />
  </body>
</html>

I get the error:

** (MatchError) no match of right hand side value: "&C&#39;"
    (floki) src/floki_mochi_html.erl:705: :floki_mochi_html.tokenize_charref_raw/3
    (floki) src/floki_mochi_html.erl:651: :floki_mochi_html.tokenize_charref/2
    (floki) src/floki_mochi_html.erl:306: :floki_mochi_html.tokens/3
    (floki) src/floki_mochi_html.erl:83: :floki_mochi_html.parse/1
    (floki) lib/floki/html_parser/mochiweb.ex:7: Floki.HTMLParser.Mochiweb.parse/1

Expected behavior

I think it's valid HTML, so it should parse

Adzz commented 5 years ago

Digging further it looks like this is the line that is failing:

<<CP/utf8>> = 'Elixir.HtmlEntities':decode(<<$&, Raw/binary, $;>>),

in floki/src/floki_mochi_html.erl

I don't know erlang, but does this mean that it wont parse non UTF8 strings?

oskar1233 commented 5 years ago

Just to confirm, I have MatchError on the same line (src/floki_mochi_html.erl:705) with "&dash; right hand side value.

Might be somehow related that my HTML had \uFEFF character on 1st position but the error occurs even after stripping it out.

Adzz commented 4 years ago

Hello, from more investigation it looks like we can strip the failing cases to being html that has a & in it.

It looks like when we hit a & the parser assumes that are about to hit one of these things &#39; (I'm not sure what the name is? An encoded character??). HOWEVER We can sometimes legitimately be hitting a & followed by an escaped apostrophe, for example if in the HTML we say something like Check out these T&C's (terms and conditions). In that case the apostrophe would be escaped to this &#39; meaning Floki sees it as T&C&#39;s. It then incorrectly sees the first & and thinks "this must be escaped html".

I have no idea how to fix.

oskar1233 commented 4 years ago

@Adzz & should be encoded as &amp;. Your HTML should look like T&amp;C&#39;s.

iex(1)> HtmlEntities.decode("T&C&#39;s")
"T&C&#39;s"
iex(2)> HtmlEntities.decode("T&amp;C&#39;s")
"T&C's"

I think it still shouldn't crash, though.

Adzz commented 4 years ago

Sure it should be, but i’m not in charge of writing the html that i’m scraping and I’ve seen this exact case in the wild

philss commented 4 years ago

@Adzz @oskar1233 Thank you for the investigation! And thank you for opening the issue, @Adzz. It was fixed in version 0.23.1. Can you try again with that version?