Digging further, it looks like this is the line that is failing:
<<CP/utf8>> = 'Elixir.HtmlEntities':decode(<<$&, Raw/binary, $;>>),
in floki/src/floki_mochi_html.erl
I don't know Erlang, but does this mean that it won't parse non-UTF-8 strings?
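For what it's worth, the pattern `<<CP/utf8>>` in Erlang (written `<<cp::utf8>>` in Elixir) matches a binary that holds exactly one UTF-8 encoded code point. So the match can fail on perfectly valid UTF-8 whenever the right hand side is longer than a single character. Here is a minimal sketch of that behaviour; the module and function names are made up for illustration and are not part of Floki.

```elixir
defmodule SingleCodepointDemo do
  # Hypothetical helper: returns true only when the binary is exactly
  # one UTF-8 encoded code point, mirroring the <<CP/utf8>> pattern.
  def single_codepoint?(binary) when is_binary(binary) do
    case binary do
      <<_cp::utf8>> -> true
      _ -> false
    end
  end
end

# One code point (even a multi-byte one) matches:
SingleCodepointDemo.single_codepoint?("'")
SingleCodepointDemo.single_codepoint?("‐")

# Several code points, or invalid UTF-8, do not:
SingleCodepointDemo.single_codepoint?("&C'")
SingleCodepointDemo.single_codepoint?(<<0xFF>>)
```

With a bare match instead of `case`, the last two inputs would raise a MatchError, which looks like what is reported here.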
Just to confirm, I get a MatchError on the same line (src/floki_mochi_html.erl:705) with "‐ as the right hand side value.
It might be related to my HTML having a \uFEFF character in the first position, but the error occurs even after stripping it out.
Hello, from more investigation it looks like we can narrow the failing cases down to HTML that has a & in it.
It looks like when we hit a & the parser assumes that we are about to hit one of these things: &#39; (I'm not sure what the name is? An encoded character??). HOWEVER, we can sometimes legitimately be hitting a & followed by an escaped apostrophe, for example if the HTML says something like "Check out these T&C's" (terms and conditions). In that case the apostrophe would be escaped to &#39;, meaning Floki sees it as T&C&#39;s. It then sees the first & and incorrectly thinks "this must be escaped html".
I have no idea how to fix it.
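One possible stopgap on the scraping side, until the parser handles this itself, is to escape stray ampersands before handing the HTML over. The sketch below is only an illustration: the module name `StrayAmpersand` is hypothetical, the input `T&C&#39;s` is an assumption about what the scraped markup looks like, and the regex only checks whether the text after & is shaped like a character reference, not whether the name is a real one.

```elixir
defmodule StrayAmpersand do
  # Hypothetical pre-processing step, not part of Floki or HtmlEntities.
  # Replaces every "&" that is NOT followed by something shaped like
  # a character reference (&name; &#123; or &#x1F;) with "&amp;".
  def escape(html) when is_binary(html) do
    stray = ~r/&(?!#[0-9]+;|#[xX][0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)/

    # A function replacement avoids any special handling of the
    # replacement string.
    Regex.replace(stray, html, fn _match -> "&amp;" end)
  end
end

StrayAmpersand.escape("T&C&#39;s")
# => "T&amp;C&#39;s"
```

The result could then be passed to Floki.parse as before. Text such as `T&Cs;` would still be mistaken for a reference, so this narrows the problem rather than removing it.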
@Adzz & should be encoded as &amp;. Your HTML should look like T&amp;C's.
iex(1)> HtmlEntities.decode("T&C's")
"T&C's"
iex(2)> HtmlEntities.decode("T&C's")
"T&C's"
I think it still shouldn't crash, though.
Sure, it should be, but I'm not in charge of writing the HTML that I'm scraping, and I've seen this exact case in the wild.
@Adzz @oskar1233 Thank you for the investigation! And thank you for opening the issue, @Adzz. It was fixed in version 0.23.1. Can you try again with that version?
Description
If I try to run Floki.parse on this html it fails:
I get the error:
Expected behavior
I think it's valid HTML, so it should parse.