philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License

Unhandled error for Floki.parse_fragment/2 #444

Closed: fireproofsocks closed this issue 1 year ago

fireproofsocks commented 1 year ago

Description

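Floki.parse_fragment/2 raises an unhandled ArgumentError when parsing a malformed fragment. Judging by the stack trace, the trigger appears to be the invalid numeric character reference `&#-37;` near the end of the input.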

To Reproduce

Steps to reproduce the behavior:

iex> input =  "<div style=\"text-align:center;width:100%;margin:22px 0;height:1px;border-top:1px solid #DDDDDD\"></div> <center><div class=\"transparency-container aplus-content-container\"> <a href=\"/b?node=12691228011\"><h3><img src=\"https://images-na.ssl-images-amazon.com/images/G/01/img16/pc/easychoice/landing/easychoice_landing_header.jpg\" width=\"65%\"/></h3></a></center></div> <div style=\"text-align:center;width:100%;margin:22px 0;height:1px;border-top:1px solid #DDDDDD\"></div><B>Internal Modem</B><br> NETGEAR's DG814.<B>Comprehensive</B><br> DG814s (such as NetMeeting).<B>Protective</B><br>. NAT (Network Address Translation).<B>Powerful</B><br> Ultra-fast 10/100 m (328 ft). a 50&#-37;me.<B>Uncomplicated</B><br>"

iex> Floki.parse_fragment(input)

** (ArgumentError) argument error
    (floki 0.34.0) lib/floki/entities.ex:16: Floki.Entities.decode/1
    (floki 0.34.0) src/floki_mochi_html.erl:700: :floki_mochi_html.tokenize_charref_raw/3
    (floki 0.34.0) src/floki_mochi_html.erl:650: :floki_mochi_html.tokenize_charref/2
    (floki 0.34.0) src/floki_mochi_html.erl:298: :floki_mochi_html.tokens/3
    (floki 0.34.0) src/floki_mochi_html.erl:83: :floki_mochi_html.parse/1
    (floki 0.34.0) lib/floki/html_parser/mochiweb.ex:10: Floki.HTMLParser.Mochiweb.parse_document/1

Expected behavior

I would expect Floki.parse_fragment/2 to return an error tuple.
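
Until a fixed release is available, one possible workaround is to wrap the call and convert the exception into the error tuple described above. This is only a sketch: `SafeFragment` and its `parser` argument are hypothetical names introduced for illustration and are not part of Floki; the only Floki call assumed is `Floki.parse_fragment/1`, which returns `{:ok, tree}` or `{:error, message}`.

```elixir
defmodule SafeFragment do
  # Hypothetical helper, not part of Floki.
  # `parser` defaults to Floki.parse_fragment/1 and is injectable
  # so the wrapper can be exercised without Floki installed.
  def parse(html, parser \\ &Floki.parse_fragment/1)
      when is_binary(html) and is_function(parser, 1) do
    # On success this passes through whatever the parser returns.
    parser.(html)
  rescue
    # Turn the ArgumentError seen in the stack trace into an error tuple.
    e in ArgumentError -> {:error, Exception.message(e)}
  end
end
```

With this in place, `SafeFragment.parse(input)` should yield an `{:error, message}` tuple on floki 0.34.0 instead of raising. Rescuing only `ArgumentError` keeps unrelated failures visible.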

philss commented 1 year ago

In this case the text will not be decoded; it will be kept as is.

This is fixed in the main branch, and I should release a new version soon. Thanks!
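
The "keep it as is" behavior described in this reply can be illustrated with a small, self-contained sketch. It is not Floki's actual implementation: `LenientCharref` is a hypothetical module, and it only handles decimal numeric references of the form `&#NNN;`.

```elixir
defmodule LenientCharref do
  # Hypothetical illustration; not Floki's actual code.
  # Decodes a decimal numeric character reference, or returns the
  # input unchanged when it is malformed.
  def decode(raw) when is_binary(raw) do
    case Regex.run(~r/\A&#(\d+);\z/, raw) do
      [_, digits] -> to_utf8(String.to_integer(digits), raw)
      # Anything that is not "&#<digits>;" (e.g. "&#-37;") is left as is.
      nil -> raw
    end
  end

  defp to_utf8(code, raw) do
    <<code::utf8>>
  rescue
    # Invalid code points (surrogates, out-of-range values) also fall back.
    ArgumentError -> raw
  end
end
```

Here `LenientCharref.decode("&#37;")` gives `"%"`, while `LenientCharref.decode("&#-37;")` returns the original text untouched rather than raising.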