philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License
2.05k stars 155 forks

Unexpected "/floki" tag #50

Open hauntedhost opened 8 years ago

hauntedhost commented 8 years ago

Via #37, it looks like handling unclosed tags is a goal for the built-in HTML parser. But I was surprised that a tag named floki was added to the parsed HTML tree here. Is this expected behavior?

iex> "<div>hello <h1>world</h1><script>alert('wat');</div>" |> Floki.parse
{"div", [],
 ["hello ", {"h1", [], ["world"]},
  {"script", [], ["alert('wat');</div></floki>"]}]}

philss commented 8 years ago

@somlor thanks for the report! No, this is not expected. The floki tag is a "hack" to work around a problem with the mochiweb_html parser.

The problem is that the parser does not handle HTML snippets without a parent tag well (for example, <span>Click</span><button>Here</button> is parsed as {"span", [], ["Click"]}, so the button is lost).

Removing this closing </floki> tag can fix the problem, but then one test fails. I'm trying to think of a better way to fix this.
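
To make the wrapper "hack" concrete, here is a minimal sketch of the idea in plain Elixir. It is not Floki's actual source: the module name `WrapperSketch` and the functions `wrap/1` and `unwrap/1` are hypothetical, and the tuple shape `{tag, attributes, children}` is assumed from the output shown in this issue. The sketch only builds the wrapped string and strips the synthetic root from an already parsed tree; it does not call any parser.

```elixir
defmodule WrapperSketch do
  # Hypothetical illustration only; names and structure are assumptions.
  @root "floki"

  # Wrap a snippet in a synthetic root element so that a parser expecting
  # a single top-level element sees every sibling node.
  def wrap(html) when is_binary(html) do
    "<#{@root}>#{html}</#{@root}>"
  end

  # Strip the synthetic root from a parsed tree.
  # A single child is returned as-is; several children come back as a list.
  def unwrap({@root, [], [single]}), do: single
  def unwrap({@root, [], children}), do: children
  def unwrap(other), do: other
end

# The wrapped string that would be handed to the parser.
IO.inspect(WrapperSketch.wrap("<span>Click</span><button>Here</button>"))

# Unwrapping a tree that still has the synthetic root keeps both siblings.
tree = {"floki", [], [{"span", [], ["Click"]}, {"button", [], ["Here"]}]}
IO.inspect(WrapperSketch.unwrap(tree))
```

Under this scheme, the reported output is what one would expect when the snippet ends inside an unclosed <script>: the appended </floki> arrives while the parser is still reading the script body, so it is kept as text along with </div> instead of closing the root.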

hauntedhost commented 8 years ago

@philss OK, makes sense. Let me know if there is anything I can do to help.