philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License

Selecting "p" elements duplicates them #117

Closed (StanBright closed this issue 7 years ago)

StanBright commented 7 years ago

I'm trying to extract the body of some comments on Hacker News. If I select the surrounding "div" element and pass it through Floki.raw_html/1, everything looks fine. However, if I select an inner "p" element and pass the output through Floki.raw_html/1, all the "p" elements come out duplicated. It's really weird. I spent a whole day trying to find a workaround and couldn't. Maybe I'm missing something... :/

Here is how this can be reproduced (Floki v0.17.2):

response = HTTPoison.get!("https://news.ycombinator.com/item?id=14466365")
comment = response.body |> Floki.find("table#hnmain tr.athing") |> List.first
# this one is OK
comment |> Floki.raw_html
# however, trying to extract the content of the internal "p" elements
# triggers the bug/issue, and everything gets duplicated. e.g.
comment |> Floki.find("p") |> Floki.raw_html

Some additional observations:

Any idea how this could be fixed or worked around?

Thanks!

mischov commented 7 years ago

Look at the result of comment |> Floki.raw_html. In my case (using the mochiweb parser) I'm seeing a ton of repeated </p>s before the end, and no </p>s in the middle, only nested <p>s.

The duplication happens because each <p> contains the subsequent <p>s, so Floki.find("p") returns every <p> together with everything nested inside it.
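
To make the nesting effect concrete, here is a minimal sketch in plain Elixir that does not call Floki at all. The NestedP module and its find_all/2 and text/1 helpers are hypothetical names invented for illustration; they only mimic a descendant selector on a tuple tree shaped like {tag, attrs, children}. The hand-built tree is an assumption about what a parser produces when unclosed <p> tags end up nested inside each other.

```elixir
defmodule NestedP do
  # Hypothetical helper, not part of Floki: collects every node with the given
  # tag, including nodes nested inside other matches.
  def find_all(nodes, tag) when is_list(nodes) do
    Enum.flat_map(nodes, &find_all(&1, tag))
  end

  def find_all({name, _attrs, children} = node, tag) do
    nested = find_all(children, tag)
    if name == tag, do: [node | nested], else: nested
  end

  # Text nodes never match a tag.
  def find_all(text, _tag) when is_binary(text), do: []

  # Concatenates the text of a node and all of its descendants.
  def text(nodes) when is_list(nodes), do: Enum.map_join(nodes, "", &text/1)
  def text({_name, _attrs, children}), do: text(children)
  def text(text) when is_binary(text), do: text
end

# Assumed tree: three <p> tags that were never closed, so each one
# contains the next instead of sitting beside it.
tree = [
  {"div", [], [{"p", [], ["one ", {"p", [], ["two ", {"p", [], ["three"]}]}]}]}
]

matches = NestedP.find_all(tree, "p")

IO.inspect(length(matches))
# prints 3

IO.inspect(Enum.map(matches, &NestedP.text/1))
# prints ["one two three", "two three", "three"]
```

Each match carries its nested siblings along with it, so serializing all matches repeats the later paragraphs once per enclosing <p>.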

To work around this problem, don't use the mochiweb parser.

Things should work as expected if you use the html5ever parser.
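
For reference, switching parsers is a configuration change. The fragment below is a sketch based on Floki's README; the version requirement on the html5ever package is a placeholder assumption, so check the README for the one that matches your Floki release. Depending on the html5ever version, a Rust toolchain may be needed to build it.

```elixir
# mix.exs: add the html5ever package next to floki.
# The version requirement for html5ever is an assumption; adjust as needed.
defp deps do
  [
    {:floki, "~> 0.17"},
    {:html5ever, ">= 0.0.0"}
  ]
end

# config/config.exs: tell Floki to parse with html5ever instead of mochiweb.
config :floki, :html_parser, Floki.HTMLParser.Html5ever
```

After running mix deps.get and recompiling, the reproduction above should yield sibling <p> elements rather than nested ones.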

StanBright commented 7 years ago

@mischov thanks for the quick response, mate! Using html5ever resolved the issue.

Thanks for the great work you are doing with Floki!

mischov commented 7 years ago

The great work with Floki is almost wholly @philss's. :)