Selecting "p" elements duplicates them

StanBright commented 7 years ago

I'm trying to extract the body of some comments on HackerNews. If I select the surrounding "div" element and pass it to Floki.raw_html/1, everything looks fine. However, if I select the inner "p" elements and pass the result to Floki.raw_html/1, all the "p" elements come out duplicated. It's really weird. I spent a whole day looking for a workaround and couldn't find one. Maybe I'm missing something... :/

Here is how this can be reproduced (Floki v0.17.2):

response = HTTPoison.get!("https://news.ycombinator.com/item?id=14466365")
comment = response.body |> Floki.find("table#hnmain tr.athing") |> List.first
# this one is OK
comment |> Floki.raw_html
# however, trying to extract the content of the internal "p" elements
# triggers the bug/issue, and everything gets duplicated. e.g.
comment |> Floki.find("p") |> Floki.raw_html

Some additional observations:

The HTML on the page might be invalid/incorrect
There is a wrapper element around all the "p" elements

Any idea how this could be fixed or worked around?

Thanks!

mischov commented 7 years ago

Look at the result of comment |> Floki.raw_html. In my case (using the mochiweb parser) I'm seeing a ton of repeated </p>s before the end, and no </p>s in the middle, only nested <p>s.

The duplication is because each <p> contains the subsequent <p>s.
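To make the nesting point concrete, here is a minimal sketch in plain Elixir. It is a simplified model, not Floki's actual implementation: NestedFindSketch and its functions are hypothetical names, and the tree is hand-written to mimic what a parser might produce when "p" elements are never closed. It only shows that collecting every matching node, including matches nested inside other matches, and then serializing the whole list repeats the inner content.

```elixir
# Simplified model for illustration only; this is NOT Floki's code.
defmodule NestedFindSketch do
  # Return every node with the wanted tag, in document order,
  # including matches that sit inside other matches.
  def find({tag, _attrs, children} = node, wanted) do
    nested = Enum.flat_map(children, &find(&1, wanted))
    if tag == wanted, do: [node | nested], else: nested
  end

  # Text nodes never match a tag selector.
  def find(text, _wanted) when is_binary(text), do: []

  # Serialize a node back to HTML (attributes omitted for brevity).
  def to_html({tag, _attrs, children}) do
    "<#{tag}>" <> Enum.map_join(children, "", &to_html/1) <> "</#{tag}>"
  end

  def to_html(text) when is_binary(text), do: text
end

# Hand-written tree (an assumption about the parsed shape):
# each "p" ends up containing the next one.
tree =
  {"div", [],
   [{"p", [], ["first", {"p", [], ["second", {"p", [], ["third"]}]}]}]}

matches = NestedFindSketch.find(tree, "p")
html = Enum.map_join(matches, "", &NestedFindSketch.to_html/1)
IO.puts(html)
```

All three "p" nodes match, and each outer match already contains the inner ones, so the text of the innermost paragraph is serialized once per enclosing match.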

To work around this problem, don't use the mochiweb parser.

Things should work as expected if you use the html5ever parser.
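For anyone landing here later, switching parsers is a configuration change rather than a code change. The fragment below is a sketch based on the setup described in the Floki README. The version constraints are assumptions, so check the README for the Floki release you are on. As far as I know, html5ever is a Rust-backed NIF, so a Rust toolchain may be needed to compile it.

```elixir
# mix.exs: add html5ever next to floki
# (version constraints here are assumptions; check the Floki README)
defp deps do
  [
    {:floki, "~> 0.17"},
    {:html5ever, "~> 0.5.0"}
  ]
end

# config/config.exs: tell Floki to parse with html5ever instead of mochiweb
config :floki, :html_parser, Floki.HTMLParser.Html5ever
```

After running mix deps.get and recompiling, the reproduction steps above should go through the html5ever parser without further changes.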

StanBright commented 7 years ago

@mischov thanks for the quick response, mate! Using html5ever resolved the issue.

Thanks for the great work you are doing with Floki!

mischov commented 7 years ago

The great work with Floki is almost wholly @philss's. :)

philss / floki

Selecting "p" elements duplicates them #117