rusterlium / html5ever_elixir

NIF wrapper of html5ever using Rustler
https://hexdocs.pm/html5ever
Apache License 2.0

Allow extracting of comments from an HTML document #121

Closed tobstarr closed 6 months ago

tobstarr commented 10 months ago

I wonder if there is an easy way to extract comments embedded inside an HTML document.

I tried using html5ever with Floki. With the default parser, comments are present in the parsed document as

{:comment, "My Comment"}

but when I switch the parser to html5ever, they are stripped. This can be verified by running:

html = """
<html><title>Some Title</title><body><!-- some comment --></body></html>
"""

Floki.parse_document(html)
|> IO.inspect()

Floki.parse_document(html, html_parser: Floki.HTMLParser.Html5ever)
|> IO.inspect()

This results in the following output (default parser first, then html5ever):

{:ok,
 [
   {"html", [],
    [{"title", [], ["Some Title"]}, {"body", [], [comment: " some comment "]}]}
 ]}
{:ok,
 [
   {"html", [],
    [{"head", [], [{"title", [], ["Some Title"]}]}, {"body", [], ["\n"]}]}
 ]}
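
For reference, once comment nodes are present in the parsed tree, collecting them could look roughly like the sketch below. It assumes the tree shape shown in the output above: elements as `{tag, attrs, children}` tuples, comments as `{:comment, text}` tuples, and text as plain strings. `CommentExtractor` is a hypothetical helper written for illustration, not part of Floki or html5ever, and the tree is hard-coded so the snippet runs without either library.

```elixir
defmodule CommentExtractor do
  # Hypothetical helper (not a Floki or html5ever API).
  # Walks a parsed tree and returns the comment texts in document order.

  # A list of nodes: collect the comments of each node.
  def comments(nodes) when is_list(nodes), do: Enum.flat_map(nodes, &comments/1)

  # A comment node: keep its text.
  def comments({:comment, text}), do: [text]

  # An element node: descend into its children.
  def comments({tag, _attrs, children}) when is_binary(tag) and is_list(children),
    do: comments(children)

  # Text nodes and anything else carry no comments.
  def comments(_other), do: []
end

# Tree copied from the default parser's output above.
# [comment: "..."] is keyword-list syntax for [{:comment, "..."}].
tree = [
  {"html", [],
   [{"title", [], ["Some Title"]}, {"body", [], [comment: " some comment "]}]}
]

IO.inspect(CommentExtractor.comments(tree))
# prints [" some comment "]
```

Run against the html5ever output above, the same function returns an empty list, since that tree contains no comment nodes.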

florish commented 10 months ago

@tobstarr Thanks for opening this issue! We have been using html5ever in a project for a while, but were not aware of HTML comments not being supported until seeing this issue.

I hope this can be fixed and I am willing to help out, but my Rust knowledge is very limited, so I will need some guidance in order to be able to do something.

philss commented 6 months ago

Hey there! Sorry for the delay. It is fixed and I shall release a new version soon.