philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License

Doctype triggers no function clause matching raw_html/2 #155

Closed jhchen closed 6 years ago

jhchen commented 6 years ago

Parsing HTML that contains a doctype with Floki.parse and passing the result to Floki.raw_html raises ** (FunctionClauseError) no function clause matching in Floki.raw_html/2

# Does not work
"<!doctype html><html><head><title>Test</title></head><body>Body</body></html>"
|> Floki.parse()
|> Floki.raw_html()

# Works as expected (returning original string)
"<html><head><title>Test</title></head><body>Body</body></html>"
|> Floki.parse()
|> Floki.raw_html()

Inspecting the parsed output, the first tuple is {:doctype, "html", "", ""}, which has four elements, while the general clause only matches three-element tuples: defp raw_html([{type, attrs, children}|tail], html). Could that be the cause?
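
To make the shape mismatch concrete, here is a minimal sketch that does not use Floki at all. The NodeShape module and its element?/1 function are hypothetical names made up for illustration; they only mirror the three-element pattern quoted above.

```elixir
defmodule NodeShape do
  # Hypothetical helper (not part of Floki): returns true only for
  # three-element tuples, the shape the quoted clause destructures.
  def element?({_type, _attrs, _children}), do: true
  def element?(_other), do: false
end

element = {"html", [], []}
doctype = {:doctype, "html", "", ""}

IO.inspect(NodeShape.element?(element))  # true
IO.inspect(NodeShape.element?(doctype))  # false
IO.inspect(tuple_size(doctype))          # 4
```

Without a fallback clause like the second element?/1 above, a four-element tuple reaching a function that only has the three-element clause raises FunctionClauseError.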

philss commented 6 years ago

@jhchen I think you are right. Just one thing: can you confirm whether you are using the HTML5ever parser? I think the Mochiweb parser removes the doctype, which may be why we didn't see this before.

jhchen commented 6 years ago

Yes, we are using html5ever 0.5.0 and floki 0.19.0.

philss commented 6 years ago

@jhchen I'm sorry for the delay. I think most of the problem is that we are not testing against the html5ever version. I created #156 to fix that.

I will probably be able to fix the bug this week. If you want to give it a try, the fix is probably to add another function clause like https://github.com/philss/floki/blob/v0.19.0/lib/floki.ex#L117-L119, but one that concatenates the strings to build a valid HTML5 doctype: https://www.w3.org/TR/html5/syntax.html#the-doctype
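
A rough, standalone sketch of that idea follows. It is not Floki's actual code: DoctypeSketch is a hypothetical module, its element rendering is deliberately simplified (no void elements, no escaping), and reading the tuple fields as name, public identifier, and system identifier is an assumption based only on the {:doctype, "html", "", ""} value reported above.

```elixir
defmodule DoctypeSketch do
  # Hypothetical, simplified renderer; not Floki's implementation.
  def raw_html(nodes), do: raw_html(List.wrap(nodes), "")

  defp raw_html([], html), do: html

  # Text nodes are plain binaries.
  defp raw_html([text | tail], html) when is_binary(text),
    do: raw_html(tail, html <> text)

  # Extra clause for the four-element doctype tuple.
  # Assumption: the fields are name, public identifier, system identifier.
  defp raw_html([{:doctype, name, public, system} | tail], html),
    do: raw_html(tail, html <> doctype(name, public, system))

  # General three-element clause, as in the pattern quoted in the report.
  defp raw_html([{type, attrs, children} | tail], html) do
    attr_str = Enum.map_join(attrs, fn {k, v} -> ~s( #{k}="#{v}") end)
    inner = raw_html(children, "")
    raw_html(tail, html <> "<#{type}#{attr_str}>" <> inner <> "</#{type}>")
  end

  # Concatenate the parts into a doctype string, skipping empty identifiers.
  defp doctype(name, "", ""), do: "<!DOCTYPE #{name}>"
  defp doctype(name, "", system), do: ~s(<!DOCTYPE #{name} SYSTEM "#{system}">)
  defp doctype(name, public, ""), do: ~s(<!DOCTYPE #{name} PUBLIC "#{public}">)
  defp doctype(name, public, system),
    do: ~s(<!DOCTYPE #{name} PUBLIC "#{public}" "#{system}">)
end

tree = [
  {:doctype, "html", "", ""},
  {"html", [], [{"head", [], [{"title", [], ["Test"]}]}, {"body", [], ["Body"]}]}
]

IO.puts(DoctypeSketch.raw_html(tree))
# <!DOCTYPE html><html><head><title>Test</title></head><body>Body</body></html>
```

Note that this sketch emits an uppercase DOCTYPE keyword, so the output would not be byte-identical to the lowercase input from the report; the keyword is case-insensitive in HTML.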

Thanks!