ported-pw closed this issue 4 years ago
The current Html.Parser.run does not parse full HTML documents.
Right now, it parses the different HTML 5 nodes (text, comments, and elements) without checking document structure. This is meant to satisfy the common use case where you have a string of HTML elements and you want to bring them into Elm without adding a DOCTYPE, an html tag, a body tag, etc.
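For illustration, here is a minimal sketch of that fragment use case. It relies only on Html.Parser.run and the three Node constructors; the module name and the describe helper are made up for this example.

```elm
module FragmentExample exposing (describe, fragment)

import Html.Parser exposing (Node(..))
import Parser exposing (DeadEnd)


-- Parse a bare fragment: no DOCTYPE, html, or body wrapper is needed.
fragment : Result (List DeadEnd) (List Node)
fragment =
    Html.Parser.run "<div><p>Hello, world!</p></div>"


-- Hypothetical helper: summarize each of the three node kinds.
describe : Node -> String
describe node =
    case node of
        Text content ->
            "text: " ++ content

        Element tag _ children ->
            "element <"
                ++ tag
                ++ "> with "
                ++ String.fromInt (List.length children)
                ++ " child node(s)"

        Comment content ->
            "comment: " ++ content
```

Something like `Result.map (List.map describe) fragment` would then give one summary line per top-level node.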
That said, parsing whole HTML documents (ensuring structure validity) is a valid use case that we should satisfy too. I think we could implement a separate function, with its own output type, for parsing HTML documents.
Thanks for the quick reply. Yes, that sounds good. I would try to make that but since I'm quite new at Elm / functional programming in general I'm not sure what the "nicest" way of solving that would be. Also I'd try to implement some querying functions on the parsed tree like Python's bs4 has (I've already written some of it, but not for a general case).
> Also I'd try to implement some querying functions on the parsed tree like Python's bs4 has (I've already written some of it, but not for a general case).
Interesting! I think we should keep html-parser's scope limited to parsing. This functionality could be part of another package built on top of html-parser.
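As a rough idea of what such a separate package could offer, here is a sketch of one bs4-style query built purely on the exposed Node type. The Query module and the findAll name are hypothetical, not part of html-parser.

```elm
module Query exposing (findAll)

import Html.Parser exposing (Node(..))


-- Collect every element with the given tag name, parents before
-- their descendants, walking the whole tree.
findAll : String -> List Node -> List Node
findAll name nodes =
    List.concatMap (findInNode name) nodes


findInNode : String -> Node -> List Node
findInNode name node =
    case node of
        Element tag _ children ->
            let
                nested =
                    findAll name children
            in
            if tag == name then
                node :: nested

            else
                nested

        -- Text and comment nodes have no tag and no children.
        _ ->
            []
```

The comparison is a plain string equality on the tag name, so callers would have to pass the name in whatever casing the parser produces.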
> I would try to make that but since I'm quite new at Elm / functional programming in general I'm not sure what the "nicest" way of solving that would be.
I can try to take a look! I think the hardest part will be to find a new name for run and the new function. Maybe we can have the run function be Parser a -> String -> Result (List DeadEnd) a, and then have body : Parser (List Node) and document : Parser Document.
That way we could use it like:
Parser.run Parser.body "<div>Hello, world!</div>"
Parser.run Parser.document "<!DOCTYPE html><html>...</html>"
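To make the proposal a bit more concrete, here is one way it might be sketched on top of elm/parser. Everything in it is an assumption rather than the library's actual design: the HtmlSketch module, the Document record and its fields, and the reliance on Html.Parser.node being exposed (which depends on the package version). The doctype match is case-sensitive and no html/head/body structure is validated.

```elm
module HtmlSketch exposing (Document, body, document, run)

import Html.Parser exposing (Node)
import Parser exposing ((|.), (|=), DeadEnd, Parser)


-- Hypothetical output type for whole documents.
type alias Document =
    { doctype : String
    , nodes : List Node
    }


-- Generic entry point with the signature proposed above.
run : Parser a -> String -> Result (List DeadEnd) a
run =
    Parser.run


-- Zero or more nodes, with no check on document structure.
body : Parser (List Node)
body =
    Parser.loop [] bodyStep


bodyStep : List Node -> Parser (Parser.Step (List Node) (List Node))
bodyStep reversed =
    Parser.oneOf
        [ Parser.succeed (\node -> Parser.Loop (node :: reversed))
            |= Html.Parser.node
        , Parser.succeed (Parser.Done (List.reverse reversed))
        ]


-- A doctype declaration followed by the remaining nodes.
-- Assumption: only the upper-case spelling "<!DOCTYPE" is accepted.
document : Parser Document
document =
    Parser.succeed Document
        |. Parser.spaces
        |. Parser.token "<!DOCTYPE"
        |. Parser.spaces
        |= Parser.getChompedString (Parser.chompUntil ">")
        |. Parser.token ">"
        |. Parser.spaces
        |= body
        |. Parser.end
```

Keeping run generic means the same entry point serves fragments and documents, and only the parser passed in decides how strict the result is.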
We could also expose tag, text, comment, and other individual parsers!
> parsing whole HTML documents (ensuring structure validity)
Would it make sense to ignore doctype declarations somehow, so we could still try to parse whole HTML documents (whatever their doctype) and see whether it works?
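One low-effort way to try this with the current API might be to drop a leading doctype declaration from the input string before handing it to Html.Parser.run. The sketch below assumes the declaration, if present, is the first non-whitespace content and contains no ">" before its closing one; the module and function names are made up.

```elm
module LenientDocument exposing (parseIgnoringDoctype)

import Html.Parser exposing (Node)
import Parser exposing (DeadEnd)


-- Drop a leading <!DOCTYPE ...> declaration, whatever it declares.
-- The check is case-insensitive; "<!doctype" is 9 characters long.
stripDoctype : String -> String
stripDoctype input =
    let
        trimmed =
            String.trimLeft input
    in
    if String.toLower (String.left 9 trimmed) == "<!doctype" then
        case String.indexes ">" trimmed of
            closeIndex :: _ ->
                String.dropLeft (closeIndex + 1) trimmed

            -- Unterminated declaration: leave the input alone.
            [] ->
                trimmed

    else
        trimmed


-- Parse the rest of the document as an ordinary list of nodes.
parseIgnoringDoctype : String -> Result (List DeadEnd) (List Node)
parseIgnoringDoctype input =
    Html.Parser.run (stripDoctype input)
```

This only sidesteps the doctype; whether the remaining html, head, and body elements parse cleanly still depends on the node parser itself.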
👍
Hello,
I tried to use the library to parse an entire HTML document and noticed that it does not recognize the <!DOCTYPE html> tag, which is a requirement of an HTML document as per the spec. Trying to parse a document containing the DOCTYPE tag results in the following error:
Err [{ col = 3, problem = Expecting "--", row = 2 }]
(the parser expects a comment after <!). A demo is here.
Thanks!