hecrj / html-parser

Parse HTML 5 in Elm
https://package.elm-lang.org/packages/hecrj/html-parser/latest/
BSD 3-Clause "New" or "Revised" License

DOCTYPE not supported #12

Closed ported-pw closed 4 years ago

ported-pw commented 5 years ago

Hello,

I tried to use the library to parse an entire HTML document and noticed that it does not recognize the <!DOCTYPE html> declaration, which the spec requires in an HTML document. Parsing a document that contains the DOCTYPE results in the following error: Err [{ col = 3, problem = Expecting "--", row = 2 }] (after <!, the parser expects a comment). A demo is here.

Thanks!

hecrj commented 5 years ago

The current Html.Parser.run does not parse full HTML documents.

Right now, it parses different HTML 5 nodes (text, comments, and elements) without checking document structure. This is meant to satisfy the common use case where you have a string of HTML elements and you want to bring them into Elm without adding a DOCTYPE, an html tag, a body tag, etc.
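As an illustration of that fragment use case, here is a minimal sketch that parses a string of elements with the existing Html.Parser.run and collects its text. The module name FragmentExample and the helpers textContent and nodeText are hypothetical; only run and the Node constructors (Text, Element, Comment) come from the package.

```elm
module FragmentExample exposing (textContent)

import Html.Parser exposing (Node(..))


{-| Parse an HTML fragment and concatenate all of its text nodes.
Parser dead ends are collapsed into a plain message to keep the sketch short.
-}
textContent : String -> Result String String
textContent html =
    Html.Parser.run html
        |> Result.map (List.map nodeText >> String.concat)
        |> Result.mapError (\_ -> "could not parse fragment")


{-| Collect the text inside a single node, descending into child elements
and skipping comments.
-}
nodeText : Node -> String
nodeText node =
    case node of
        Text value ->
            value

        Element _ _ children ->
            String.concat (List.map nodeText children)

        Comment _ ->
            ""
```

No html, head, or body wrapper is needed for input like "<p>Hello, <b>world</b>!</p>", which is the situation this function is designed for.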

That said, parsing whole HTML documents (ensuring structure validity) is a valid use case that we should also satisfy. I think we could implement a separate function with its own output type for parsing HTML documents.

ported-pw commented 5 years ago

Thanks for the quick reply. Yes, that sounds good. I would try to make that but since I'm quite new at Elm / functional programming in general I'm not sure what the "nicest" way of solving that would be. Also I'd try to implement some querying functions on the parsed tree like Python's bs4 has (I've already written some of it, but not for a general case).

hecrj commented 5 years ago

> Also I'd try to implement some querying functions on the parsed tree like Python's bs4 has (I've already written some of it, but not for a general case).

Interesting! I think we should keep html-parser scope limited to parsing. This functionality could be part of another package on top of html-parser.

> I would try to make that but since I'm quite new at Elm / functional programming in general I'm not sure what the "nicest" way of solving that would be.

I can try to take a look! I think the hardest part will be finding names for run and the new function. Maybe run could take the parser as an argument, with the type Parser a -> String -> Result (List DeadEnd) a, and then we could have body : Parser (List Node) and document : Parser Document.

That way we could use it like:

Parser.run Parser.body "<div>Hello, world!</div>"
Parser.run Parser.document "<!DOCTYPE html><html>...</html>"

We could also expose tag, text, comment, and other individual parsers!
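One possible reading of this proposal, sketched on top of elm/parser. Everything here is an assumption rather than the package's actual API: the module name DocumentSketch, the shape of the Document record, the doctype parser, and the reliance on Html.Parser.node being exposed by the installed version. It does not validate document structure; it only accepts a leading DOCTYPE and then parses nodes.

```elm
module DocumentSketch exposing (Document, body, doctype, document, run)

import Html.Parser exposing (Node)
import Parser exposing ((|.), (|=), DeadEnd, Parser, Step(..))


{-| Hypothetical output type: the raw DOCTYPE contents plus the parsed nodes. -}
type alias Document =
    { doctype : String
    , nodes : List Node
    }


{-| Run any of the parsers below, requiring that the whole input is consumed. -}
run : Parser a -> String -> Result (List DeadEnd) a
run parser input =
    Parser.run (parser |. Parser.end) input


{-| Parse nodes until the end of input (assumes Html.Parser.node is exposed). -}
body : Parser (List Node)
body =
    Parser.loop [] bodyStep


bodyStep : List Node -> Parser (Step (List Node) (List Node))
bodyStep reversed =
    Parser.oneOf
        [ Parser.end
            |> Parser.map (\_ -> Done (List.reverse reversed))
        , Html.Parser.node
            |> Parser.map (\parsed -> Loop (parsed :: reversed))
        ]


{-| Parse "<!DOCTYPE ...>" case-insensitively and keep what follows the keyword. -}
doctype : Parser String
doctype =
    Parser.succeed identity
        |. Parser.symbol "<!"
        |. doctypeKeyword
        |. Parser.spaces
        |= Parser.getChompedString (Parser.chompWhile (\c -> c /= '>'))
        |. Parser.symbol ">"


doctypeKeyword : Parser ()
doctypeKeyword =
    Parser.getChompedString (Parser.chompWhile Char.isAlpha)
        |> Parser.andThen
            (\word ->
                if String.toUpper word == "DOCTYPE" then
                    Parser.succeed ()

                else
                    Parser.problem "Expecting DOCTYPE"
            )


{-| A DOCTYPE followed by any number of nodes. -}
document : Parser Document
document =
    Parser.succeed Document
        |. Parser.spaces
        |= doctype
        |. Parser.spaces
        |= body
```

Passing the parser as an argument keeps a single entry point, and smaller parsers such as doctype can be exposed and composed without adding more run variants.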

sebn commented 4 years ago

> parsing whole HTML documents (ensuring structure validity)

Would it make sense to ignore doctype declarations somehow, so we could still try to parse whole HTML documents (whatever their doctype) and see whether it works?
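Until something like that exists in the package, one workaround is to drop the declaration before handing the string to Html.Parser.run. A minimal sketch, assuming the DOCTYPE is the first non-whitespace thing in the input and contains no ">" of its own; the module name IgnoreDoctype and both function names are hypothetical.

```elm
module IgnoreDoctype exposing (runIgnoringDoctype, stripDoctype)

import Html.Parser exposing (Node)
import Parser exposing (DeadEnd)


{-| Remove a leading "<!DOCTYPE ...>" (matched case-insensitively), if present.
Input without a leading DOCTYPE is returned unchanged.
-}
stripDoctype : String -> String
stripDoctype input =
    let
        trimmed =
            String.trimLeft input

        startsWithDoctype =
            String.startsWith "<!doctype" (String.toLower (String.left 9 trimmed))
    in
    if startsWithDoctype then
        case String.indexes ">" trimmed of
            closing :: _ ->
                -- Keep everything after the first ">" that ends the declaration.
                String.dropLeft (closing + 1) trimmed

            [] ->
                -- Unterminated declaration: leave the input for the parser to reject.
                input

    else
        input


{-| Parse with the existing Html.Parser.run after discarding any DOCTYPE. -}
runIgnoringDoctype : String -> Result (List DeadEnd) (List Node)
runIgnoringDoctype input =
    Html.Parser.run (stripDoctype input)
```

Because the declaration is removed before parsing, the row and col values in any reported dead ends refer to the stripped string, not the original document.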

janwirth commented 4 years ago

👍