ndmitchell / tagsoup

Haskell library for parsing and extracting information from (possibly malformed) HTML/XML documents

Added an optTagStrictXML option to ParseOptions #68

Open ChristopherKing42 opened 6 years ago

ChristopherKing42 commented 6 years ago

Based on https://www.w3schools.com/xml/xml_syntax.asp, except for the rule that XML attribute values must be quoted.

I wasn't able to figure out where in the code it was dropping the quotes for attributes. If you could point me there, I could make it fully compliant.

(Will close https://github.com/ndmitchell/tagsoup/issues/7 when this feature is added.)

ndmitchell commented 6 years ago

Interesting idea, but I'm not convinced this is the right approach to implementing it. I would have thought a pass that takes a [Tag a] and produces a [Tag a] with additional warnings inserted would be the way to go. Going through TagTree makes the process overly strict, is quite complex, and doesn't seem to add much: really it just matches opening and closing tags, and that's easy enough to do as a stream. Or is there some benefit TagTree gives that I'm not seeing?
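For illustration, a minimal sketch of such a stream pass, assuming warnings are reported as `TagWarning` values inserted into the output. The name `checkNesting` and the warning messages are hypothetical and not part of tagsoup, and the sketch is fixed to `String` rather than a general `[Tag a]` to keep it short:

```haskell
import Text.HTML.TagSoup (Tag(..))

-- | Hypothetical helper (not part of tagsoup): walk the tag stream once,
-- keeping a stack of the names of currently open tags. A 'TagWarning' is
-- inserted after any close tag that does not match the innermost open tag,
-- and one warning per still-open tag is appended at the end of the stream.
--
-- Limitation of this sketch: tags such as "?xml" or "!DOCTYPE", which
-- tagsoup also represents as 'TagOpen', are not special-cased.
checkNesting :: [Tag String] -> [Tag String]
checkNesting = go []
  where
    -- End of input: everything left on the stack was never closed.
    go stack [] = [TagWarning ("Unclosed tag: " ++ n) | n <- stack]
    -- Open tag: push its name.
    go stack (t@(TagOpen name _) : rest) = t : go (name : stack) rest
    -- Close tag: pop if it matches the top of the stack, otherwise warn
    -- and leave the stack unchanged.
    go stack (t@(TagClose name) : rest) = case stack of
      top : stack' | top == name -> t : go stack' rest
      _ -> t : TagWarning ("Unexpected close tag: " ++ name) : go stack rest
    -- Text, comments, existing warnings, positions: pass through.
    go stack (t : rest) = t : go stack rest
```

Because the output is produced as the input is consumed, this stays lazy in the tag stream, unlike building a full tree first.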

As a separate point, hs-boot files are usually a nightmare to work with, so I always avoid them.

ChristopherKing42 commented 6 years ago

The reason I used TagTree is to make sure tags are opened and closed in the correct order. For example,

<html><body></html></body>

is incorrect XML. I guess that isn't too hard to implement on its own, but it's precisely what tagTree does. If you think it should be done another way, though, that's fine.
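As a rough sketch of the TagTree-based check described here: it assumes that `tagTree` leaves any open or close tag it cannot pair up as a `TagLeaf` instead of folding it into a `TagBranch`. The helper name `unmatched` is hypothetical and not part of tagsoup.

```haskell
import Text.HTML.TagSoup (Tag(..), parseTags)
import Text.HTML.TagSoup.Tree (TagTree(..), tagTree)

-- | Hypothetical helper (not part of tagsoup): collect every open or close
-- tag that survives as a bare leaf in the tree, on the assumption that
-- 'tagTree' only turns correctly paired tags into 'TagBranch' nodes.
unmatched :: [TagTree String] -> [Tag String]
unmatched = concatMap go
  where
    go (TagBranch _ _ children)  = unmatched children
    go (TagLeaf t@(TagOpen _ _)) = [t]
    go (TagLeaf t@(TagClose _))  = [t]
    go (TagLeaf _)               = []  -- text, comments, warnings, positions

-- The mis-nested document from the example above.
misnested :: [Tag String]
misnested = unmatched (tagTree (parseTags "<html><body></html></body>"))
```

Under that assumption, a well-nested document yields an empty list, while the mis-nested one yields the tags that could not be paired. The whole tree has to be built before any result is available.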

What do you think the API should be like? Just insert errors whenever the input violates the XML standard? Should there be some way to signal whether the whole document is correct, or should we just have the user verify that themselves?