ndmitchell / tagsoup

Haskell library for parsing and extracting information from (possibly malformed) HTML/XML documents

Escape single quote (') characters as &#39; #74

Closed RyanGlScott closed 6 years ago

RyanGlScott commented 6 years ago

I was surprised to discover that the escapeXml function escapes double quotes (") but not single quotes ('). Digging into the source code of tagsoup, I found this comment:

https://github.com/ndmitchell/tagsoup/blob/99a43e9c62627a2541e55b2774bf0f1c9c31f11d/src/Text/HTML/TagSoup/Entity.hs#L73-L82

This suggests that single quotes aren't escaped because Internet Explorer doesn't support &apos;. But this feels a bit too conservative, since Internet Explorer does support the numeric form &#39;, as suggested here. (Credit goes to this stache pull request for that idea.)

Would you be open to escapeXml escaping single quotes as &#39; instead?

ndmitchell commented 6 years ago

Having Google'd, there doesn't seem to be a consistent set... Quite a lot do &<>, some more do &<>", others do &<>"' and some add / in there for reasons I can't fathom. I think you're right, I'd happily take a patch to make it escape ' to the numeric variant.

If this is the only function you're using in tagsoup, you may wish to consider using https://hackage.haskell.org/package/extra-1.6.9/docs/Data-List-Extra.html#v:escapeHTML instead (which also doesn't escape ', but for which I'd also happily take a patch).
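
For illustration, the behaviour being discussed (escaping &<>"' with the single quote mapped to its numeric reference) could be sketched roughly as follows. This is a minimal standalone sketch, not tagsoup's or extra's actual implementation; the name `escapeQuoted` is hypothetical and chosen only for this example.

```haskell
-- Hypothetical helper for illustration; not the tagsoup or extra source.
-- Escapes the five characters discussed in this thread: & < > " '
escapeQuoted :: String -> String
escapeQuoted = concatMap escapeChar
  where
    escapeChar :: Char -> String
    escapeChar '&'  = "&amp;"
    escapeChar '<'  = "&lt;"
    escapeChar '>'  = "&gt;"
    escapeChar '"'  = "&quot;"
    -- Use the numeric reference rather than &apos;, since the concern
    -- raised above is that older Internet Explorer does not support &apos;.
    escapeChar '\'' = "&#39;"
    escapeChar c    = [c]
```

Escaping both quote characters means the result should be safe to place inside either a single-quoted or a double-quoted attribute value, which is presumably why some libraries choose the larger set.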

RyanGlScott commented 6 years ago

I've opened #75 and ndmitchell/extra#38 to incorporate this fix into tagsoup and extra, respectively.