snoyberg / xml

Various XML utility packages for Haskell
71 stars 64 forks source link

Add option for recovering parser #192

Open diegodiv opened 1 year ago

diegodiv commented 1 year ago

We're sometimes confronted with managing malformed XML data that we can't fix because we didn't generate it and we don't have the possibilities to fix it upstream: these XML files come from somewhere outside of our control. We should still need to be able to parse them at least partially, if possible. So I added a recovery mode to xml-conduit, similar to what exists in lxml with the option recover. The target of this recovering parser is mainly unescaped characters.

k0ral commented 1 year ago

Disclaimer: I'm just a backup maintainer for xml-conduit and the following is just my 2 cents, @snoyberg feel free to overrule :slightly_smiling_face: .

Although this change seems perfectly functional, I do not feel comfortable merging it, and here is why.

Recovering from invalid XML documents in general sounds like a complex feature, one that xml-conduit doesn't claim to provide as far as I know. This is not to say that we shouldn't implement it ever, but if we are to go down the rabbit hole, I believe the topic should be treated as a primary concern in the design of the library, and not as an afterthought. Tackling each edge case in the current state of the library, sounds like a slippery slope to a proliferation of if/else in the codebase that, I believe, will damage its maintainability.

I hope I don't come off as negative/judgmental on the proposed change here, I understand that it may be beneficial to users in the short term, but I think it is detrimental to maintainers (and thus indirectly to the users as well) in the long run.

diegodiv commented 11 months ago

Thanks for the concise reply.

Recovering from invalid XML documents in general sounds like a complex feature, one that xml-conduit doesn't claim to provide as far as I know.

Albeit less complex, xml-conduit already provides targeted recovery features, in the form of psDecodeIllegalCharacters.

Tackling each edge case in the current state of the library, sounds like a slippery slope to a proliferation of if/else in the codebase that, I believe, will damage its maintainability.

At the moment, xml-conduit doesn't provide an interface that makes recovering from errors possible (in many situations): we patched the library in order to extend its behaviour. Nevertheless, the interface provides enough flexibility to recover in other settings, for example mismatched tags, without resorting to patch the library. If the library were not to implement directly more recovery features than it already does, a nice solution could be to provide such an interface.