Open DemiMarie opened 2 months ago
Unfortunately, our experience with encoding/xml is that making it stricter inevitably breaks existing working code.
Does that apply to rejecting clearly ill-formed XML (#68294, #68295)? Could there be new APIs that take a strict bool
flag?
One option for detecting future such bugs would be to differentially fuzz encoding/xml
against libxml2, which is known for its standards conformance.
We could introduce a new API. Or in some cases it might be more appropriate to introduce a GODEBUG
setting. But we can't make the package more strict without supporting easy fallbacks. Either approach would require a proposal; see https://go.dev/s/proposal.
Maybe it would be better to choose a 3rd party library to handle this? Do we have such an option now?
I’m not aware of an open-source pure-Go third-party XML tokenizer.
It seems the encoding/xml package is broken in various ways. The solution would be to make a new encoding/xml/v2 package so as not to break backwards compatibility. In doing so the API can also be improved. However, this is a large effort, and I suspect someone other than the Go dev team will have to do it.
It seems the encoding/xml package is broken in various ways.
That it is. There are several problems with it:
encoding/xml returns leading and trailing whitespace in a well-formed document as tokens. Changing this might break backwards compatibility. The same problem arises for the XML declaration.
xml.Unmarshal is sufficiently expressive to guarantee that only documents that validate against the schema are accepted, though it might actually be.
@DemiMarie my meaning is that this would be a great occasion to "scratch your own itch" and make such a Go language package for the benefit of the whole community. The Go developers likely have their hands full with other issues.
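For reference, the whitespace and XML declaration behavior mentioned above can be observed with a short program. This is only a sketch: tokenKinds is a hypothetical helper name and the sample document is made up for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// tokenKinds (hypothetical helper) decodes doc with Decoder.Token and
// returns a short description of every token the decoder reports.
func tokenKinds(doc string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var kinds []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return kinds, nil
		}
		if err != nil {
			return kinds, err
		}
		switch t := tok.(type) {
		case xml.ProcInst:
			kinds = append(kinds, "ProcInst:"+t.Target)
		case xml.CharData:
			kinds = append(kinds, fmt.Sprintf("CharData:%q", string(t)))
		case xml.StartElement:
			kinds = append(kinds, "Start:"+t.Name.Local)
		case xml.EndElement:
			kinds = append(kinds, "End:"+t.Name.Local)
		default:
			kinds = append(kinds, fmt.Sprintf("%T", t))
		}
	}
}

func main() {
	// An XML declaration, a newline, an empty root element, a newline.
	kinds, err := tokenKinds("<?xml version=\"1.0\"?>\n<a/>\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, k := range kinds {
		fmt.Println(k)
	}
}
```

The newlines before and after the root element come back as CharData tokens, and the XML declaration comes back as a ProcInst with target "xml", so callers that depend on either would notice if a stricter mode stopped reporting them.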
@bjorndm I don’t have anywhere near enough time for that, sorry. If they have interest, I think @russellhaering might be a good choice, since they have had to deal with the limitations of encoding/xml in their own libraries.
My current thoughts:
Name API is broken. Name needs to have both the prefix and the URL.
Sorry, but looking at the current use of the xml package, your point 4 is not correct. For example, for generating and parsing SVG or other XML documents it is necessary to generate and parse partial XML as well. For such use cases parsing is necessarily less strict.
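On the Name point above: xml.Name has only two fields, Space and Local, so a single token carries either the prefix or the namespace URL, never both. A minimal sketch of the difference between RawToken and Token follows; firstStart is a hypothetical helper and urn:example is a made-up namespace.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// firstStart (hypothetical helper) returns the name of the first start
// element in doc, read with RawToken when raw is true and Token otherwise.
func firstStart(doc string, raw bool) (xml.Name, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		var tok xml.Token
		var err error
		if raw {
			tok, err = dec.RawToken() // no namespace translation
		} else {
			tok, err = dec.Token() // prefix translated to namespace URL
		}
		if err != nil {
			return xml.Name{}, err
		}
		if se, ok := tok.(xml.StartElement); ok {
			return se.Name, nil
		}
	}
}

func main() {
	const doc = `<p:a xmlns:p="urn:example"/>`
	rawName, err := firstStart(doc, true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	cooked, err := firstStart(doc, false)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("RawToken: Space=%q Local=%q\n", rawName.Space, rawName.Local)
	fmt.Printf("Token:    Space=%q Local=%q\n", cooked.Space, cooked.Local)
}
```

With RawToken, Space holds the prefix p. With Token, Space holds urn:example and the prefix is no longer available from the Name itself.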
Do you have an example @bjorndm?
Well, excelize and svgo come to mind.
https://github.com/qax-os/excelize https://github.com/ajstarks/svgo
Can you provide specific files that need to be parsed?
Excelize parses and generates Microsoft Excel xlsx files. Svgo generates scalable vector graphics. You can find examples of those everywhere on the web. The hard part is supporting all features of these file formats. The source code of the projects mentioned is more instructive there.
Do either of those formats use DTDs? If not, they only need support for parsing & generating well-formed XML toplevel documents.
These formats can use DTDs. However, in practice, these two libraries and many others use encoding/xml to generate and parse XML fragments, often using xml.Marshal and xml.Unmarshal. The generation of XML fragments is thus an important use case.
Are these fragments ever not well-formed themselves?
These fragments are likely to be well formed, but they are not complete XML documents as they do not have the headers.
Well-formedness is sufficient here.
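To make the fragment use case concrete, here is a minimal sketch of xml.Marshal producing a bare element with no XML declaration, and xml.Unmarshal reading it back. The Rect type and the helper names are hypothetical and are not taken from excelize or svgo.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Rect is a hypothetical element type, loosely modeled on an SVG <rect>.
type Rect struct {
	XMLName xml.Name `xml:"rect"`
	Width   int      `xml:"width,attr"`
	Height  int      `xml:"height,attr"`
}

// marshalFragment encodes r as a bare element; xml.Marshal does not
// emit an XML declaration on its own.
func marshalFragment(r Rect) (string, error) {
	b, err := xml.Marshal(r)
	return string(b), err
}

// parseFragment decodes a bare element back into a Rect.
func parseFragment(s string) (Rect, error) {
	var r Rect
	err := xml.Unmarshal([]byte(s), &r)
	return r, err
}

func main() {
	frag, err := marshalFragment(Rect{Width: 10, Height: 20})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(frag)

	// A caller that wants a declaration prepends xml.Header explicitly.
	fmt.Print(xml.Header + frag + "\n")

	back, err := parseFragment(frag)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(back.Width, back.Height)
}
```

The fragment is a single well-formed element, so a parser that enforces well-formedness without requiring a declaration would still accept it.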
Go version
go version go1.21.11 linux/amd64
Output of go env in your module/workspace:
What did you do?
https://go.dev/play/p/r8y4cgcybkS
What did you see happen?
encoding/xml accepts ill-formed XML.
What did you expect to see?
encoding/xml should reject all ill-formed XML. Except for the last tryUnmarshal() call I linked, the constraints that it fails to check can be checked for without resolving namespaces, and therefore can be checked for even by RawToken().
See:
Edit: removed #53728 because it is about serialization, not parsing.
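As a rough illustration of a well-formedness constraint that needs no namespace resolution, the sketch below layers a check on top of RawToken. The rule shown (exactly one root element, with no non-whitespace text outside it) is an assumption chosen for illustration and is not necessarily one of the cases in the linked playground. checkSingleRoot is a hypothetical helper, not an encoding/xml API.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"strings"
)

// checkSingleRoot (hypothetical helper) reads doc with RawToken and
// rejects documents that lack exactly one root element or that have
// non-whitespace character data outside it. It only tracks depth and
// never looks at namespaces.
func checkSingleRoot(doc string) error {
	dec := xml.NewDecoder(strings.NewReader(doc))
	depth, roots := 0, 0
	for {
		tok, err := dec.RawToken()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if depth == 0 {
				roots++
				if roots > 1 {
					return errors.New("more than one root element")
				}
			}
			depth++
		case xml.EndElement:
			if depth == 0 {
				return errors.New("end tag outside any element")
			}
			depth--
		case xml.CharData:
			// XML whitespace is space, tab, CR, and LF only.
			if depth == 0 && len(bytes.Trim(t, " \t\r\n")) > 0 {
				return errors.New("text outside the root element")
			}
		}
	}
	if roots == 0 {
		return errors.New("no root element")
	}
	if depth != 0 {
		return errors.New("unclosed element at end of input")
	}
	return nil
}

func main() {
	for _, doc := range []string{"<a/>", "<a/><b/>", "<a/>junk"} {
		fmt.Printf("%q: %v\n", doc, checkSingleRoot(doc))
	}
}
```

Because RawToken does not verify that start and end tags match, a fuller version would also need a name stack; this sketch deliberately checks only the one rule named above.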