Closed FiloSottile closed 3 years ago
Change https://golang.org/cl/277893 mentions this issue: encoding/xml: replace comments inside directives with a space
Change https://golang.org/cl/277892 mentions this issue: encoding/xml: handle leading, trailing, or double colons in names
If the goal is extracting a subset of the larger message, that could be obtained with complete confidence by exposing a way to get the byte range corresponding to an element. Then you’d parse, discover the slice that contains the part you need, and slice the original doc for the new content.
> Is round-trip stability sufficient, or do all those applications actually need perfect alignment with the spec on how the output is parsed? In other words, if the output is generated by Go but parsed by libxml and they disagree on the list of tokens, is that an issue? If yes, this is a much harder property to provide. HTTP/1 parsers have been struggling with it for decades, as shown by the constant stream of request smuggling vulnerabilities across the ecosystem, and HTTP/1 is arguably a much simpler protocol.
At the very least, Go needs to preserve round-trip stability while processing namespaces correctly. Mattermost’s blog post agrees with me on this: SAML requires correct namespace processing. Furthermore, at least some of the vulnerabilities are due to accepting invalid XML, which `encoding/xml` simply should not do. If this cannot be changed because of the Go 1 stability guarantee, then the old functions should be deprecated.
We can guarantee round-tripping of arbitrarily complex XML by having `encoding/xml` produce tokens that include the bytes from which they were parsed. Serializing these tokens would then be merely a matter of concatenation. This is a variant of what @josharian suggested.
Re "invalid XML" -- Please be careful about terminology. XML validity is defined against schemas, and there are many legitimate use cases for processing XML which has not been validated. If the library is accepting ill-formed XML without complaint, or not detecting invalid XML when validation has been requested and a schema provided, that's a problem.
I would recommend trying to avoid adding cost to the process except when actually needed. Offering the option of byte-for-byte roundtripping may make sense; making every user pay the overhead for that probably does not.
Note that XML-DSig 2.0 explicitly references XML Canonicalization as a way to resolve some of the fragilities. I believe that even when XML-DSig 1.0 was announced, the community was very aware that it was stable only for strictly canonicalized documents.
> Re "invalid XML" -- Please be careful about terminology. XML validity is defined against schemas, and there are many legitimate use cases for processing XML which has not been validated. If the library is accepting ill-formed XML without complaint, or not detecting invalid XML when validation has been requested and a schema provided, that's a problem.
“invalid” was the terminology used in the CL that changed handling of incorrectly placed colons. I believe that XML is in fact ill-formed.
> I would recommend trying to avoid adding cost to the process except when actually needed. Offering the option of byte-for-byte roundtripping may make sense; making every user pay the overhead for that probably does not.
This needs to be clear in the documentation, with explicit examples that can be used for test-cases. I believe we should guarantee that if XML is serialized and then parsed, it produces the same token stream.
> Note that XML-DSig 2.0 explicitly references XML Canonicalization as a way to resolve some of the fragilities. I believe that even when XML-DSig 1.0 was announced, the community was very aware that it was stable only for strictly canonicalized documents.
Can `encoding/xml` guarantee that canonicalized XML does round-trip successfully? `encoding/xml` should also include canonicalization routines.
In SAX, at least, same token string is explicitly not guaranteed. Outputting text content, for example, can occur in chunks of whatever size is convenient for the program; reading them back in will occur in chunks of whatever sizes happen to naturally fit the parser's buffer size and the data's alignment with that. If you want token-level standardization from SAX, you need a canonicalizer layer as part of the parse operation.
Similarly, a DOM parser may or may not retain the distinction between text nodes and CDATA nodes. That's a syntactic feature, not a semantic one. And while a parser typically does deliver contiguous text as a single text node, the DOM model is perfectly happy with consecutive text nodes and that detail is not retained during serialization.
XML is defined in terms of the XML Data Model, not a token stream. A difference which makes no semantic difference should make no semantic difference.
In that case, we should include canonicalization in `encoding/xml`.
What we are trying to figure out is if a subset of the encoding/xml functionality (for example, just a namespace-naive tokenizer) would be useful for SAML implementations (and other applications that need more security properties than we provide), so we can upgrade the security guarantees for that subset.
If SAML and XML-DSig need a superset of not only the security properties, but also of the features of encoding/xml, then it sounds like they definitely can't and shouldn't be implemented on top of encoding/xml.
The feedback I am getting from other discussion venues is that XML-DSig might be complex enough to deserve a dedicated XML tokenizer implementation. Of the three affected libraries, one is discussing deprecating SAML support (dexidp/dex#1884), and the maintainer of the other is saying that they don't have confidence in the protocol safety.
If the conclusion is that a safe SAML implementation needs a new library that is either a substantial extension (as opposed to a subset) or a breaking change of encoding/xml, that should be (or at least start as) a third-party module. Implementing canonicalization and offering strict security guarantees for it is a major project, not a security fix.
/cc the maintainers of some major involved downstreams, although we want to hear from anyone who has code that relies on security properties that encoding/xml does not provide: @ericchiang @russellhaering @crewjam @beevik
Hi, I've written a successful XML parser (in Java) and am also the idiot who someone convinced (20+ years ago) that Namespaces were a good idea and edited that spec. I'm also somewhat Go-competent. I'm pretty sure that something that took a blob of bytes that claimed to be XML, puked on anything not well-formed, and got the namespace semantics right, would make SAML & DSIG happy - although I don't know their Go implementations & thus their API requirements.
It seems plausible that fixing this would, per Go culture, require new APIs. BTW the same thing happened in Ruby, there was a native implementation that was severely broken and then someone wrapped one of the C parsers in a new set of APIs which is what everyone uses now.
Anyhow, purpose of this note is to volunteer to help. Not sure I'm up for doing another XML parser from scratch unless someone's willing to pay for it, but I think I understand the issues.
Just to chime in, as another Golang SAML implementor and assessor:
I'm alarmed at the idea of trying to bend encoding/xml into part of a SAML/DSIG stack. It was quickly apparent to me when I implemented my SAML IdP that encoding/xml's namespacing support wasn't suitable for DSIG, which is heavily dependent on namespacing, and that its output didn't give me enough control to do canonicalized output.
I've reached the conclusion that for SAML, you really want a defensively written, purpose-built, minimal (as minimal as you can get with DSIG) XML parser; as Advent of Code is proving, I'm not an especially excellent programmer, but the XML parsing required to do SAML was just a few hundred lines of code, and it's intimately woven into DSIG c14n, which is code you have to write anyway to do DSIG.
DSIG is quite probably the scariest cryptography standard in common use, scarier even than X.509, and if the Go standard library provides tools for DSIG like c14n (a source of security bugs in other systems!), the standard library maintainers own the security implications for that.
This makes much more sense as a third-party package, one that doesn't have to make any compromises with the other uses XML is put to.
@tqbf (a) love your username, (b) is your code open source?
Sounds great to me. I'm more than somewhat Go-incompetent, but if I can find a brain cell or two to contribute... (To further support Tim's authority in this area: he's also the author of the Annotated XML Specification -- https://www.xml.com/axml/axml.html -- which is excellent documentation of exactly what the XML 1.0 spec actually means, why those decisions were made, and where there be tygres and dragonnes.)
Thanks for jumping in, @timbray. Can you say a little more about:
> I'm pretty sure that something that took a blob of bytes that claimed to be XML, puked on anything not well-formed, and got the namespace semantics right, would make SAML & DSIG happy - although I don't know their Go implementations & thus their API requirements.
What does "got the namespace semantics right" mean to you? The current encoding/xml tokenizer tried to do this, making it so that readers would see the "actual namespace semantics" and not the raw syntax details like which shortening prefixes were being used. To take the W3C example, given `<bk:book xmlns:bk='urn:loc.gov:books'>`, the current tokenizer passes up an element with `xml.Name{Space: "urn:loc.gov:books", Local: "book"}`. You can't even see the `bk`, which of course can be replaced by any other prefix and is supposed to have the same meaning. Is this what you meant by "got the namespace semantics right", or something else?
I made two mistakes in the tokenizer, both fixable without new API:

- The tokenizer parses `<:foo>` the same as `<foo>`, making it round-trip to `<foo>`. That's easy to reject or preserve instead.
- The tokenizer doesn't reject elements with multiple namespace declarations, like `<book xmlns="something" xmlns="something-else">`. I think we can reject those.

The IgnoreNamespaces proposal in this issue goes much further, passing up `bk:book` instead. I am honestly not convinced that's right, because programs will then hard-code `bk` and end up mishandling namespaces. It seems like if a SAML library wanted to use this, it would then have to reimplement the namespace tracking to get the semantics right. And I think many use cases would use IgnoreNamespaces rather than do the right thing with the Name.Space URLs, leading to more incorrect processing of XML than correct processing.
But again, I'm interested to hear what the experts think about what "right" means. Thanks.
I am also fully on board with the comments above that putting encoding/xml inside your security perimeter is a mistake, so I'm a bit reluctant to make changes that are only useful for SAML libraries. But again it seems like even for SAML, IgnoreNamespaces is a mistake.
FWIW a few years ago I fixed a bunch of namespace issues in the encoding/xml package. There are indeed many broken things - at least there were when I last looked. The fixes were merged into the standard library but broke some tests inside Google, so the changes were reverted pending a proper proposal.
If I'd ever had cause to use XML since then, I'd probably have made the effort to go through the whole thing again, but I haven't, I'm afraid :)
A fork of the fixed package still lives at github.com/juju/xml if anyone is interested to dig it out again. The main motivation I originally had for the fixes was to be able to correctly round-trip XML.
@rsc - Your description of the semantics with "bk" and urn:loc.gov:books sounds right to me. IgnoreNamespaces sounds completely crazy to me. You might not like the namespace design (lots of people don't) but you can't ignore them if an XML doc has them.
I'd think that a namespace-sensitive API being used to process an XML doc should return every element & attribute name as a (namespace, name) pair, with the namespace either being the URI or nil.
But I actually haven't yet understood what the bug is; they just say "instability".
Depending on how you're defining round-tripping, the API may also want to return the prefix. And it should also make sure to return namespace declaration attributes, so they appear exactly where they did in the input rather than being generated at the point where they are needed.

I absolutely agree that "ignore namespaces" is a Bad Idea. The only place where it made sense was in code specifically written for backward compatibility with documents and apps that predated namespaces. (Which is also the only reason we didn't immediately deprecate the non-namespace-aware DOM APIs.) But there should be nothing which still needs that archaism; if there is, fix it or replace it.

"Your data is important to us. Please stay on the port, and your connection will be accepted by the next available thread."
For the record, here are some of the CLs that fixed some of the problems that are in encoding/xml (they were all reverted later):
https://go-review.googlesource.com/c/go/+/2660/
https://go-review.googlesource.com/c/go/+/5910/
https://go-review.googlesource.com/c/go/+/6927/
https://go-review.googlesource.com/c/go/+/11635/
With those changes applied, namespace prefixes were not preserved on round-tripping (I think that would be very hard to do while preserving backward compatibility), but namespace semantics were.
> - The tokenizer parses `<:foo>` the same as `<foo>`, making it roundtrip to `<foo>`. That's easy to reject or preserve instead.
> - The tokenizer doesn't reject elements with multiple namespaces, like `<book xmlns="something" xmlns="something-else">`. I think we can reject those.
Anything claiming to be an XML parser should reject both.
The core XML syntax says you can't have multiple attributes on an element with the same name, and syntactically XML namespace declarations are attributes. This would be an ill-formed document, and the author should be made to fix it. (https://www.w3.org/TR/xml/#uniqattspec)
XML reserved the colon character in attribute and element names for use by namespaces. A qname may not start with a colon, since colon can't appear unless there is a prefix, the prefix is a NCName, and an NCName can't be empty (it must at least have the NameStartChar). (https://www.w3.org/TR/xml-names/#ns-qualnames and drill down from there.)
> Anything claiming to be an XML parser should reject both.
We agree with you on that. We can try that for Go 1.17. (But that's not this issue.)
For this issue, it sounds like the general consensus is that we should not add IgnoreNamespaces.
That is, this seems like a likely decline.
+1, Thank you.
@rsc what about @rogpeppe’s CLs? With those added, `encoding/xml` should round-trip.
Based on the discussion above, this proposal seems like a likely decline. — rsc for the proposal review group
> @rsc what about @rogpeppe’s CLs? With those added, `encoding/xml` should round-trip.
Those too would have to go in Go 1.17 because they significantly change the output.
Change https://golang.org/cl/355353 mentions this issue: encoding/xml: support xmlns prefixes
Background
Juho Nurminen of Mattermost reported multiple parsing issues in encoding/xml that lead to round-trip mismatches: parsing the output of encoding a list of tokens does not produce the same list of tokens.
Some of those issues are fixed by CLs linked to this issue, but encoding/xml does not provide such a security property, was not designed to, and we don’t believe that fixing the issues we know about is sufficient to guarantee it (nor that the round-trip guarantee is necessarily sufficient for all applications).
Indeed, the investigation kept surfacing different issues impacting round-trip stability, and we are aware of at least one variant affecting the Token API that is neither fixed by an open CL nor detected by the researchers’ validator (which we are told only aims to protect RawToken uses).
Unfortunately, XML-DSig and SAML implementations came to rely on round-trip stability, if not on perfect spec alignment, for their security. This is not entirely surprising, because XML-DSig and SAML are fundamentally fragile designs, but they are also critical enough in the ecosystem that we should try to find a way to let them operate safely.
Regardless of the outcome of this proposal, applications should avoid relying on complex parsers agreeing on how to interpret inputs for their security when possible, and should minimize transfers and modifications between the time of data validation and the time of use.
We’d like to thank Mattermost for reporting these issues, for working with us through the investigation, and for coordinating the disclosure of the issues to the downstream libraries so they could mitigate in advance.
If you use an affected SAML library, please refer to the following security advisories:
Proposal
Hopefully, applications need this kind of security property only when doing limited modifications of a document (for example, extracting a part to forward). If that’s the case, we can try to find a limited subset of the encoding/xml functionality for which we can effectively test that the outputs parse unambiguously.
(We tried fuzzing the full featureset, but the fuzzer found a lot of round-trip changes which are semantically correct, but hard to validate automatically, and changing those behaviors would probably be a breaking change.)
As a first suggestion, I propose adding an `IgnoreNamespaces` field to `Decoder`, which when set disables all namespace processing. Names with colons are not split and are copied whole into `Name.Local`.

We have a prototype of this and it survived extensive fuzzing (with the only caveat that CDATA sections end up merged with surrounding text, as the encoding/xml data structures don’t preserve escaping information).
Open questions
Is this API actually usable by the applications that need it?
Is round-trip stability sufficient, or do all those applications actually need perfect alignment with the spec on how the output is parsed? In other words, if the output is generated by Go but parsed by libxml and they disagree on the list of tokens, is that an issue? If yes, this is a much harder property to provide. HTTP/1 parsers have been struggling with it for decades, as shown by the constant stream of request smuggling vulnerabilities across the ecosystem, and HTTP/1 is arguably a much simpler protocol.