golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

encoding/xml: proposed fixes for namespaces #14407

Open pdw-mb opened 8 years ago

pdw-mb commented 8 years ago

Issue #13400 lists a number of issues related to namespace handling in encoding/xml. It's been noted that this area needs a bit of a rethink. This issue documents a set of proposed fixes to address the problems currently seen. I've grouped the current set of bugs into 7 separate topics to be addressed:

1. Lack of control over prefixes

Encoder currently provides no mechanism for the user to specify namespace/prefix bindings. In theory, this shouldn't matter: documents using different prefixes for the same namespaces are equivalent. In practice, people often do care.

For namespaced attributes, Encoder generates namespace prefixes based on the last part of the URI, and for elements, it redeclares the default namespace as required. This results in documents that are technically correct, but not what the user wants, and the generated prefixes may be cumbersome.
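A minimal sketch of the behaviour described above, assuming a made-up `Item` type and made-up example.com namespace URIs. It marshals a value whose element and attribute are in different namespaces, then re-reads the output with `RawToken` to see which prefix the Encoder generated for the attribute. The helper `attrPrefix` is hypothetical and not part of encoding/xml.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Item is a hypothetical type whose element and attribute use different
// namespaces. Both URIs are made up for this example.
type Item struct {
	XMLName xml.Name `xml:"http://example.com/ns item"`
	ID      string   `xml:"http://example.com/attrs id,attr"`
}

// attrPrefix marshals v, then re-reads the output with RawToken (which does
// not translate prefixes) and returns the output together with the prefix
// the Encoder generated for the attribute whose local name is local.
func attrPrefix(v interface{}, local string) (out, prefix string, err error) {
	b, err := xml.Marshal(v)
	if err != nil {
		return "", "", err
	}
	out = string(b)
	tok, err := xml.NewDecoder(strings.NewReader(out)).RawToken()
	if err != nil {
		return out, "", err
	}
	start, ok := tok.(xml.StartElement)
	if !ok {
		return out, "", fmt.Errorf("unexpected first token %T", tok)
	}
	for _, a := range start.Attr {
		// Skip the xmlns:* declarations themselves.
		if a.Name.Space != "xmlns" && a.Name.Local == local {
			return out, a.Name.Space, nil
		}
	}
	return out, "", fmt.Errorf("attribute %q not found", local)
}

func main() {
	out, prefix, err := attrPrefix(Item{ID: "42"}, "id")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
	fmt.Println("generated attribute prefix:", prefix)
}
```

The element namespace is emitted as a default `xmlns="..."` declaration, while the attribute prefix is derived from the last path segment of its URI; the caller has no say in either.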

This raises the question of how much control we should give over prefixes.

XML allows quite a lot of complexity: prefixes can be rebound to different namespace URIs within a document, and the same namespace URI can simultaneously be bound to multiple prefixes. There's a very old post on xml-dev that makes a plea to produce "sane" documents, that is, ones where a given prefix is only ever used to refer to a single namespace URI, and a given namespace URI is only ever represented by a single prefix:

http://lists.xml.org/archives/xml-dev/200204/msg00170.html

I suggest that it is sufficient that we support the creation of "sane" documents, noting that we should allow a single URI to be represented by both a prefix and the default namespace within a document (this is effectively what Encoder does currently).

I think that the current approach of using the default namespace for elements, and generated prefixes for attributes is a good default: generated prefixes may be ugly, so we should use them only where needed (i.e. attributes), but should provide a mechanism for users to specify their own.

Proposed approach:

Notes/questions:

  <foo>
    <x:bar xmlns:x="blort" />
    <y:baz xmlns:y="blort" />
  </foo>

Issues addressed:

11496: Serializing XML with namespace prefix

9519: support for XML namespace prefixes

I think #9519 is based on a misunderstanding of how the current system works, but it seems likely that the user actually wants control over prefix names. I'm not sure if the reporter realises that a prefix is meaningless (and illegal) without a namespace URI binding.

2. Inability to access/set namespace declaration (handling QName values)

Namespace bindings are sometimes used by element and attribute values. For example:

  <foo xmlns:a="bar">a:blort</foo>

In order to correctly understand "a:blort" you need to know the currently effective namespace bindings. The same problem exists in reverse when encoding: you need to make sure that necessary namespace declarations are in place in the document.
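As a rough illustration of what resolving such a value involves today, the sketch below walks the token stream, collects the `xmlns:*` attributes that `Token` currently reports, and resolves the text content against them. The function name `resolveQNameText` is hypothetical, and the sketch is deliberately simplified: it keeps one flat map of bindings rather than tracking element scope, and it leaves unprefixed values unqualified.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// resolveQNameText scans doc, records xmlns:* declarations seen on start
// elements, and resolves the first non-blank text content as a QName.
// Simplification: bindings are kept in one flat map, with no scoping.
func resolveQNameText(doc string) (xml.Name, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	bindings := map[string]string{} // prefix -> namespace URI
	for {
		tok, err := dec.Token()
		if err != nil {
			return xml.Name{}, err // io.EOF if no text was found
		}
		switch t := tok.(type) {
		case xml.StartElement:
			for _, a := range t.Attr {
				// Token reports xmlns:a="bar" as {Space: "xmlns", Local: "a"}.
				if a.Name.Space == "xmlns" {
					bindings[a.Name.Local] = a.Value
				}
			}
		case xml.CharData:
			text := strings.TrimSpace(string(t))
			if text == "" {
				continue
			}
			i := strings.Index(text, ":")
			if i < 0 {
				// Unprefixed value: left unqualified in this sketch.
				return xml.Name{Local: text}, nil
			}
			uri, ok := bindings[text[:i]]
			if !ok {
				return xml.Name{}, fmt.Errorf("undeclared prefix %q", text[:i])
			}
			return xml.Name{Space: uri, Local: text[i+1:]}, nil
		}
	}
}

func main() {
	name, err := resolveQNameText(`<foo xmlns:a="bar">a:blort</foo>`)
	fmt.Println(name.Space, name.Local, err)
}
```

This only works at the token level; code running inside an Unmarshaler has no equivalent access to the bindings, which is the gap this section is about.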

Proposed approach:

We need to allow Unmarshalers and Marshalers to obtain and insert namespace bindings respectively. This means:

  1. A method on Decoder to expose current namespace bindings (trivial - it's already present privately)
  2. A change to UnmarshalerXMLAttr, as this doesn't currently provide the decoder. The safe way to make this change would be to create a new interface (e.g. UnmarshalerXMLAttrWithDecoder).
  3. Provide a method for Marshalers to inject namespace bindings. I suggest doing this by providing a method on Encoder to obtain a prefix for a namespace (GetPrefix ?), which will then take care of declaring the namespace if it hasn't yet been used. If the user cares what prefix they get, they should provide a preferred prefix prior to making the call to obtain one.
  4. As a convenience, we should make XMLName a MarshalerXML/UnmarshalerXML

Issues addressed:

12406: support QName values / expose namespace bindings

3. Specifying namespaces in tags is cumbersome

Currently namespaces for elements may only be specified by including the full namespace URI, e.g.:

  `xml:"http://www.example.com/some/namespace/v1 foo"`

Aside from being verbose and repetitive, it means URIs can't be changed at runtime. It's not uncommon to want to use the same struct for different namespaces, for example, where the version number in the namespace has changed, or as per #12624, to cope with documents using a subtly wrong namespace.
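A small sketch of the runtime problem, assuming a hypothetical `Foo` type and a hypothetical `.../v2` variant of the URI above. Because the URI is fixed in the struct tag, the same struct rejects an otherwise identical document whose namespace version differs.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Foo hard-codes the v1 namespace URI in its struct tag.
type Foo struct {
	XMLName xml.Name `xml:"http://www.example.com/some/namespace/v1 foo"`
	Value   string   `xml:",chardata"`
}

// decodeFoo unmarshals doc into a Foo.
func decodeFoo(doc string) (Foo, error) {
	var f Foo
	err := xml.Unmarshal([]byte(doc), &f)
	return f, err
}

func main() {
	v1 := `<foo xmlns="http://www.example.com/some/namespace/v1">bar</foo>`
	// The v2 URI is an assumption made for this example.
	v2 := `<foo xmlns="http://www.example.com/some/namespace/v2">bar</foo>`
	for _, doc := range []string{v1, v2} {
		f, err := decodeFoo(doc)
		fmt.Printf("value=%q err=%v\n", f.Value, err)
	}
}
```

The v1 document decodes; the v2 document fails with a namespace mismatch error, and there is no way to tell the Decoder to substitute a different URI for the one in the tag.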

Proposed solution:

Given the mechanism in (1) to allow the user to specify namespaces/prefix mappings, it makes it possible for a struct to unambiguously use prefixes to reference namespaces. The obvious notation is QName notation:

  `xml:"nsv1:foo"`

Under this proposal it would be an error to use a prefix that hadn't been explicitly specified by the user (i.e. it won't use prefixes picked up from the document when decoding). Users might be surprised that the above wouldn't match the following document unless they'd explicitly set the prefix "nsv1" on the Decoder:

   <nsv1:foo xmlns:nsv1="...">bar</nsv1:foo>

but doing so would be inherently fragile, as it wouldn't work with the entirely equivalent:

  <foo xmlns="...">bar</foo>

Notes:

This proposal changes the behaviour of Encoder/Decoder for tags with a colon in them, which existing code may rely on. On the other hand, the current behaviour of such tags is clearly a source of confusion and bugs, and doesn't work for decoding anyway (see #11496).

Issues addressed:

9775: Unmarshal does not properly handle NCName in XML namespaces

I think the bug as described is invalid: it's not clear what you'd expect to happen given that the namespace being used is undeclared.

12624: brittle support for matching a namespace by identifier or url

The exact requirement behind this bug is not totally clear: it appears that the user wants unmarshaled elements that have one of a number of namespaces. I don't understand "Xmlns wasn't defined, but the namespace was used (ie. for mRSS with media namespace)" - that sounds like invalid XML.

4. "No namespace" indistinguishable from "any namespace" in struct tags

When decoding, the tag xml:"foo" means element "foo" in any namespace. There's no way to say that you want foo in the null namespace. i.e.

   <foo xmlns="" />

This is a problem if a namespaced sibling of the same localname also exists. #8535 demonstrates this quite clearly.
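A minimal sketch of the "any namespace" matching, assuming a hypothetical `Root` type and a made-up `urn:example:ns` namespace. The same untagged field is filled from a `foo` element whether or not that element is namespaced.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Root has a field tagged with a bare local name and no namespace.
type Root struct {
	Foo string `xml:"foo"`
}

// decodeFoo returns the value that ends up in Root.Foo for doc.
func decodeFoo(doc string) (string, error) {
	var r Root
	err := xml.Unmarshal([]byte(doc), &r)
	return r.Foo, err
}

func main() {
	docs := []string{
		`<root><foo>plain</foo></root>`,
		`<root><foo xmlns="urn:example:ns">namespaced</foo></root>`,
	}
	for _, doc := range docs {
		v, err := decodeFoo(doc)
		fmt.Printf("Foo=%q err=%v\n", v, err)
	}
}
```

Both documents populate `Foo`, so the tag alone cannot be used to select only the un-namespaced element.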

Proposed approach:

Introduce a way of explicitly referencing the null namespace, e.g.

  `xml:"_ foo"`

We could go for the logical, but horribly subtle:

  `xml:" foo"`

(note the space before foo)

Issues addressed:

8535: failure to handle conflicting tags in different namespaces

11724: namespaced and non-namespaced attributes conflict

5. Bug: default namespace not set to null on un-namespaced children

It's not currently possible to produce the following XML:

<a xmlns="b">
  <c xmlns=""/>
</a>

If you produce <c> with a tag of:

`xml:"c"`

No xmlns declaration will be added, so <c> will inherit the namespace of its parent <a>. This is related to issue (4): we don't currently distinguish between "any namespace" and "no namespace".
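The difference the missing `xmlns=""` makes can be seen from the decoding side. The sketch below (with a hypothetical helper, `childSpace`) reports the namespace the Decoder assigns to `<c>` with and without the explicit reset.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// childSpace returns the namespace URI that the Decoder reports for the
// first start element in doc whose local name is local.
func childSpace(doc, local string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		tok, err := dec.Token()
		if err != nil {
			return "", err // io.EOF if the element is never seen
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == local {
			return se.Name.Space, nil
		}
	}
}

func main() {
	inherited := `<a xmlns="b"><c/></a>`
	reset := `<a xmlns="b"><c xmlns=""/></a>`
	for _, doc := range []string{inherited, reset} {
		space, err := childSpace(doc, "c")
		fmt.Printf("%s -> c in namespace %q (err=%v)\n", doc, space, err)
	}
}
```

Without the reset, `<c>` is reported in namespace "b"; with `xmlns=""` it is reported with an empty namespace. These are two different documents, and only the first can currently be produced by the Encoder.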

I can see two possible solutions here:

  1. Treat xml:"c" as meaning "no namespace" and insert xmlns="" as required to make that so.
  2. Treat xml:"c" as meaning "any namespace" and make it inherit the namespace of its parent. If you really want no namespace, use xml:"_ c" (or whatever notation we settle on for (4))

I can see arguments both ways.

Issues addressed:

7113: encoding/xml: missing nested namespace not handled

6. Bug: xmlns attributes not removed from Token (Decode/Encode not idempotent)

Decoder includes xmlns attributes in start element tokens. For example, an attribute of xmlns:foo="bar" would be included as an attribute with a name of {Space: "xmlns", Local: "foo"}. This is very dubious. xmlns attributes are special, but whichever way you look at it "xmlns" is not a namespace URI - if anything, it's a prefix.

This creates problems if you feed the output of a Decoder into an Encoder, as it treats "xmlns" as a namespace URI, and introduces namespace declarations for it.

There's no good reason to include these attributes. It's reasonable to expose the current set of namespace bindings (see point 2), but the attributes themselves are not needed. If a user really wants to do their own namespace processing, they should use RawToken.
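Until the Decoder changes, a caller can filter these attributes out before handing tokens to an Encoder. The sketch below is one possible user-side workaround, not the fix proposed in this issue; `isNamespaceDecl`, `stripNamespaceDecls` and `attrNames` are hypothetical helpers.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// isNamespaceDecl reports whether a is an xmlns="..." or xmlns:p="..."
// declaration as currently surfaced by Decoder.Token.
func isNamespaceDecl(a xml.Attr) bool {
	return a.Name.Space == "xmlns" || (a.Name.Space == "" && a.Name.Local == "xmlns")
}

// stripNamespaceDecls returns a copy of start without namespace declarations.
func stripNamespaceDecls(start xml.StartElement) xml.StartElement {
	kept := make([]xml.Attr, 0, len(start.Attr))
	for _, a := range start.Attr {
		if !isNamespaceDecl(a) {
			kept = append(kept, a)
		}
	}
	start.Attr = kept
	return start
}

// attrNames decodes the first start element of doc and lists its attribute
// names as "Space Local", optionally after stripping declarations.
func attrNames(doc string, strip bool) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		tok, err := dec.Token()
		if err != nil {
			return nil, err
		}
		start, ok := tok.(xml.StartElement)
		if !ok {
			continue
		}
		if strip {
			start = stripNamespaceDecls(start)
		}
		names := make([]string, 0, len(start.Attr))
		for _, a := range start.Attr {
			names = append(names, a.Name.Space+" "+a.Name.Local)
		}
		return names, nil
	}
}

func main() {
	doc := `<a xmlns:foo="bar" foo:x="1"/>`
	raw, _ := attrNames(doc, false)
	clean, _ := attrNames(doc, true)
	fmt.Printf("as decoded: %q\nstripped:   %q\n", raw, clean)
}
```

Note that the ordinary attribute `foo:x` is already reported with its namespace URI in `Space`, so nothing is lost by dropping the declaration attributes.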

Proposed solution:

Issues addressed:

7535: Encoder duplicates namespace tags

7. Specifying xmlns attributes manually: allow or disallow?

Should we allow users to manually insert xmlns:* or xmlns="..." attributes?

8167: disallow attributes named xmlns:*

11431: encoding/xml: loss of xmlns= in encoding since Go 1.4

I don't think we need to support this, given the mechanism introduced under (1) and (3) above. One of the reasons why you might want to do it, is because namespace URIs are otherwise hard-coded into struct tags. The solution to (3) gives us a mechanism to avoid this.

That said, I'm struggling to see why we couldn't treat this as a call to add a preferred prefix - although there's a question of whether it should force the creation of the xmlns declaration if it's already in scope.

8. Other issues

11735: empty namespace conventions are badly documented

Yes, this should be clearer.

8068: encoding/xml: empty namespace prefix definitions should be illegal

It sounds like this should be resolved as invalid.

SamWhited commented 8 years ago

  4. "No namespace" indistinguishable from "any namespace" in struct tags

If 4 is implemented, it would be nice if the null value could be on the namespace or on the local element name. E.g. if I want to unmarshal either of the following two XMPP errors:

<defined-condition xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'/>
<invalid-xml xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'/>

I might do something like the following:

`xml:"urn:ietf:params:xml:ns:xmpp-stanzas _"`

This could also use the extra space formatting, but that makes things unreadable at a glance, so I definitely wouldn't recommend it.

pdw-mb commented 8 years ago

An interesting use case. Unfortunately, I think it risks creating confusion: (4) suggests using _ to mean "no namespace" as opposed to "any namespace". Your use case is asking for "any localname", so is not consistent with the semantics we're assigning to _. We could consider another character (e.g. *). It probably warrants a separate issue, as I think it's independent of what we do for namespaces.

SamWhited commented 8 years ago

Ah, I think I was misunderstanding what 4 was proposing. I've filed #14433 as a separate issue instead. Sorry for the noise.

pazderak commented 8 years ago

I am working on this issue now as my company needs to have this functionality gap covered ASAP. I would like to send some patch soon (i.e. this week).

dimitertodorov commented 7 years ago

Are there any patches proposed for this yet, or a branch in progress?

iwdgo commented 6 years ago
  1. Lack of control over prefixes

    11496: commented, as a solution is available in the current version.

    9519: condensed notation (self-closing tag, prefixed tag name, ...) is easily fixed but would not be configurable, i.e. only the shortest notation would be available.

  2. Inability to access/set namespace declaration (handling QName values)

    12406: support QName values / expose namespace bindings

    The maps and arrays exist in the decoder but are currently not exposed.

  3. Specifying namespaces in tags is cumbersome

    Marshal and unmarshal documentation must be read together.

    9775: Commented - invalid syntax

    12624: Commented - invalid syntax

  4. "No namespace" indistinguishable from "any namespace" in struct tags

    8535: Fix submitted

    11724: Same fix (duplicate)

  5. Bug: default namespace not set to null on un-namespaced children

    7113: encoding/xml: missing nested namespace not handled

    No fix yet but issue required other fixes before fixing it.

  6. Bug: xmlns attributes not removed from Token (Decode/Encode not idempotent)

    7535: Fix submitted

iwdgo commented 6 years ago

7113 has a proposed fix. This fix keeps the tracking of depth active even when no indent is required. The request to expose the namespace prefixes (i.e. the existing maps) of the tags is fairly simple but there is loss of information on the actual structure of the XML document, i.e. usage might be confusing.

markfarnan commented 4 years ago

Any likely progress on this issue?

I am currently having to 'fix' namespaces/bindings in XML documents by passing them through libxml for namespace cleanup. This is less than ideal.

m29h commented 11 months ago

I played on a fork of encoding/xml in my repository github.com/m29h/xml/

  1. It always produces prefixes for namespaced elements, in the same way as is standard for namespaced attributes.
  2. The element attributes are sorted before rendering as per the rules of C14N-XML. The byte sequence marshaled output will (at least in typical cases) be C14N Canonical XML.
  3. It can be a drop in replacement for encoding/xml that only changes marshaling behaviour but does not change/limit any interface or other feature today known from encoding/xml.
  4. No external dependencies of the module. To ensure that this is a serious attempt, I kept/adapted 100% of the existing unit tests from encoding/xml and all of them pass with the new serialization behavior.

It does not address all of the issues addressed above, but at least relieves my own pain in my particular use-case.

Akkarine commented 3 months ago

For anyone just looking for a way to compose a valid request for a particular server, there is a working workaround: https://github.com/golang/go/issues/9519#issuecomment-252196382