proposal: encoding/xml: option to treat unknown fields as an error

zelch commented 5 years ago

Towards the end of 2013, issue #6901 was opened requesting support to error on unknown fields instead of silently dropping the data.

In 2016, that issue was closed in favor of issue #15314, where DisallowUnknownFields was implemented for JSON, however that issue was JSON specific, and as a result the issue was never really resolved for encoding/xml.

As such, I would like to propose either adding DisallowUnknownFields to the encoding/xml Decoder type, to exactly mirror the JSON API, or adding DisallowUnknownElements and DisallowUnknownAttributes to the encoding/xml Decoder type.

The former would provide more consistency between the JSON and XML interfaces, the latter would be a moderately better fit for XML.

From a look through the code, it looks like it should be reasonably trivial to implement either version, and I would be happy to do the implantation work if there is agreement on how to go about it.

dmitshur commented 5 years ago

/cc @rsc per owners.

gopherbot commented 4 years ago

Change https://golang.org/cl/191962 mentions this issue: encoding/xml: disallow unknown elements/attributes in Decoder

escholtz commented 4 years ago

I think we should consider returning all unknown elements/attributes (or perhaps the first N). This is useful for cases where xml is being used as a file format and you want to find all element typos/mistakes at once (in the same way the go compiler returns more than one error at once). Behavior could be controlled using an integer instead of a bool.

The downside is that the style would be different from the encoding/json package. While consistency is ideal, there are already several differences between encoding/json and encoding/xml so maybe one more would be ok?

utrack commented 2 years ago

It would be really helpful to have this - the CL mentioned here is fairly compact and goes in style with JSON decoder.

gopherbot commented 2 years ago

Change https://golang.org/cl/362243 mentions this issue: encoding/xml: optional Decoder field enforce

johnnybubonic commented 2 years ago

It appears this has stagnated. I have resolved the merge conflicts and otherwise presented the original CL as-is.

https://go-review.googlesource.com/c/go/+/362243

johnnybubonic commented 2 years ago

Update: no update, no traction, no response, no review.

Time moves on, another day without strictly enforced XML. Once again, XML is treated as the redheaded stepchild compared to JSON. I tire.

🕓

cristaloleg commented 2 years ago

Should this be added to the review meeting minutes?

johnnybubonic commented 2 years ago

If you think it'd help! It's past the freeze so..

ianlancetaylor commented 2 years ago

This is an API change so I'll move it into the proposal process.

It's definitely true that there is very little maintenance on XML and also for that matter JSON.

rsc commented 2 years ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

johnnybubonic commented 2 years ago

@rsc Much thanks!

rsc commented 2 years ago

It sounds like there are three things to disallow:

DisallowUnknownElements, to reject nested elements we don't expect
DisallowUnknownAttributes, to reject attributes we don't expect
DisallowUnknownCDATA, to reject plain text (other than white space) we don't expect (?)

Do we need the third? Probably?

johnnybubonic commented 2 years ago

That's an interesting idea. I didn't see any mention of the need in any of the original discussion when Eddie Scholtz submitted his original CL.

What would be the potential use case for that? My personal use case for DisallowUnknownElements/DisallowUnknownAttributes is so I can know if my structs match the input rather than vice versa.

I use W3C XSD for the actual data input validation (via https://pkg.go.dev/github.com/lestrrat-go/libxml2/xsd but there are a couple other implementations out there as well) - which I'm certainly not suggesting be stdlib (as there are multiple data validation/schema formats out there for XML besides XSD - DTD, Schematron, RELAX NG, etc.). I think this is where XML differs significantly from JSON - there's a handful of validation methods out there as XML itself is expected to conform to a schema (however it is defined/implemented). Although some attempts at JSON validation have occurred, JSON doesn't have that inherent expectation.

All that to say that typically XML validation itself is done via schema implementation, but as there isn't a way for golang to "self-generate" a struct type definition from an XSD (or et. al.), this is where the Disallow would come into play. Typically in XML, the element contents* would be validated by the schema mechanism so I'm having difficulty figuring out where DisallowUnknownCDATA plays into that.

Does that make sense? I'm not opposed to DisallowUnknownCDATA, it just seems to be incongruous to the other two proposed additions in function -- at least in my own mind.

escholtz commented 2 years ago

We needed DisallowUnknownElements and DisallowUnknownAttributes. But adding all three sounds reasonable.

Our specific use case is service configuration via xml files. Sometimes customers hand edit the configuration. Catching typos during unmarshalling is a better user experience than silently using default values for a given field.

Ideally we would be able to provide line/column number (or atleast byte offset which can be converted) for unknown fields and attributes. It would also be ideal to gather all of the unknown fields and attributes instead of just the first one. I think much of this quickly grows beyond the scope of a standard library package and reaches the point where it would be better served by a more general purpose xml parser. I suspect everyone's needs and use cases will be slightly different. However, at least providing similar standard library functionality to the json package seems reasonable.

johnnybubonic commented 2 years ago

We needed DisallowUnknownElements and DisallowUnknownAttributes. But adding all three sounds reasonable.

Right; I'm just trying to keep in mind feature creep -- would specifying unknown CDATA inside an element be considered structure non-conformance or data non-conformance? Keeping in mind feature-parity with JSON, is it closer to an improperly formatted JSON key, or is it closer to improper JSON value? This is where things start to get a little philosophical and weird.

Our specific use case is service configuration via xml files. Sometimes customers hand edit the configuration. Catching typos during unmarshalling is a better user experience than silently using default values for a given field.

Agreed, for sure.

Ideally we would be able to provide line/column number (or atleast byte offset which can be converted) for unknown fields and attributes. It would also be ideal to gather all of the unknown fields and attributes instead of just the first one. I think much of this quickly grows beyond the scope of a standard library package and reaches the point where it would be better served by a more general purpose xml parser. I suspect everyone's needs and use cases will be slightly different. However, at least providing similar standard library functionality to the json package seems reasonable.

Yeah, and that's my big worry here -- feature creep. Some of the above is already implemented in the various schemas I mentioned, and have analogues in corresponding libraries that DO already exist:

XSD (25 libraries)
Schematron (no libraries, but os.Exec could call to external tool)
RELAX NG (no libraries(?), etc.)
DTD (2 libraries)

But I think there needs to be some method of Golang recognizing that there are certain structural components that were otherwise silently ignored, which the DisallowUnknownElements and DisallowUnknownAttributes both would serve nicely. The question I'm wrestling with is whether an unknown CDATA qualifies as (formatted/contained) data or structure. The former which I think Golang stdlib oughtn't concern with, since there's no JSON analogue and it's going to be probably specific to implementation.

Theoretically CDATA should be allowed anywhere regular data is since it's just that -- character data; not intended to be parsed as actual XML/SGML/etc. by the parser. So if this would be implemented, it probably ought to be implemented as a more generic "must be empty" struct tag rather than a DisableUnknownCDATA option. I'd imagine a lot of that handling would be more related to content of elements (e.g. off the top of my head, https://github.com/golang/go/issues/21399). Since the purpose of CDATA is for the containing data to not be parsed, we wouldn't really have a method of even identifying what CDATA would be allowed vs. what wouldn't.

But that's why I'm interested in a use case @rsc had in mind because I don't know if I'm glossing over something mentally.

zelch commented 2 years ago

I would suggest reframing it ever so slightly.

It is about data 'which has no location in the structure'.

It's not that CDATA is bad, it's that any data that we can't actually store in the given structure is potentially bad. It most definitely means that it is data that you can't do a round trip between XML -> Go data structures -> XML on.

The key here is, of course, the 'unknown' part. This is all about unknown data. And unknown CDATA is just as much of a thing to want to reject as unknown tags or attributes.

And for many of the same reasons, it is very likely to be an error by whoever or whatever generated the XML, and using just the tools provided by the standard library you not only can't do anything with it, right now you can't even know it exists.

Sure, stepping outside of stdlib gives you more options, but sometimes what you want is the ability to easily handle XML just like you handle JSON, including whatever content validation that you use on the resulting data structures. And that's just not practical if there are ways to insert data that is just silently dropped.

johnnybubonic commented 2 years ago

@zelch Hrmm. Can you provide me with an example of what this would look like with some sample XML and the behavior you're looking for? I'm still having difficulty visualizing this.

I think by and large it's because I'm unclear what constitutes exactly what "unknown CDATA" actually is. I've been understanding it to mean "content within an element that is not intended to be understood as structure", which is what CDATA's intent is, in which case this can be captured with the ,chardata tag option already:

If the XML element contains character data, that data is accumulated in the first struct field that has tag ",chardata". The struct field may have type []byte or string. If there is no such field, the character data is discarded.

(https://pkg.go.dev/encoding/xml#Unmarshal)

Data "which has no location in the structure" ("structure" here meaning the Golang struct) would either be an element or an attribute and nothing else, would it not?

Conceptually I'm unclear what data which has no location in the structure, where structure is that of the XML document, would actually be as that'd be just plain invalid XML from my understanding, e.g.

somethinghereisnotvalid
<root>
  <child/>
</root>

Can you give me some sample XML you're describing and the behavior you want to see?

zelch commented 2 years ago

Entirely from memory:

<root>
  random text
  <subitem attr='value'>
    more random text
      <tag>text we want</tag>
  </subitem>
</root>

Where does the random text go in a Go Structure?

johnnybubonic commented 2 years ago

It isn't captured by Root.Cdata and Sub.Cdata?

package main

import (
    "encoding/xml"
    "fmt"
    "log"
)

type Root struct {
    Name    xml.Name `xml:"root"`
    Cdata   []byte   `xml:",chardata"`
    SubItem Sub      `xml:"subitem"`
}

type Sub struct {
    Name    xml.Name `xml:"subitem"`
    Attr    string   `xml:"attr,attr"`
    Cdata   []byte   `xml:",chardata"`
    TagItem string   `xml:"tag"`
}

var xmlDoc []byte = []byte(`
<root>
  random text
  <subitem attr='value'>
    more random text
      <tag>text we want</tag>
  </subitem>
</root>`)

func main() {

    var err error
    var r Root = Root{}

    if err = xml.Unmarshal(xmlDoc, &r); err != nil {
        log.Panicln(err)
    }

    fmt.Printf("%#v\n", r)
}

rsc commented 2 years ago

Yes, and the question is whether, if the xml:",chardata" fields are missing and the cdata would otherwise be thrown on the floor, there should be a flag to make that an error, same as throwing an attribute or a sub-element on the floor.

johnnybubonic commented 2 years ago

@rsc AH, okay, I think I see. My confusion was stemming from a couple things:

My understanding of "CDATA" used in the original proposed addition referring to CDATA rather than "content within an element ('character data') not otherwise captured".
- The "Unknown" in the method name, which indicates a structural element to me (which confounded further my original "character data" vs. "CDATA" understanding)
There already being a way to capture and check for (otherwise) unserialized content within an element with the ,chardata xml tag option, of which there isn't an equivalent for elements and attributes without the proposed CL.
In e.g. the example above, random text is considered content of the root element per XML spec - so in the XML perspective, it's e.g. /root/text() in XPath 1.0,for instance (specifically, /root/text()[1]). In other words, it seems the purpose of this specific Disallow would be to avoid needing a ,chardata field for every struct and checking if it is nonempty?

It seems then that this particular Disallow is desired to catch content that exists that "shouldn't" due to malformatting in the original data rather than attempting to disallow mismatching structure (as random text in the above example is itself considered content rather than structure in XML; it's contained within an element within the document/DOM).

I resubmitted the original CL with the intent of getting closer feature parity with (encoding/json).Decoder (i.e. structural-only).

While I'm certain that some people may find a use for a DisallowUncapturedChardata or DisallowUncapturedText (which is what I'd recommend for naming instead of DisallowUnknownCDATA to avoid ambiguity and to better describe what it actually does), I believe this would best be addressed in a different proposal and CL as it addresses a different type of enforcement (content instead of structural) and has no JSON equivalent or analog (indeed, this particular situation itself has no JSON equivalent as JSON considers it invalid to mix content and structure within a structure field whereas XML considers it valid).

Is this acceptable?

rsc commented 2 years ago

Yes, ,chardata will allow accumulating unexpected text, but similarly ,any will allow accumulating unexpected elements and ,any,attr will allow accumulating unexpected attributes.

If ,chardata means that we don't need DisallowUnknownChardata, maybe ,any and ,any,attr mean we don't need these other disallows either?

rsc commented 2 years ago

Still would like to hear what people think in response to the questions in the previous comment.

escholtz commented 2 years ago

I think the ,any approach is reasonable and readable. The other advantage is that the caller can traverse the struct tree they are unmarshalling into and know where exactly (tree-wise) the error is occurring.

The second approach would be to accumulate all unknown elements, attributes, and chardata in a custom error type that gets returned at the end of unmarshalling (assuming the flag is set). This might be easier for callers with a large tree of structs (10s or 100s of different types) since they don't need to add ,any receiver fields to each one. Ideally we could also include byte offsets for each unknown which could then be converted to line/column by the caller. This enables more useful error messages when xml is being used for configuration. One disadvantage of this approach is that the location within the tree is unavailable. I suspect line/column will be more useful than tree location but I don't have any data to back that up.

rsc commented 2 years ago

Okay, so given that we have ,any and ,any,attr, does that mean we don't need to add this new mode being proposed?

rsc commented 2 years ago

Based on the discussion above, this proposal seems like a likely decline. — rsc for the proposal review group

rsc commented 2 years ago

No change in consensus, so declined. — rsc for the proposal review group

golang / go

proposal: encoding/xml: option to treat unknown fields as an error #30301