golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
122.88k stars 17.52k forks source link

encoding/xml: accepts various ill-formed XML #68293

Open DemiMarie opened 2 months ago

DemiMarie commented 2 months ago

Go version

go version go1.21.11 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/user/.cache/go-build'
GOENV='/home/user/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/user/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/user/go'
GOPRIVATE=''
GOPROXY='direct'
GOROOT='/usr/lib/golang'
GOSUMDB='off'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/lib/golang/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.11'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3671854511=/tmp/go-build -gno-record-gcc-switches'

What did you do?

https://go.dev/play/p/r8y4cgcybkS

What did you see happen?

encoding/xml accepts ill-formed XML.

What did you expect to see?

encoding/xml should reject all ill-formed XML. Except for the last tryUnmarshal() call I linked, the constraints that it fails to check can be checked for without resolving namespaces, and therefore can be checked for even by RawToken().

See:

Edit: removed #53728 because it is about serialization, not parsing.

gabyhelp commented 2 months ago

Related Issues

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

ianlancetaylor commented 2 months ago

Unfortunately our experience with encoding/xml is that making it more strictly inevitably breaks existing working code.

DemiMarie commented 2 months ago

Does that apply to rejecting clearly ill-formed XML (#68294, #68295)? Could there be new APIs that take a strict bool flag?

One option for detecting future such bugs would be to differentially fuzz encoding/xml against libxml2, which is known for its standards conformance.

ianlancetaylor commented 2 months ago

We could introduce new API. Or in some cases it might be more appropriate to introduce a GODEBUG setting. But we can't make the package more strict without supporting easy fallbacks. Either approach would require a proposal; see https://go.dev/s/proposal.

Zxilly commented 2 months ago

Maybe it would be better to choose a 3rd party library to handle this? Do we have such an option now?

DemiMarie commented 2 months ago

Maybe it would be better to choose a 3rd party library to handle this? Do we have such an option now?

I’m not aware of an open-source pure-Go third-party XML tokenizer.

bjorndm commented 2 months ago

It seems the encoding/xml package is broken in various ways. The solution would be to make a new encoding/xml/v2 package ad to not break backwards compatibility. In doing so the API can also be improved. However this is a large effort, and I suspect someone else than the Go dev team will have to do it.

DemiMarie commented 2 months ago

It seems the encoding/xml package is broken in various ways.

That it is. There are several problems with it:

  1. It accepts directives by default. This requires either accepting ill-formed XML or performing complex checks on DTDs.
  2. It can return the namespace prefix or the namespace URL, but not both. This means that one cannot round-trip XML without reimplementing namespace tracking.
  3. encoding/xml returns leading and trailing whitespace in a well-formed document as tokens. Changing this might break backwards compatibility. The same problem arises for the XML declaration.
  4. There are no guarantees of round-trip stability. Those who need that must reimplement serialization themselves, or use a third-party validator.
  5. There is no support for schema validation. Right now, I believe that all current solutions use libxml2 bindings. There are tools that generate Go code from XML Schema Definition files, but I’m not sure if xml.Unmarshal is sufficiently expressive to guarantee that only documents that validate against the schema are accepted, though it might actually be.
bjorndm commented 2 months ago

@DemiMarie my meaning is that this would be a great occasion to "scratch your own itch" and make such a Go language package for the benefit of the whole community. The Go developers likely have their hands full with other issues.

DemiMarie commented 2 months ago

@bjorndm I don’t have anywhere near enough time for that, sorry. If they have interest, I think @russellhaering might be a good choice, since they have had to deal with the limitations of encoding/xml in their own libraries.

DemiMarie commented 1 month ago

My current thoughts:

  1. The current Name API is broken. Name needs to have both the prefix and the URL.
  2. The automatic conversion of namespace URLs to prefixes done during serialization is both buggy and ill-advised. I recommend requiring the caller to provide proper prefixes instead, and merely checking that the caller made a choice that results in well-formed and namespace-well-formed XML.
  3. The acceptance of arbitrary directives was a mistake.
  4. Only toplevel documents need to be supported.
  5. Round-trip stability should be guaranteed.
  6. Well-formedness should be checked during both parsing and serialization.
bjorndm commented 1 month ago

Sorry, but looking at the current use of the xml package your point 4 is not correct. For example for generating and parsing SVG or other XML documents it is necessary to generate and parse partial XML as well. For such use cases parsing is necessarily less strict.

DemiMarie commented 1 month ago

Do you have an example @bjorndm?

bjorndm commented 1 month ago

Well, excelize and svgo come to mind.

https://github.com/qax-os/excelize https://github.com/ajstarks/svgo

DemiMarie commented 1 month ago

Well, excelize and svgo come to mind.

https://github.com/qax-os/excelize https://github.com/ajstarks/svgo

Can you provide specific files that need to be parsed?

bjorndm commented 1 month ago

Excelize parses and generates Microsoft Excel xls files. Svgo generates scalable vector graphics. You can find examples of those everywhere on the web. The hard part is supporting all features of these file formats. The source code of the projects mentioned is more instructive there.

DemiMarie commented 1 month ago

Do either of those formats use DTDs? If not, they only need support for parsing & generating well-formed XML toplevel documents.

bjorndm commented 1 month ago

These formats can use DTDs. However, in practice, these two libraries and many others use encoding/xml to generate and parse XML fragments, often using xml.Marshal and XML Unmarshal. The generation of xml fragments is this an important use case.

DemiMarie commented 1 month ago

Are these fragments ever not well-formed themselves?

bjorndm commented 1 month ago

These fragments are likely to be well formed, but they are not complete XML documents as they do not have the headers.

DemiMarie commented 1 month ago

Well-formedness is sufficient here.