ergonomics & encodings - Githubissues

Hi there,

I've just been playing around with quick-xml to handle reading and writing of Glyph Interchange Format files, and I had some observations I wanted to share.

`[u8]`/`Cow<[u8]>` vs `Cow<str>`

My understanding is that quick-xml attempts to do the least work possible, and to this end exposes byte slices by default. I can understand the reasoning (don't allocate unless the user has asked for it explicitly) but there are two things I keep thinking about, here:

utf-8 is common: the majority of XML I've been dealing with is encoded in utf-8. In this case, we should be able to provide direct access to string slices, allocating only if we need to escape something.
In a non-utf8 encoding, it is hard to use bytes correctly: exposing bytes directly encourages users to do things like if attr.key == b"mykey" (which is used in the readme). This is fine with an ascii-compatible encoding like utf-8 or Shift JIS, but will fail unexpectly on, say, utf-16; and utf-16 is an explicitly supported encoding according to the xml spec.

API thoughts

I haven't thought about this too too much, but I do worry that the current behaviour has worse ergonomics than necessary in the general (utf-8) case, while also making it easy to write bugs in the exceptional (utf-16) case.

One possible alternative I could imagine would be having separate Event and EventRaw types; Event would work with Cow<str>, which would be slices of the underlying buffer where possible (when the encoding is utf-8) and which would be decoded and allocated during parsing, otherwise. EventRaw would work like the current event.

Another option would be to have add some new type, XmlSlice, which would include a lazily-allocating to_str method, while also exposing access to the underlying bytes and even the encoding; this type would be used everywhere that [u8] or Cow<[u8]> is, currently. This could also include information like the location of this text in the original document.

thanks

I want to clarify that this is not a feature request or a concrete proposal for a new API; I just wanted to voice these thoughts and hear if anyone had anything to add. It may be that what I'm describing should just be an alternative crate, or that there are other design constraints I'm not aware of.

Thank you for the time you've put into this code so far!

Thanks for reaching out, these kind of feedbacks are very useful!

I agree this is not super ergonomic but in practice I don't feel this is really an issue. This is very subjective of course and i'd be more than happy to improve the current situation. Just to be sure we're on the same page we want to avoid both converting (checking) to str AND decoding when possible (we'd rather encode the matching keyword once, most of the time at compile time than decode all events).

Having 2 types of events is definitely a good idea and could help. I prefer the XmlSlice idea because it is lazier and it doesn't hide the fact that the data is not guaranteed to be utf8. XmlSlice could deref to [u8] and could either implement some TryFrom or have dedicated to_str methods.

Okay, glad to hear that you're receptive to this. I'm not sure if it's something I'm going to dig too deeply into, but if I do I'll propose a more thorough design first.

Can you clarify precisely what you mean by checking / decoding? Are you talking only about the source xml data, or are you also thinking about the constants that keys are matched against?

we want to avoid both converting (checking) to str AND decoding when possible

Rust performs those checks for a reason: failing to do so is a security vulnerability. Skipping decoding also makes it impossible to support ASCII-incompatible encodings like UTF-16 (which is required by the XML spec, by the way).

A correct and secure XML parser must always decode.

Right. I don't see why these checks wouldn't be performed here as well, thus I don't see how it'd impact security.

Granted it is a manual process but an absolutely mandatory one:

Event::Start(utf16_event) => {
    let utf8_name =  reader.decode(utf16_event.local_name())?;
    match utf8_name {
         "utf8 name1" => { ... }
         "utf8 name2" => { ... }
    }

Please correct me if I'm wrong.

As far as I know, quick-xml is the only one which actually lets you deal with non-utf8 data at all!

The way the api is done is to provide extra perf to developers who know the xml they are dealing with, which is very common in my experience.

Coming back to the security examples, quick-xml will not try to decode anything per default and will not try to delete or replace any character. True, it will NOT break per default if the data is invalid but I believe it will not let you insert anything malicious.

The way the api is done is to provide extra perf to developers who know the xml they are dealing with, which is very common in my experience.

I'm interested to know what your design goal is (both for this issue and general). Do you always want quick-xml to default to (or be configurable for) maximum speed at all costs? or is it acceptable to make decisions that degrade performance a little (say, up to 5%, for the sake of having a number) for other goals: ergonomics, specification compliance, etc?

quick-xml is an order of magnitude faster than xml-rs; personally, if it were 5% slower than it is now but improved those other qualities, I'd still call it quick, and I'd be really happy.

Besides ergonomics, I think the current encoding design doesn't work. In particular, because of the assumption here that decoding is just something the caller does on text nodes as needed, quick-xml only can handle UTF-8 and single-byte encodings that are compatible with ASCII (such as ISO-8859-1):

It doesn't use external encoding information and doesn't detect the encoding based on the opening bytes (as in Appendix F of the spec), so read_question_mark won't take its &buf[1..4] == b"xml" path and will never find the right encoding.
Even if it did pick the right encoding, subsequent stuff likewise seems to get compared and sliced assuming its encoding matches ASCII, so UTF-16, ~~Shift-JIS,~~ [edit: actually, Shift-JIS is almost ASCII-compatible according to wikipedia], EBCDIC, etc. won't work.

I see there's issue #322 open about UTF-16 not working, and a test of UTF-16 is #[ignore]d.

I think the most straightforward approach would be (as well as fixing the detection) to decode non-UTF-8 encodings into a buffer and feed that into the XML parsing. Then everything afterward is UTF-8. And then it can give str to callers at no additional cost on those encodings (assuming encoding-rs's output is valid), and just for the cost of std::str::from_utf8 on UTF-8. I believe a fair amount of effort has gone into optimizing std::str::from_utf8, so I'm hopeful the performance degradation would be minimal.

btw:

Coming back to the security examples, quick-xml will not try to decode anything per default and will not try to delete or replace any character. True, it will NOT break per default if the data is invalid but I believe it will not let you insert anything malicious.

I'm not sure I agree with this. In particular, I suspect it's possible to do "interesting" things via non-canonical UTF-8 and the like. For example, I think the duplicate attribute detection works on the raw bytes now, so by not validating the UTF-8 it's possible to defeat it. Then maybe two implementations differ on which one they use, bypassing some validation logic or something. There are people who are really good at coming up with crazy scenarios vaguely like this that result in actual security holes in some system or another. I'm not really one of them but I'm doing my best to think like them.

I've been thinking about this a bit lately (and writing my thoughts on this PR) so I will copy them here where they are more relevant.

I agree with the OP and also, in particular, this comment

I think the most straightforward approach would be (as well as fixing the detection) to decode non-UTF-8 encodings into a buffer and feed that into the XML parsing. Then everything afterward is UTF-8. And then it can give str to callers at no additional cost on those encodings (assuming encoding-rs's output is valid), and just for the cost of std::str::from_utf8 on UTF-8. I believe a fair amount of effort has gone into optimizing std::str::from_utf8, so I'm hopeful the performance degradation would be minimal.

First, UTF-8 validation can be really really fast with the right libraries. If you look at the benchmarks in https://github.com/rusticstuff/simdutf8, even on years old hardware (Zen 2) it is possible to validate UTF-8 on large blocks of text at a speed of 150 gigabytes per second (for pure ascii) and 17 gigabtyes per second (for Cyrillic and Chinese text). In comparison, quick-xml even without doing any encoding checks only manages around 600-700 megabytes per second (measured using https://github.com/tafia/quick-xml/pull/418). So the cost could be pretty miniscule, 5% actually sounds roughly correct.

One thing you should notice though is that this only applies to large blocks of data. Regardless of whether you're using SIMD or the standard library, validating many small strings is much slower, in terms of throughput, than validating longer ones. So if you take the assumption that users will need to convert some significant portion of their XML to a UTF-8 encoded type (String or &str), pre-validating the buffer as UTF-8 could very easily be an overall performance win for most users, rather than being "more work".

Another issue is that if the user does need to support multiple encodings, with the current API (assuming utf-16 support were eventually implemented), they would need to decode things everywhere, which means a lot of small allocations. The performance cost of those allocations is likely to be as bad or worse than simply decoding everything up-front and working with UTF-8 directly.

So the performance case against this is actually not very clear cut. There are still probably scenarios where it would be a bit worse, but I would wager that those scenarios are not so common.

Combine this with some of the other points brought up:

the testing burden of testing parsing code against every encoding
the maintainence burden of maintaining parsing code that works for every encoding
the maintainence burden of every struct (Bytes*, Attribute and so forth) needing to know about decoding
the user frustration of needing to change all of your code if you used the simple utf-8 methods and then suddenly need to parse a utf-16 document and nothing works anymore without changing the code
the user burden of needing to convert to &str in many places
the API bloat where many functions need an _and_decode equivalent

And I think, it actually does make sense to pre-decode all data as UTF-8. It would most certainly be simpler to use and maintain, we could quickly gain UTF-16 support mostly for "free", and it might actually be faster (but at least, probably not much slower).

I also ran the encoding_rs benchmark suite to get an idea of how fast encoding / decoding UTF-16 is. It's good enough that I wouldn't expect more than about 30% overhead compared to UTF-8 if we decoded the whole buffer up front - and again, that's assuming:

the user isn't decoding most of the data anyway, in which case it would likely flip the other way and be faster on average
parsing the UTF-16 natively can be done equally as fast as UTF-8. I somewhat doubt this. Searching for single bytes in a smaller [0] amount of data is easy for the computer and gets a ton of attention in terms of optimization.

[0] UTF-8 being more compact on average - it can be worse for Asian / Cyrillic character sets, but with XML you can assume there are a lot of tags, attribute names, and sentinel characters which are ASCII, which diminishes the advantage

I ran some additional benchmarks against quick-xml and it appears my assumptions are correct, at least as far as the utf-8 validation / str conversion is concerned.

Using the macrobenches benchmarks, with a Zen 2 R5 3600 CPU, I compared:

the baseline - where we get back raw bytes and do nothing to them, as that benchmark currently does
calling std::str::from_utf8() on attribute names, values and text.
calling reader.decoder().decode() on event names, values and text.
calling reader.decoder().decode() on event names and .unescape_and_decode() functions on attributes and text
calling std::str::from_utf8() on the input buffer once per run
calling reader.decoder().decode() on the input buffer once per run
calling simdutf8::compat::from_utf8() on the input buffer once per run

On the existing UTF-8 test inputs. The results:

[dalley@localhost quick-xml]$ critcmp compare baseline small_std_from_utf8 small_decode large_std_from_utf8 large_decode simdutf8 small_with_unescape_and_decode
group                                             baseline                               large_decode                           large_std_from_utf8                    simdutf8                               small_decode                           small_std_from_utf8                    small_with_unescape_and_decode
-----                                             --------                               ------------                           -------------------                    --------                               ------------                           -------------------                    ------------------------------
fully_parse_document/document.xml                 1.00     49.8±0.25µs   220.7 MB/sec    1.07     53.1±0.09µs   206.8 MB/sec    1.04     51.8±0.17µs   212.3 MB/sec    1.04     52.0±0.13µs   211.3 MB/sec    1.17     58.5±0.09µs   187.9 MB/sec    1.15     57.4±0.09µs   191.3 MB/sec    1.37     68.1±0.30µs   161.3 MB/sec
fully_parse_document/libreoffice_document.fodt    1.00    160.5±0.26µs   340.3 MB/sec    1.17    187.5±0.35µs   291.2 MB/sec    1.15    184.7±0.85µs   295.6 MB/sec    1.06    169.4±0.33µs   322.3 MB/sec    1.38    220.7±0.32µs   247.4 MB/sec    1.35    216.3±0.80µs   252.5 MB/sec    1.81    290.0±2.19µs   188.3 MB/sec
fully_parse_document/linescore.xml                1.00     10.5±0.05µs   337.7 MB/sec    1.13     11.8±0.83µs   299.6 MB/sec    1.06     11.1±0.03µs   317.8 MB/sec    1.07     11.2±0.04µs   315.5 MB/sec    1.13     11.8±0.20µs   298.2 MB/sec    1.10     11.5±0.03µs   307.2 MB/sec    1.53     16.0±0.23µs   221.2 MB/sec
fully_parse_document/players.xml                  1.00     67.2±0.11µs   215.7 MB/sec    1.08     72.8±0.20µs   199.1 MB/sec    1.07     72.1±0.34µs   201.2 MB/sec    1.03     69.0±0.10µs   210.1 MB/sec    1.21     81.2±0.38µs   178.4 MB/sec    1.18     79.1±0.47µs   183.3 MB/sec    1.56    105.0±0.59µs   138.0 MB/sec
fully_parse_document/rpm_filelists.xml            1.00     35.6±0.05µs   308.5 MB/sec    1.11     39.3±0.05µs   279.2 MB/sec    1.10     39.3±0.09µs   279.7 MB/sec    1.08     38.5±0.04µs   285.1 MB/sec    1.22     43.4±0.07µs   253.2 MB/sec    1.19     42.3±0.08µs   259.6 MB/sec    1.62     57.8±0.65µs   190.0 MB/sec
fully_parse_document/rpm_other.xml                1.00     54.8±0.07µs   404.3 MB/sec    1.07     58.4±0.13µs   379.3 MB/sec    1.06     57.9±0.18µs   382.3 MB/sec    1.05     57.5±0.14µs   384.9 MB/sec    1.18     64.7±0.18µs   342.1 MB/sec    1.17     64.3±0.85µs   344.2 MB/sec    1.51     82.7±0.64µs   267.7 MB/sec
fully_parse_document/rpm_primary.xml              1.00     76.2±0.17µs   265.8 MB/sec    1.09     83.3±0.40µs   243.3 MB/sec    1.08     82.3±0.24µs   246.4 MB/sec    1.07     81.7±0.12µs   248.2 MB/sec    1.25     95.6±0.61µs   211.9 MB/sec    1.22     93.2±0.12µs   217.5 MB/sec    1.63    124.5±0.43µs   162.7 MB/sec
fully_parse_document/rpm_primary2.xml             1.00     25.0±0.18µs   287.4 MB/sec    1.09     27.1±0.08µs   264.5 MB/sec    1.07     26.8±0.08µs   267.9 MB/sec    1.07     26.6±0.05µs   269.1 MB/sec    1.22     30.3±0.24µs   236.4 MB/sec    1.20     29.9±0.09µs   240.1 MB/sec    1.63     40.6±0.24µs   176.5 MB/sec
fully_parse_document/sample_1.xml                 1.00      4.2±0.01µs   260.9 MB/sec    1.09      4.6±0.01µs   240.4 MB/sec    1.07      4.5±0.01µs   244.5 MB/sec    1.07      4.5±0.01µs   243.9 MB/sec    1.13      4.8±0.03µs   230.6 MB/sec    1.12      4.7±0.02µs   233.0 MB/sec    1.47      6.2±0.02µs   177.5 MB/sec
fully_parse_document/sample_ns.xml                1.00      3.1±0.02µs   233.3 MB/sec    1.10      3.4±0.02µs   212.7 MB/sec    1.07      3.3±0.02µs   217.6 MB/sec    1.08      3.3±0.01µs   216.8 MB/sec    1.11      3.4±0.01µs   209.6 MB/sec    1.09      3.4±0.01µs   214.2 MB/sec    1.44      4.5±0.04µs   161.5 MB/sec
fully_parse_document/sample_rss.xml               1.00    302.4±0.38µs   623.6 MB/sec    1.09    328.4±5.62µs   574.2 MB/sec    1.06    321.3±0.57µs   587.0 MB/sec    1.07    322.8±0.30µs   584.3 MB/sec    1.28    388.5±1.65µs   485.4 MB/sec    1.26    382.2±1.28µs   493.4 MB/sec    1.65    499.0±1.16µs   378.0 MB/sec
fully_parse_document/test_writer_ident.xml        1.00     10.3±0.06µs   410.4 MB/sec    1.04     10.8±0.05µs   393.4 MB/sec    1.03     10.7±0.17µs   397.6 MB/sec    1.06     11.0±0.03µs   386.1 MB/sec    1.15     11.9±0.16µs   356.6 MB/sec    1.14     11.8±0.04µs   359.9 MB/sec    1.51     15.6±0.05µs   272.0 MB/sec

Validating UTF-8 on the entire input buffer has an additional overhead of between 5% and 10% in most cases, and simdutf8 performs best as expected
Validating UTF-8 on individual small strings has an additional overhead ranging between 15% and 30% in most cases
Using .unescape_and_decode() functions (where applicable) destroyed performance, with additional overhead ranging between 40% and 80%

While I didn't do any testing with UTF-16 yet, I expect similar results, given the impact that the additional small allocations of unescape_and_decode() in this benchmark.

@tafia @Mingun, does this sound sufficiently compelling to consider adjusting the current approach? Is there any particular additional pieces of data which you would like to collect before making that decision?

The proposal, in summary:

The data being streamed from the reader would be converted from whichever encoding it is currently in to UTF-8 when copied to a buffer, or else validated as UTF-8 if that is the appropriate encoding, using streaming APIs
UTF-8 would be the only encoding used within the library itself
All decoding functions would be removed from the API
quick-xml would never have to allocate during parsing except from resizing a buffer or doing (un)escaping
All API functions would be changed to return &str / String / Cow<str> rather than raw bytes, using known character boundaries to ensure the safety of unsafe fn std::str::from_utf8_unchecked(), which would not incur any overhead

About ergonomics: Is it possible to construct a BytesEnd corresponding to an existing BytesStart? This would be easy with BytesEnd::wrap, but this isn't public. BytesEnd::new takes a str, that I would have to create manually from the [u8] of BytesStart, and this would fail on non-utf8 encodings. I can enclose the name in <…></…> and create a new Reader, but this is extremely unergonomic.

Based on the previous comment of @dralley and for implementation of UTF-16 with reasonable effort, I would prefer to convert to or validate as UTF-8 before parsing. Maybe quick-xml could offer both strategies: For example, a const bool UTF8 could be added to the XML Reader and Writer, and the XmlSlice type discussed here. If UTF8 = false, readers and writers consume/produce [u8], and the XmlSlice needs to validate on conversion to str. If UTF8 = true, readers only accept str as inputs, writers offer str output, and XmlSlice can convert to str without validation.

About ergonomics: Is it possible to construct a BytesEnd corresponding to an existing BytesStart?

Yes. BytesStart::to_end().

tafia / quick-xml

ergonomics & encodings #158

`[u8]`/`Cow<[u8]>` vs `Cow<str>`

API thoughts

thanks

tafia / quick-xml

ergonomics & encodings #158

[u8]/Cow<[u8]> vs Cow<str>

API thoughts

thanks

`[u8]`/`Cow<[u8]>` vs `Cow<str>`