Open cmyr opened 5 years ago
Thanks for reaching out, these kind of feedbacks are very useful!
I agree this is not super ergonomic but in practice I don't feel this is really an issue. This is very subjective of course and i'd be more than happy to improve the current situation. Just to be sure we're on the same page we want to avoid both converting (checking) to str AND decoding when possible (we'd rather encode the matching keyword once, most of the time at compile time than decode all events).
Having 2 types of events is definitely a good idea and could help. I prefer the XmlSlice
idea because it is lazier and it doesn't hide the fact that the data is not guaranteed to be utf8. XmlSlice could deref to [u8]
and could either implement some TryFrom
or have dedicated to_str
methods.
Okay, glad to hear that you're receptive to this. I'm not sure if it's something I'm going to dig too deeply into, but if I do I'll propose a more thorough design first.
Can you clarify precisely what you mean by checking / decoding? Are you talking only about the source xml data, or are you also thinking about the constants that keys are matched against?
we want to avoid both converting (checking) to str AND decoding when possible
Rust performs those checks for a reason: failing to do so is a security vulnerability. Skipping decoding also makes it impossible to support ASCII-incompatible encodings like UTF-16 (which is required by the XML spec, by the way).
A correct and secure XML parser must always decode.
Right. I don't see why these checks wouldn't be performed here as well, thus I don't see how it'd impact security.
Granted it is a manual process but an absolutely mandatory one:
Event::Start(utf16_event) => {
let utf8_name = reader.decode(utf16_event.local_name())?;
match utf8_name {
"utf8 name1" => { ... }
"utf8 name2" => { ... }
}
Please correct me if I'm wrong.
As far as I know, quick-xml is the only one which actually lets you deal with non-utf8 data at all!
The way the api is done is to provide extra perf to developers who know the xml they are dealing with, which is very common in my experience.
Coming back to the security examples, quick-xml will not try to decode anything per default and will not try to delete or replace any character. True, it will NOT break per default if the data is invalid but I believe it will not let you insert anything malicious.
The way the api is done is to provide extra perf to developers who know the xml they are dealing with, which is very common in my experience.
I'm interested to know what your design goal is (both for this issue and general). Do you always want quick-xml
to default to (or be configurable for) maximum speed at all costs? or is it acceptable to make decisions that degrade performance a little (say, up to 5%, for the sake of having a number) for other goals: ergonomics, specification compliance, etc?
quick-xml
is an order of magnitude faster than xml-rs
; personally, if it were 5% slower than it is now but improved those other qualities, I'd still call it quick, and I'd be really happy.
Besides ergonomics, I think the current encoding design doesn't work. In particular, because of the assumption here that decoding is just something the caller does on text nodes as needed, quick-xml
only can handle UTF-8 and single-byte encodings that are compatible with ASCII (such as ISO-8859-1):
read_question_mark
won't take its &buf[1..4] == b"xml"
path and will never find the right encoding.I see there's issue #322 open about UTF-16 not working, and a test of UTF-16 is #[ignore]
d.
I think the most straightforward approach would be (as well as fixing the detection) to decode non-UTF-8 encodings into a buffer and feed that into the XML parsing. Then everything afterward is UTF-8. And then it can give str
to callers at no additional cost on those encodings (assuming encoding-rs
's output is valid), and just for the cost of std::str::from_utf8
on UTF-8. I believe a fair amount of effort has gone into optimizing std::str::from_utf8
, so I'm hopeful the performance degradation would be minimal.
btw:
Coming back to the security examples, quick-xml will not try to decode anything per default and will not try to delete or replace any character. True, it will NOT break per default if the data is invalid but I believe it will not let you insert anything malicious.
I'm not sure I agree with this. In particular, I suspect it's possible to do "interesting" things via non-canonical UTF-8 and the like. For example, I think the duplicate attribute detection works on the raw bytes now, so by not validating the UTF-8 it's possible to defeat it. Then maybe two implementations differ on which one they use, bypassing some validation logic or something. There are people who are really good at coming up with crazy scenarios vaguely like this that result in actual security holes in some system or another. I'm not really one of them but I'm doing my best to think like them.
I've been thinking about this a bit lately (and writing my thoughts on this PR) so I will copy them here where they are more relevant.
I agree with the OP and also, in particular, this comment
I think the most straightforward approach would be (as well as fixing the detection) to decode non-UTF-8 encodings into a buffer and feed that into the XML parsing. Then everything afterward is UTF-8. And then it can give str to callers at no additional cost on those encodings (assuming encoding-rs's output is valid), and just for the cost of std::str::from_utf8 on UTF-8. I believe a fair amount of effort has gone into optimizing std::str::from_utf8, so I'm hopeful the performance degradation would be minimal.
First, UTF-8 validation can be really really fast with the right libraries. If you look at the benchmarks in https://github.com/rusticstuff/simdutf8, even on years old hardware (Zen 2) it is possible to validate UTF-8 on large blocks of text at a speed of 150 gigabytes per second (for pure ascii) and 17 gigabtyes per second (for Cyrillic and Chinese text). In comparison, quick-xml even without doing any encoding checks only manages around 600-700 megabytes per second (measured using https://github.com/tafia/quick-xml/pull/418). So the cost could be pretty miniscule, 5% actually sounds roughly correct.
One thing you should notice though is that this only applies to large blocks of data. Regardless of whether you're using SIMD or the standard library, validating many small strings is much slower, in terms of throughput, than validating longer ones. So if you take the assumption that users will need to convert some significant portion of their XML to a UTF-8 encoded type (String
or &str
), pre-validating the buffer as UTF-8 could very easily be an overall performance win for most users, rather than being "more work".
Another issue is that if the user does need to support multiple encodings, with the current API (assuming utf-16 support were eventually implemented), they would need to decode things everywhere, which means a lot of small allocations. The performance cost of those allocations is likely to be as bad or worse than simply decoding everything up-front and working with UTF-8 directly.
So the performance case against this is actually not very clear cut. There are still probably scenarios where it would be a bit worse, but I would wager that those scenarios are not so common.
Combine this with some of the other points brought up:
Bytes*
, Attribute
and so forth) needing to know about decoding&str
in many places_and_decode
equivalentAnd I think, it actually does make sense to pre-decode all data as UTF-8. It would most certainly be simpler to use and maintain, we could quickly gain UTF-16 support mostly for "free", and it might actually be faster (but at least, probably not much slower).
I also ran the encoding_rs
benchmark suite to get an idea of how fast encoding / decoding UTF-16 is. It's good enough that I wouldn't expect more than about 30% overhead compared to UTF-8 if we decoded the whole buffer up front - and again, that's assuming:
[0] UTF-8 being more compact on average - it can be worse for Asian / Cyrillic character sets, but with XML you can assume there are a lot of tags, attribute names, and sentinel characters which are ASCII, which diminishes the advantage
I ran some additional benchmarks against quick-xml and it appears my assumptions are correct, at least as far as the utf-8 validation / str
conversion is concerned.
Using the macrobenches
benchmarks, with a Zen 2 R5 3600 CPU, I compared:
std::str::from_utf8()
on attribute names, values and text. reader.decoder().decode()
on event names, values and text. reader.decoder().decode()
on event names and .unescape_and_decode()
functions on attributes and textstd::str::from_utf8()
on the input buffer once per runreader.decoder().decode()
on the input buffer once per runsimdutf8::compat::from_utf8()
on the input buffer once per runOn the existing UTF-8 test inputs. The results:
[dalley@localhost quick-xml]$ critcmp compare baseline small_std_from_utf8 small_decode large_std_from_utf8 large_decode simdutf8 small_with_unescape_and_decode
group baseline large_decode large_std_from_utf8 simdutf8 small_decode small_std_from_utf8 small_with_unescape_and_decode
----- -------- ------------ ------------------- -------- ------------ ------------------- ------------------------------
fully_parse_document/document.xml 1.00 49.8±0.25µs 220.7 MB/sec 1.07 53.1±0.09µs 206.8 MB/sec 1.04 51.8±0.17µs 212.3 MB/sec 1.04 52.0±0.13µs 211.3 MB/sec 1.17 58.5±0.09µs 187.9 MB/sec 1.15 57.4±0.09µs 191.3 MB/sec 1.37 68.1±0.30µs 161.3 MB/sec
fully_parse_document/libreoffice_document.fodt 1.00 160.5±0.26µs 340.3 MB/sec 1.17 187.5±0.35µs 291.2 MB/sec 1.15 184.7±0.85µs 295.6 MB/sec 1.06 169.4±0.33µs 322.3 MB/sec 1.38 220.7±0.32µs 247.4 MB/sec 1.35 216.3±0.80µs 252.5 MB/sec 1.81 290.0±2.19µs 188.3 MB/sec
fully_parse_document/linescore.xml 1.00 10.5±0.05µs 337.7 MB/sec 1.13 11.8±0.83µs 299.6 MB/sec 1.06 11.1±0.03µs 317.8 MB/sec 1.07 11.2±0.04µs 315.5 MB/sec 1.13 11.8±0.20µs 298.2 MB/sec 1.10 11.5±0.03µs 307.2 MB/sec 1.53 16.0±0.23µs 221.2 MB/sec
fully_parse_document/players.xml 1.00 67.2±0.11µs 215.7 MB/sec 1.08 72.8±0.20µs 199.1 MB/sec 1.07 72.1±0.34µs 201.2 MB/sec 1.03 69.0±0.10µs 210.1 MB/sec 1.21 81.2±0.38µs 178.4 MB/sec 1.18 79.1±0.47µs 183.3 MB/sec 1.56 105.0±0.59µs 138.0 MB/sec
fully_parse_document/rpm_filelists.xml 1.00 35.6±0.05µs 308.5 MB/sec 1.11 39.3±0.05µs 279.2 MB/sec 1.10 39.3±0.09µs 279.7 MB/sec 1.08 38.5±0.04µs 285.1 MB/sec 1.22 43.4±0.07µs 253.2 MB/sec 1.19 42.3±0.08µs 259.6 MB/sec 1.62 57.8±0.65µs 190.0 MB/sec
fully_parse_document/rpm_other.xml 1.00 54.8±0.07µs 404.3 MB/sec 1.07 58.4±0.13µs 379.3 MB/sec 1.06 57.9±0.18µs 382.3 MB/sec 1.05 57.5±0.14µs 384.9 MB/sec 1.18 64.7±0.18µs 342.1 MB/sec 1.17 64.3±0.85µs 344.2 MB/sec 1.51 82.7±0.64µs 267.7 MB/sec
fully_parse_document/rpm_primary.xml 1.00 76.2±0.17µs 265.8 MB/sec 1.09 83.3±0.40µs 243.3 MB/sec 1.08 82.3±0.24µs 246.4 MB/sec 1.07 81.7±0.12µs 248.2 MB/sec 1.25 95.6±0.61µs 211.9 MB/sec 1.22 93.2±0.12µs 217.5 MB/sec 1.63 124.5±0.43µs 162.7 MB/sec
fully_parse_document/rpm_primary2.xml 1.00 25.0±0.18µs 287.4 MB/sec 1.09 27.1±0.08µs 264.5 MB/sec 1.07 26.8±0.08µs 267.9 MB/sec 1.07 26.6±0.05µs 269.1 MB/sec 1.22 30.3±0.24µs 236.4 MB/sec 1.20 29.9±0.09µs 240.1 MB/sec 1.63 40.6±0.24µs 176.5 MB/sec
fully_parse_document/sample_1.xml 1.00 4.2±0.01µs 260.9 MB/sec 1.09 4.6±0.01µs 240.4 MB/sec 1.07 4.5±0.01µs 244.5 MB/sec 1.07 4.5±0.01µs 243.9 MB/sec 1.13 4.8±0.03µs 230.6 MB/sec 1.12 4.7±0.02µs 233.0 MB/sec 1.47 6.2±0.02µs 177.5 MB/sec
fully_parse_document/sample_ns.xml 1.00 3.1±0.02µs 233.3 MB/sec 1.10 3.4±0.02µs 212.7 MB/sec 1.07 3.3±0.02µs 217.6 MB/sec 1.08 3.3±0.01µs 216.8 MB/sec 1.11 3.4±0.01µs 209.6 MB/sec 1.09 3.4±0.01µs 214.2 MB/sec 1.44 4.5±0.04µs 161.5 MB/sec
fully_parse_document/sample_rss.xml 1.00 302.4±0.38µs 623.6 MB/sec 1.09 328.4±5.62µs 574.2 MB/sec 1.06 321.3±0.57µs 587.0 MB/sec 1.07 322.8±0.30µs 584.3 MB/sec 1.28 388.5±1.65µs 485.4 MB/sec 1.26 382.2±1.28µs 493.4 MB/sec 1.65 499.0±1.16µs 378.0 MB/sec
fully_parse_document/test_writer_ident.xml 1.00 10.3±0.06µs 410.4 MB/sec 1.04 10.8±0.05µs 393.4 MB/sec 1.03 10.7±0.17µs 397.6 MB/sec 1.06 11.0±0.03µs 386.1 MB/sec 1.15 11.9±0.16µs 356.6 MB/sec 1.14 11.8±0.04µs 359.9 MB/sec 1.51 15.6±0.05µs 272.0 MB/sec
simdutf8
performs best as expected.unescape_and_decode()
functions (where applicable) destroyed performance, with additional overhead ranging between 40% and 80%While I didn't do any testing with UTF-16 yet, I expect similar results, given the impact that the additional small allocations of unescape_and_decode()
in this benchmark.
@tafia @Mingun, does this sound sufficiently compelling to consider adjusting the current approach? Is there any particular additional pieces of data which you would like to collect before making that decision?
The proposal, in summary:
&str
/ String
/ Cow<str>
rather than raw bytes, using known character boundaries to ensure the safety of unsafe fn std::str::from_utf8_unchecked()
, which would not incur any overheadAbout ergonomics: Is it possible to construct a BytesEnd
corresponding to an existing BytesStart
? This would be easy with BytesEnd::wrap
, but this isn't public. BytesEnd::new
takes a str
, that I would have to create manually from the [u8]
of BytesStart
, and this would fail on non-utf8 encodings. I can enclose the name in <…></…>
and create a new Reader
, but this is extremely unergonomic.
Based on the previous comment of @dralley and for implementation of UTF-16 with reasonable effort, I would prefer to convert to or validate as UTF-8 before parsing. Maybe quick-xml could offer both strategies: For example, a const bool UTF8
could be added to the XML Reader and Writer, and the XmlSlice type discussed here. If UTF8 = false
, readers and writers consume/produce [u8]
, and the XmlSlice needs to validate on conversion to str
. If UTF8 = true
, readers only accept str
as inputs, writers offer str
output, and XmlSlice can convert to str
without validation.
About ergonomics: Is it possible to construct a
BytesEnd
corresponding to an existingBytesStart
?
Yes. BytesStart::to_end().
Hi there,
I've just been playing around with quick-xml to handle reading and writing of Glyph Interchange Format files, and I had some observations I wanted to share.
[u8]
/Cow<[u8]>
vsCow<str>
My understanding is that
quick-xml
attempts to do the least work possible, and to this end exposes byte slices by default. I can understand the reasoning (don't allocate unless the user has asked for it explicitly) but there are two things I keep thinking about, here:utf-8 is common: the majority of XML I've been dealing with is encoded in utf-8. In this case, we should be able to provide direct access to string slices, allocating only if we need to escape something.
In a non-utf8 encoding, it is hard to use bytes correctly: exposing bytes directly encourages users to do things like
if attr.key == b"mykey"
(which is used in the readme). This is fine with an ascii-compatible encoding like utf-8 or Shift JIS, but will fail unexpectly on, say, utf-16; and utf-16 is an explicitly supported encoding according to the xml spec.API thoughts
I haven't thought about this too too much, but I do worry that the current behaviour has worse ergonomics than necessary in the general (utf-8) case, while also making it easy to write bugs in the exceptional (utf-16) case.
One possible alternative I could imagine would be having separate
Event
andEventRaw
types;Event
would work withCow<str>
, which would be slices of the underlying buffer where possible (when the encoding is utf-8) and which would be decoded and allocated during parsing, otherwise.EventRaw
would work like the currentevent
.Another option would be to have add some new type,
XmlSlice
, which would include a lazily-allocatingto_str
method, while also exposing access to the underlying bytes and even the encoding; this type would be used everywhere that[u8]
orCow<[u8]>
is, currently. This could also include information like the location of this text in the original document.thanks
I want to clarify that this is not a feature request or a concrete proposal for a new API; I just wanted to voice these thoughts and hear if anyone had anything to add. It may be that what I'm describing should just be an alternative crate, or that there are other design constraints I'm not aware of.
Thank you for the time you've put into this code so far!