pdvrieze / xmlutil

XML Serialization library for Kotlin
https://pdvrieze.github.io/xmlutil/
Apache License 2.0

Performance issue for large data / kotlinx-io #217

Open westnordost opened 3 weeks ago

westnordost commented 3 weeks ago

TLDR: Library user experiences performance issues and requests support for kotlinx-io streaming primitives.


As one part of making an app multiplatform, I replaced Java HttpUrlConnection + XML pull parsing with ktor-client + xmlutil + kotlinx-serialization. Unfortunately, performance, at least when parsing large data (tested with ~10MB), got 3-6 times worse. (See StreetComplete#5686 for a short analysis and comparison.)

I suspect that using xmlutil's streaming parser instead of deserializing the whole data structure with kotlinx-serialization might improve this somewhat, because theoretically xmlutil should be able to start reading the bytes as they come in over the wire, on a separate thread from the one that receives the bytes. (But ultimately, not knowing the internals, I can only guess. If you have any other suspicions about the cause of the performance issue, let me know!)

The documentation is a bit thin on streaming, but I understand I need to call xmlStreaming.newGenericReader(Reader) to get an XmlReader, which is an XML pull parser interface. However, I need to supply a (Java) Reader, so currently there seems to be no interface for stream parsing on multiplatform.
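For concreteness, this is roughly what that entry point looks like on the JVM (a sketch only; the wildcard import is there to gloss over exact package locations, which may differ between xmlutil versions):

```kotlin
import java.io.StringReader
import nl.adaptivity.xmlutil.*

// newGenericReader takes a java.io.Reader, so this only works on the JVM.
// The returned XmlReader is a pull parser: events are consumed one at a time
// (see the hand-written loop in my follow-up comment below).
val reader: XmlReader = xmlStreaming.newGenericReader(StringReader("<osm version=\"0.6\"/>"))
```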

I understand that byte streaming support, and consequently text streaming built on top of it, is a bit higgledy-piggledy in the Kotlin ecosystem right now, because a replacement for InputStream etc. has never been available in the Kotlin standard library. So every library that does something with IO implemented its own thing, if anything at all - sometimes based on okio, sometimes a homegrown implementation.

However, now it seems like things are about to get better: Both ktor and apparently kotlinx-serialization are being migrated to use kotlinx-io and thus the common interfaces like Sink, Source and Buffer, which are replacements for InputStream et al.

So, in case you agree that my performance issue with large data most likely stems from the lack of XML streaming, I guess my request would be to move along with kotlinx-serialization and ktor and support a common interface for streaming bytes and text.

westnordost commented 3 weeks ago

I did two experiments:

1. Decoding from a stream instead

In my code, I replaced xml.decodeFromString<ApiOsm>(string) ... with ... xml.decodeFromReader<ApiOsm>(xmlStreaming.newReader(inputStream.bufferedReader()))

However, it takes about the same time; it is actually marginally slower than decoding from a string. Also, my assumption (or hope) that the parser could start parsing the incoming bytes while the http client is still receiving them in parallel is apparently wrong. No idea why, but that would be on the ktor developers, anyway. (Edit: created an issue there)
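For reference, the two variants I compared look roughly like this. ApiOsm stands in for my actual @Serializable response model, and the imports are sketched from memory, so they may need adjusting:

```kotlin
import java.io.InputStream
import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import nl.adaptivity.xmlutil.*
import nl.adaptivity.xmlutil.serialization.*

@Serializable
data class ApiOsm(val version: String = "") // placeholder for the real response model

private val xml = XML { }

// Variant 1: buffer the whole response into a String first, then decode it.
fun parseFromString(body: String): ApiOsm =
    xml.decodeFromString<ApiOsm>(body)

// Variant 2: decode directly from the response stream through an XmlReader.
fun parseFromStream(body: InputStream): ApiOsm =
    xml.decodeFromReader<ApiOsm>(xmlStreaming.newReader(body.bufferedReader()))
```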

2. Writing a parser manually, using xmlutil's XmlReader

A manually written parser brings the desired performance gain, cutting the parsing time to about a third. It is still a bit slower than before, but not by much.

Parsing from a stream of bytes is (again) about 25% slower than parsing from a string. When taking into account the time it takes to read the bytes into a string first, it amounts to exactly the same time. (And in experiment one I already observed that ktor apparently doesn't make the input stream available until the response has been completely received...)
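The hand-written parser is essentially a loop over the XmlReader events. A rough sketch of the approach (the element and attribute names are just illustrative, not my real OSM model, and the wildcard import again glosses over exact package locations):

```kotlin
import java.io.Reader
import nl.adaptivity.xmlutil.*

// Illustrative only: collect the "id" attribute of every <node> element.
fun parseNodeIds(input: Reader): List<Long> {
    val ids = ArrayList<Long>()
    val reader = xmlStreaming.newGenericReader(input)
    try {
        while (reader.hasNext()) {
            if (reader.next() != EventType.START_ELEMENT || reader.localName != "node") continue
            for (i in 0 until reader.attributeCount) {
                if (reader.getAttributeLocalName(i) == "id") {
                    reader.getAttributeValue(i).toLongOrNull()?.let { ids += it }
                }
            }
        }
    } finally {
        reader.close()
    }
    return ids
}
```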


In conclusion, streaming doesn't solve anything (for my use case) and for some reason either kotlinx-serialization or xmlutil's implementation seems to be quite slow.

pdvrieze commented 3 weeks ago

The generic parser on the JVM supports reading directly from InputStreams (with encoding detection) using the KtXmlReader factory function. If you want to use a "native" (StAX) reader, you can just create it and get an XmlReader by calling StAXReader(myReader).
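Roughly like this (a sketch only; the file name is an arbitrary example and the exact package locations may differ by version):

```kotlin
import java.io.FileInputStream
import java.io.InputStreamReader
import nl.adaptivity.xmlutil.*
import nl.adaptivity.xmlutil.core.KtXmlReader

// Generic parser reading bytes directly, with encoding detection:
val generic: XmlReader = KtXmlReader(FileInputStream("data.xml"))

// Native StAX-based reader on the JVM, wrapping a character stream:
val native: XmlReader = StAXReader(InputStreamReader(FileInputStream("data.xml"), Charsets.UTF_8))
```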

I've been taking an initial stab at supporting kotlinx.io, which is finally in a usable state. However, it is held up a bit because kotlinx.io doesn't support encodings (and I haven't had time to take a decent stab at dealing with that added complexity).

As to performance, I would very much welcome a good performance test suite. As I haven't looked at it explicitly, I'm not surprised that it is not optimal. Given all that it does it would never match a hand-rolled parser, but I expect there are quite a few optimization options (hence the need to create a test suite to measure it - I may repurpose the xmlschema test suite for that; I already include it in the work-in-progress xmlschema branch, and it is big).

westnordost commented 2 weeks ago

I am not able to completely follow. Is this about newReader(Reader) vs newGenericReader(Reader)? I haven't understood the difference between the two.

Given all that it does it would never match a hand-rolled parser

Right. The experiment above was made with your xmlutil XmlParser, though, so it is not completely hand-rolled.

kotlinx.io doesn't support encodings

You mean any decoding to string at all, or just anything other than UTF-8?

pdvrieze commented 2 weeks ago

The difference between newReader and newGenericReader is that newGenericReader will always return a KtXmlReader instance, and will thus behave the same on each platform (modulo unsupported encodings). newReader on JVM/Android/Browser will have different implementations.
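In code, the difference is just which factory function you call (a sketch, with the wildcard import glossing over exact package locations):

```kotlin
import java.io.StringReader
import nl.adaptivity.xmlutil.*

// Platform-specific implementation; differs between JVM, Android and the browser:
val platformSpecific: XmlReader = xmlStreaming.newReader(StringReader("<a/>"))

// Always the common KtXmlReader, so identical behaviour on every platform:
val generic: XmlReader = xmlStreaming.newGenericReader(StringReader("<a/>"))
```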

What I meant with kotlinx.io decoding is that it doesn't support the byte to character conversion (charset decoding/encoding).

Btw, I've had a look at performance. I've done some testing on parsing XML schema (I already had that going, and just needed to create the performance test). I've cut off some big chunks of time already, but there is probably a lot more that can be optimized. The main issue seems to be in building the descriptors (and if you want speed, don't put in order restrictions).

westnordost commented 2 weeks ago

What I meant with kotlinx.io decoding is that it doesn't support the byte to character conversion (charset decoding/encoding).

There's public fun Source.readString(byteCount: Long): String and Sink.writeString(string: String, startIndex: Int = 0, endIndex: Int = string.length) (since 0.3.5) for UTF-8 strings

https://github.com/Kotlin/kotlinx-io/blob/7c4d095d2fe208abe1eb98a5e07c9b5a95cf3dc8/core/common/src/Utf8.kt#L232

The JVM implementation includes a readString overload where a charset can be specified. (But for my use case, and probably for most other people's, UTF-8 is all they need anyway.)
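For illustration, a UTF-8 round trip through a kotlinx-io Buffer (which is both a Source and a Sink) looks roughly like this:

```kotlin
import kotlinx.io.Buffer
import kotlinx.io.readString
import kotlinx.io.writeString

fun main() {
    val buffer = Buffer()                       // a Buffer is both a Sink and a Source
    buffer.writeString("<node lat=\"59.3\"/>")  // encodes the string into the buffer as UTF-8
    val xml = buffer.readString()               // decodes all buffered bytes back as UTF-8
    println(xml)                                // prints the original markup
}
```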

Though, what most people probably want is a way to encapsulate that in the way Reader does it on Java, and it would make sense to add such a class to kotlinx-io directly rather than leave it up to every single parser/serializer lib that does something with text to implement it. I am not acquainted with kotlinx-io at all yet, so I am not sure whether they already have or plan something like that; otherwise it would make sense to create an issue ticket over there.

Btw, I've had a look at performance. I've done some testing on parsing XML schema (I already had that going, and just needed to create the performance test). I've cut off some big chunks of time already

🥳, that's great! On kotlinx-serialization in particular, or even in the core?

pdvrieze commented 2 weeks ago

I already have some UTF-8 support in the library, as it is needed for native serialization (it actually has a FileInputStream implementation). Source.readString and Sink.writeString are not that helpful for me, as the infrastructure reads character by character. Again, I already have UTF-8 code in the library, but I need to move it and create a way for it to work with charsets.

As to performance, this is mainly in the way the descriptors (that drive the serialization) are created (there was way too much "convenient" code in there). I haven't actually done anything on the XML parsing itself, only the serialization. I've already cut the time used to parse 10000 xmlschema documents in half (over 10 iterations, following a warmup run - the warmup is significant here).

pdvrieze commented 2 days ago

I have done quite some optimization work on KtXmlReader (shaving at least 30% off the parsing time on the XML schema test suite - from approx. 270 ms to 200 ms). The deserialization has also been optimized: while it was in the 12000 ms range, it is now at approx. 270 ms for a list of XML events, or 478 ms for both. It is also worth noting that at least half of the parsing time is taken up by deflating from the resource jar that is used for the testing. I've also added an unchecked flag to the deserialization options that removes some correctness checks.

westnordost commented 2 days ago

Awesome!

westnordost commented 2 days ago

I do expect that an additional, and most significant, speed boost for parsing large XML data in a multiplatform context (no Java) will arrive when xmlutil allows parsing from kotlinx-io's Source (and HTTP libraries like Ktor provide a Source for the response body), because then parsing can take place in parallel to reading the data from disk.

So, xmlutil would be done parsing right after the last bytes were read.

pdvrieze commented 2 days ago

If you want to take a look, I've pushed the "optimization" branch with this work.

Btw. there is actually already a native implementation of InputStream (and FileInputStream) that works with native files.