Speed up kart diff - Githubissues

koordinates / kart

Distributed version-control for geospatial and tabular data

https://kartproject.org


Speed up kart diff #1018

Open craigds opened 2 weeks ago

craigds commented 2 weeks ago

Generating a large diff using Kart is quite slow:

$ time kart diff '[EMPTY]...e13cee173cf1a3d14456c7f900dd8763ec321ff7' -o json-lines | pv > /dev/null
2.12GiB 0:03:50 [9.42MiB/s] [                              <=>                                                            ]

real    3m50.265s
user    0m0.134s
sys     0m1.084s

i.e. this full diff is 2.12 GiB and takes 230 s for Kart to generate as JSON Lines, at roughly 9.4 MiB/s.

Describe the solution you'd like

Ideally these diffs could be made 5-10x as fast.

We should start with some profiling, although I suspect the limiting factor is git ODB access, which we've found difficult to speed up in the past. If that's the case then the most obvious speedup would be using multiple threads to fetch objects from the ODB.
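The multi-threaded fetch idea could be sketched roughly as follows. This is a minimal illustration, not Kart's code: `read_blob` is a hypothetical callable standing in for whatever actually reads an object from the ODB, and it assumes that read releases the GIL (or is I/O bound), which would need to be confirmed by profiling.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, Tuple


def fetch_blobs_threaded(
    oids: Iterable[str],
    read_blob: Callable[[str], bytes],
    max_workers: int = 8,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (oid, data) pairs in input order, reading blobs on worker threads.

    `read_blob` is a hypothetical stand-in for the real ODB read.
    """
    oid_list = list(oids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so the order of the
        # diff output would not change even though reads overlap.
        yield from zip(oid_list, pool.map(read_blob, oid_list))


if __name__ == "__main__":
    # A dict plays the role of the object database in this demo.
    fake_odb = {"a1": b"old feature", "b2": b"new feature"}
    for oid, data in fetch_blobs_threaded(["b2", "a1"], fake_odb.__getitem__):
        print(oid, len(data))
```

Note that `Executor.map` submits every task up front, so for a diff with millions of features a real implementation would probably want to feed the pool in bounded batches to keep memory use flat.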

olsen232 commented 2 weeks ago

Here is some rough profiling data from running a large diff in a relatively recent Kart (I think v0.15.0). The numbers will obviously depend somewhat on the system and on exactly what is being diffed.

15% generate "raw" diff from ODB using pygit2 and convert it to a kart.diff_structs.Diff object (PKs are decoded, but blob data is not yet loaded)
32% load data from ODB blobs and lazily create data for each Delta object
  includes 16% load data for old blobs
  includes 16% load data for new blobs
7% convert geometry to hex WKB
29% dump JSON
10% "handle system exit" ????

total: 93% - leaves 7% "misc"
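The thread does not say which profiler produced these figures. One way to collect a comparable breakdown is Python's built-in `cProfile`; in the sketch below, `generate_diff` is a hypothetical stand-in for the real diff code path, not a Kart function.

```python
import cProfile
import io
import pstats


def generate_diff(n: int = 20000) -> int:
    """Hypothetical stand-in for the code path that produces diff output."""
    total = 0
    for i in range(n):
        total += len(str(i))
    return total


def profile_call(func, *args, top: int = 10) -> str:
    """Run func under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # "cumulative" sorts by time spent in a function plus everything it calls,
    # which is the view needed for a per-stage percentage breakdown.
    stats.sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


if __name__ == "__main__":
    print(profile_call(generate_diff))
```

Dividing each stage's cumulative time by the total run time gives percentages like the ones listed above.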

More threads are certainly worth trying. We could also try a different JSON library, e.g. https://github.com/ijl/orjson
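As a rough illustration of what swapping the JSON library might look like for JSON Lines output, here is a sketch that uses orjson when it is installed and falls back to the standard library otherwise. The function names are made up for this example and are not Kart's.

```python
import json
import sys

try:
    import orjson  # third-party and optional for this sketch
except ImportError:
    orjson = None


def dump_json_line(obj) -> bytes:
    """Serialise one object as a compact JSON line (bytes ending in a newline)."""
    if orjson is not None:
        # orjson.dumps returns bytes; OPT_APPEND_NEWLINE adds the trailing "\n".
        return orjson.dumps(obj, option=orjson.OPT_APPEND_NEWLINE)
    # Stdlib fallback. Note it escapes non-ASCII characters by default,
    # whereas orjson emits UTF-8; both forms are valid JSON.
    return json.dumps(obj, separators=(",", ":")).encode("utf-8") + b"\n"


def write_json_lines(objs, out=None) -> None:
    """Write each object as one JSON line to a binary stream (stdout by default)."""
    out = out if out is not None else sys.stdout.buffer
    for obj in objs:
        out.write(dump_json_line(obj))


if __name__ == "__main__":
    write_json_lines([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
```

Writing bytes straight to a binary stream avoids a decode/encode round trip, since orjson already produces bytes.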

rcoup commented 2 weeks ago

orjson has a good reputation for being the quickest.

> 10% "handle system exit" ????

Could that be garbage collection/deallocation at exit, if we build up a lot of dynamic objects?

craigds commented 1 week ago

More profiling, from my local Kart after merging that orjson speedup:

https://gist.github.com/craigds/2b8c9eac356aae8f61744a3cf8ff4c54

craigds commented 1 week ago

msgspec: https://jcristharif.com/msgspec/benchmarks.html#messagepack-serialization

Sounds like msgspec is 1.6x faster than msgpack (for decoding).
msgpack decoding is ~41% of kart diff runtime,
so switching libraries would mean an additional speedup of roughly 1.15x for kart diff,
and presumably a similar speedup for anything else Kart does with tabular datasets, not just diff.
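That estimate is an application of Amdahl's law: if a fraction p of the runtime becomes s times faster, the whole run speeds up by 1 / ((1 - p) + p / s). A small sketch using the figures quoted above (41% and 1.6x, both taken from this thread rather than measured here):

```python
def overall_speedup(fraction: float, component_speedup: float) -> float:
    """Amdahl's law: whole-run speedup when `fraction` of the runtime
    becomes `component_speedup` times faster and the rest is unchanged."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if component_speedup <= 0:
        raise ValueError("component_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


if __name__ == "__main__":
    # 41% of runtime in msgpack decoding, decoding made 1.6x faster.
    print(f"{overall_speedup(0.41, 1.6):.2f}x")  # prints 1.18x
```

With those inputs the formula gives about 1.18x, in the same ballpark as the 1.15x quoted. Even if decoding cost nothing at all, the ceiling would be 1 / 0.59, or about 1.7x, so the remaining stages still dominate.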

rcoup commented 1 week ago

> sounds like msgspec is 1.6x faster than msgpack (for decoding)

40% faster? (0.427s cf 0.799s)

And 70% faster for encoding.

Looks like msgspec does JSON too? So we could potentially ship just msgspec, rather than both it and orjson?