Open craigds opened 2 weeks ago
Here is some rough profiling data from running a large diff in a relatively recent Kart (I think v0.15.0). The numbers would obviously depend somewhat on the system and on exactly what is being diffed:
- 15%: generate the "raw" diff from the ODB using pygit2 and convert it to a `kart.diff_structs.Diff` object (PKs are decoded, but blob data is not yet loaded)
- 32%: load data from ODB blobs and lazily create data for each Delta object
  - includes 16% loading data for old blobs
  - includes 16% loading data for new blobs
- 7%: convert geometry to hex WKB
- 29%: dump JSON
- 10%: "handle system exit" ????
- Total: 93%, which leaves 7% "misc"
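A breakdown like the one above can be gathered with Python's built-in `cProfile` and `pstats`; here is a minimal sketch, where `generate_diff` is a hypothetical stand-in for Kart's actual diff entry point:

```python
import cProfile
import io
import pstats

def generate_diff():
    # Hypothetical stand-in for Kart's diff generation; replace with
    # the real entry point when profiling for real.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
generate_diff()
profiler.disable()

# Sort by cumulative time; percentages like those above can be read off
# by dividing each function's "cumtime" by the total runtime.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
print(stream.getvalue())
```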
Using more threads is certainly worth trying; we could also try a different JSON library, e.g. orjson (https://github.com/ijl/orjson), which has a good reputation for being the quickest.
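Swapping in orjson is mostly a drop-in change, the main difference being that `orjson.dumps` returns `bytes` rather than `str`. A sketch of the swap, with a stdlib fallback since orjson is a third-party dependency:

```python
import json

try:
    import orjson  # third-party; pip install orjson

    def dump_json(obj) -> bytes:
        # orjson.dumps returns bytes directly, avoiding a str -> bytes
        # copy when writing JSONL to a binary stream.
        return orjson.dumps(obj)
except ImportError:
    def dump_json(obj) -> bytes:
        # Stdlib fallback, encoded to match orjson's bytes output.
        return json.dumps(obj).encode("utf-8")

# Hypothetical feature row, just to exercise the encoder.
row = {"id": 1, "geometry": "0101000000...", "name": "example"}
print(dump_json(row))
```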
> 10% "handle system exit" ????

Garbage collection/deallocation, perhaps, if we build up a lot of dynamic objects?
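That hypothesis is cheap to test: time the same object-heavy workload with the cyclic garbage collector enabled and disabled. A sketch, using a hypothetical loop that builds many short-lived Delta-like objects:

```python
import gc
import time

def build_objects(n=200_000):
    # Simulates building many short-lived, Delta-like dynamic objects.
    return [{"old": (i, "x" * 8), "new": (i, "y" * 8)} for i in range(n)]

def timed(enable_gc: bool) -> float:
    gc.collect()
    if enable_gc:
        gc.enable()
    else:
        gc.disable()
    start = time.perf_counter()
    build_objects()
    elapsed = time.perf_counter() - start
    gc.enable()  # always restore the collector
    return elapsed

# If the gap between these two is large, GC pressure is a real factor.
print(f"gc on:  {timed(True):.3f}s")
print(f"gc off: {timed(False):.3f}s")
```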
More profiling, from my local kart after merging that orjson speedup:
https://gist.github.com/craigds/2b8c9eac356aae8f61744a3cf8ff4c54
msgspec: https://jcristharif.com/msgspec/benchmarks.html#messagepack-serialization
Sounds like msgspec is 1.6x faster than msgpack (for decoding).
~47% faster? (0.427s vs 0.799s, i.e. roughly 1.9x)
And 70% faster for encoding.
Looks like msgspec does JSON too? So we could potentially ship it and not need orjson?
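If msgspec were adopted, one library could cover both of the wire formats under discussion. A sketch of that idea, with a stdlib fallback since msgspec is a third-party dependency (the `encode_row` helper is hypothetical):

```python
try:
    import msgspec  # third-party; pip install msgspec

    _json_encoder = msgspec.json.Encoder()
    _msgpack_encoder = msgspec.msgpack.Encoder()

    def encode_row(row, as_json: bool = True) -> bytes:
        # One dependency handles both JSON and MessagePack output.
        encoder = _json_encoder if as_json else _msgpack_encoder
        return encoder.encode(row)
except ImportError:
    import json

    def encode_row(row, as_json: bool = True) -> bytes:
        # Stdlib fallback: JSON only.
        return json.dumps(row).encode("utf-8")

print(encode_row({"id": 1, "change": "update"}))
```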
Generating a large diff using kart is quite slow
For example, this full diff is 2.1 GB and takes 230s for Kart to generate as JSONL, which works out to about 9 MiB/s.
**Describe the solution you'd like**
Ideally these diffs could be made 5-10x as fast.
We should start with some profiling, although I suspect the limiting factor is git ODB access, which we've found difficult to speed up in the past. If that's the case then the most obvious speedup would be using multiple threads to fetch objects from the ODB.
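If ODB access does turn out to be the bottleneck, the multi-threaded fetch could look something like the sketch below, using `concurrent.futures`. This assumes the underlying blob reads (I/O and zlib decompression) spend enough time outside the GIL to overlap usefully; `fetch_blob` is a hypothetical stand-in for a pygit2 ODB read:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_blob(oid: str) -> bytes:
    # Hypothetical stand-in for an ODB read such as repo[oid].data in
    # pygit2; real reads are I/O- and zlib-bound, so threads may help.
    return oid.encode("utf-8") * 2

def fetch_blobs(oids, max_workers: int = 8):
    # pool.map preserves input order, so old/new blob pairs stay aligned
    # even though fetches complete out of order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_blob, oids))

print(fetch_blobs(["abc123", "def456"]))
```

Whether this actually helps would need measuring; if the blob reads hold the GIL, a process pool or batching at a lower level might be needed instead.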