kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.
https://kuzudb.com/

Improve efficiency of merging bulk insertions into the hash index #3403

Closed · benjaminwinger closed this 2 weeks ago

benjaminwinger commented 2 weeks ago

This improves bulk-insert performance by avoiding re-hashing entries multiple times and by using a merge method that scales with the number of entries to insert rather than with the number of slots on disk.
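For illustration, here is a minimal sketch of that merge strategy, not Kuzu's actual code: each key is hashed exactly once and the hash is cached, the new entries are grouped by the on-disk slot they map to, and only the slots that actually receive entries are read and rewritten. All names (`Entry`, `Slot`, `bulkMerge`) are hypothetical, and an in-memory vector stands in for the disk-resident slots.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

struct Entry {
    std::string key;
    uint64_t value;
    uint64_t hash = 0; // cached so the key is hashed exactly once
};

using Slot = std::vector<Entry>; // stand-in for one on-disk slot

void bulkMerge(std::vector<Entry> entries, std::vector<Slot>& diskSlots) {
    // Group incoming entries by the slot their hash maps to. This loop is
    // O(number of entries inserted); no pass over all disk slots is needed.
    std::unordered_map<size_t, std::vector<Entry>> bySlot;
    for (auto& e : entries) {
        e.hash = std::hash<std::string>{}(e.key); // hashed once, up front
        bySlot[e.hash % diskSlots.size()].push_back(std::move(e));
    }
    // Only the slots that actually receive entries are touched; in a real
    // index each iteration would be one slot read followed by one write.
    for (auto& [slotId, newEntries] : bySlot) {
        Slot& slot = diskSlots[slotId];
        slot.insert(slot.end(), std::make_move_iterator(newEntries.begin()),
            std::make_move_iterator(newEntries.end()));
    }
}

int main() {
    std::vector<Slot> diskSlots(1024);
    bulkMerge({{"alice", 1}, {"bob", 2}, {"carol", 3}}, diskSlots);
    std::cout << "slots touched: at most 3 of " << diskSlots.size() << "\n";
}
```

By contrast, a merge that scans every slot on disk costs the same whether one entry or a million entries are inserted, which is why the old approach penalized small copies into large tables.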

The difference is somewhat less than I was expecting for small inserts, and second-copy performance has degraded since I last recorded it in https://github.com/kuzudb/kuzu/issues/2938#issuecomment-2013865169. I suspect this is due, at least in part, to the locking added in #3388.

Copy benchmarks (all are copies into a 60 million node table containing a single integer primary key column):

| Number of nodes inserted | Before | After |
| ---: | ---: | ---: |
| 1 | ~530ms | ~500ms |
| 1,024 | ~590ms | ~530ms |
| 60,000 | ~1,650ms | ~1,400ms |
| 600,000 | ~8,400ms | ~5,200ms |
| 6,000,000 | ~12,400ms | ~7,600ms |
| 60,000,000 | ~21,600ms | ~17,300ms |

Edit: these numbers are for a second copy performed in the same process. Subsequent testing revealed that running the second copy in a new process (tested with kuzu_shell) yields significantly better performance for small copies (e.g. ~20ms instead of ~500ms for a single tuple), but that was also the case before this change.

Copying a single node performs more or less identically to inserting one, so I will work on merging the two implementations in the hash index so that both use the bulk storage. That will be a larger and messier PR, though.
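Continuing the hypothetical sketch above (again, not the actual plan in the codebase), the unification could be as simple as routing a point insert through the same bulk path as a one-element batch, so both code paths share the hashing and slot-merge logic:

```cpp
// Hypothetical continuation of the earlier sketch: a single insert becomes
// a one-element bulk merge, so insert and copy share one implementation.
void insertOne(Entry e, std::vector<Slot>& diskSlots) {
    bulkMerge({std::move(e)}, diskSlots);
}
```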