This improves the performance of the bulk insert by avoiding re-hashing entries multiple times and using a method that scales with the size of the number of entries to insert rather than the number of slots on disk.
The difference is somewhat less than I was expecting for small inserts, and the second copy performance has degraded since the last time I recorded it in https://github.com/kuzudb/kuzu/issues/2938#issuecomment-2013865169. I suspect this is due at least in part to the locking added in #3388.
Copy benchmarks (all are copies into a 60 million node table containing a single integer primary key column):
Number of Nodes inserted
Before
After
1
~530ms
~500ms
1024
~590ms
~530ms
60000
~1650ms
~1400ms
600000
~8400ms
~5200ms
6000000
~12400ms
~7600ms
60000000
~21600ms
~17300ms
Edit: This is when doing the second copy in the same process. Subsequent testing has revealed that running the second copy in a new process (this was done with kuzu_shell) yields significantly better performance for small copies (e.g. ~20ms for a single tuple instead of ~500ms) but that was also the case before this change.
Copying a single node is more or less identical in performance to inserting, so I will work on merging the implementations in the hash index to just use the bulk storage. That will be a larger and messier PR though.
This improves the performance of the bulk insert by avoiding re-hashing entries multiple times and using a method that scales with the size of the number of entries to insert rather than the number of slots on disk.
The difference is somewhat less than I was expecting for small inserts, and the second copy performance has degraded since the last time I recorded it in https://github.com/kuzudb/kuzu/issues/2938#issuecomment-2013865169. I suspect this is due at least in part to the locking added in #3388.
Edit: This is when doing the second copy in the same process. Subsequent testing has revealed that running the second copy in a new process (this was done with kuzu_shell) yields significantly better performance for small copies (e.g. ~20ms for a single tuple instead of ~500ms) but that was also the case before this change.
Copying a single node is more or less identical in performance to inserting, so I will work on merging the implementations in the hash index to just use the bulk storage. That will be a larger and messier PR though.