ssbc / jitdb

A database on top of a log with automatic index generation and maintenance
50 stars 7 forks source link

use binary search to get seq from offset #200

Closed staltz closed 2 years ago

staltz commented 2 years ago

Context

I'm preparing for #199 and one thing I realized is that I'll need a quick way of getting the seq from any given offset, e.g. during compaction there is going to be unshiftedOffset and I'd like to discover unshiftedSeq.

Problem

We don't have a helper function for that and currently we're just doing a linear for-loop to look for the offset.

Another problem is that the "end" of the for-loop is tarr.length, which I think is wrong because the tarr grows much more than the count does (like it doubles in size every time it "grows" but most of it is still unused). Instead we need to use index.count.

Solution

Introduce getSeqFromOffset that does a binary search and uses indexes['seq'].count as the end marker.

This should also have a performance improvement for those cases when the jitdb indexes are many and large but we just need to update the tip of the index. This way we can avoid scanning the whole index in O(n) and can do O(log n). I'm not sure we have any benchmark for this but should be a pretty common thing to happen in production.

github-actions[bot] commented 2 years ago

Benchmark results

Part Speed Heap Change Samples
Count 1 big index (3rd run) 0.31ms ± 0.04ms -5.65 kB ± 15.37 kB 52
Create an index twice concurrently 579.51ms ± 2.25ms 31.17 kB ± 61.11 kB 95
Load core indexes 0.83ms ± 0.01ms 93.75 B ± 183.36 B 9274
Load two indexes concurrently 720.66ms ± 13.28ms -90.18 kB ± 139.83 kB 17
Paginate 10 results 25.18ms ± 1.26ms 9.95 kB ± 18.95 kB 28
Paginate 20000 msgs with pageSize=5 6177.78ms ± 77.45ms -123.13 kB ± 4086.27 kB 5
Paginate 20000 msgs with pageSize=500 594.5ms ± 23.94ms -1.52 kB ± 712.31 kB 20
Query 1 big index (1st run) 811.31ms ± 11.76ms 1.67 kB ± 94.22 kB 67
Query 1 big index (2nd run) 281.26ms ± 5.11ms -30.71 kB ± 29.98 kB 47
Query 3 indexes (1st run) 1137.74ms ± 16.27ms 24.41 kB ± 149 kB 47
Query 3 indexes (2nd run) 216.3ms ± 3.1ms -96.49 kB ± 122.63 kB 54
Query a prefix map (1st run) 308.65ms ± 4.08ms 49.13 kB ± 513.65 kB 20
Query a prefix map (2nd run) 14.32ms ± 0.42ms -3.9 kB ± 16.56 kB 24
staltz commented 2 years ago

Here's previous benchmarks to compare with: https://github.com/ssb-ngi-pointer/jitdb/pull/197

But there's no noticeable difference. But do we have a benchmark to measure this case? (A big index receives a bit of updates)

arj03 commented 2 years ago

This looks like a great optimization. I'm wondering if we can't we use bsb that is already included?

staltz commented 2 years ago

Good point! I switched to using bsb.eq in getSeqFromOffset, and I took a look at the source code, it's basically the same, so that's good. Now the PR is a net negative in lines of code changed. :sunglasses:

I also added another commit use seq index.count instead of index.tarr.length.

github-actions[bot] commented 2 years ago

Benchmark results

Part Speed Heap Change Samples
Count 1 big index (3rd run) 0.41ms ± 0.04ms -3.24 kB ± 15.44 kB 44
Create an index twice concurrently 757.32ms ± 7ms 46.4 kB ± 73.81 kB 72
Load core indexes 1.46ms ± 0.02ms 94.27 B ± 248.4 B 6504
Load two indexes concurrently 709.12ms ± 16.92ms 57.29 kB ± 142.61 kB 17
Paginate 10 results 24.43ms ± 1.05ms 972 B ± 18893.48 B 28
Paginate 20000 msgs with pageSize=5 5856.57ms ± 250.37ms -441.06 kB ± 4406.04 kB 5
Paginate 20000 msgs with pageSize=500 455.57ms ± 8.72ms 8.86 kB ± 669.42 kB 22
Query 1 big index (1st run) 822.93ms ± 8.8ms -5.34 kB ± 84.71 kB 66
Query 1 big index (2nd run) 264.16ms ± 1.51ms 15.44 kB ± 31.24 kB 52
Query 3 indexes (1st run) 854.47ms ± 9.14ms -7.25 kB ± 108.83 kB 64
Query 3 indexes (2nd run) 282.62ms ± 1.95ms -27.61 kB ± 172.07 kB 41
Query a prefix map (1st run) 272.8ms ± 5.3ms 10.09 kB ± 179.72 kB 22
Query a prefix map (2nd run) 16.26ms ± 0.45ms 1.2 kB ± 18.66 kB 21
github-actions[bot] commented 2 years ago

Benchmark results

Part Speed Heap Change Samples
Count 1 big index (3rd run) 0.35ms ± 0.03ms -5.7 kB ± 18.5 kB 51
Create an index twice concurrently 585.93ms ± 16.1ms 1.02 kB ± 52.52 kB 94
Load core indexes 1.21ms ± 0.01ms 98.8 B ± 208.69 B 7478
Load two indexes concurrently 700.31ms ± 15.45ms 90.45 kB ± 100.59 kB 18
Paginate 10 results 37.91ms ± 2.34ms 47.24 kB ± 42.95 kB 22
Paginate 20000 msgs with pageSize=5 6090.69ms ± 60.36ms 2.98 MB ± 2.67 MB 5
Paginate 20000 msgs with pageSize=500 540.77ms ± 27.94ms 565.28 kB ± 582.67 kB 20
Query 1 big index (1st run) 819.2ms ± 7.73ms -33.3 kB ± 75.36 kB 66
Query 1 big index (2nd run) 313.45ms ± 2.57ms -6.5 kB ± 33.52 kB 43
Query 3 indexes (1st run) 1026.27ms ± 9.03ms 37.59 kB ± 137.64 kB 53
Query 3 indexes (2nd run) 226.9ms ± 2.62ms -30 kB ± 201.89 kB 52
Query a prefix map (1st run) 294.02ms ± 4.54ms 28.23 kB ± 44.66 kB 21
Query a prefix map (2nd run) 13.75ms ± 0.55ms -7.06 kB ± 14.91 kB 24