ssbc / jitdb

A database on top of a log with automatic index generation and maintenance

Implement group by #198

Open arj03 opened 2 years ago

arj03 commented 2 years ago

Background

@mixmix was working on adding an option to crut list where you get results ordered by updateTime. We would like to avoid running through all the results (db1 has to do this) in order to display, say, the latest 5 buttcasts. What we really need for this is a group by operator, similar to what you have in a normal database. This is quite general: you could also use it to show the latest post for each feed. I tried modelling it on the seek function. This means we still have to run through results to extract what we need, but since we are working directly on the bipf data, that's not too bad. It can still be combined with regular where filtering, and when used with limit it will stop once enough results are found.
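A rough sketch of the scan described above, using plain JavaScript objects in place of bipf buffers. `groupByLimit`, `keyFn`, and the sample records are all illustrative names, not jitdb's actual API:

```javascript
// Hypothetical sketch of the group-by scan: keep the first result seen per
// group key (results are assumed already sorted, e.g. by updateTime
// descending) and stop early once `limit` distinct groups are filled.
function groupByLimit(records, keyFn, limit) {
  const seen = new Set()
  const out = []
  for (const record of records) {
    const key = keyFn(record) // analogous to a jitdb seek function
    if (key === undefined || seen.has(key)) continue
    seen.add(key)
    out.push(record)
    if (out.length >= limit) break // early exit, like `limit` in jitdb
  }
  return out
}

// Example: latest post per feed, given records sorted newest-first
const msgs = [
  { author: 'a', text: 'a2' },
  { author: 'b', text: 'b1' },
  { author: 'a', text: 'a1' },
]
console.log(groupByLimit(msgs, (m) => m.author, 2))
// → [ { author: 'a', text: 'a2' }, { author: 'b', text: 'b1' } ]
```

The real operator would extract the key straight from the bipf-encoded buffer rather than from a decoded object, which is what makes the full scan tolerable.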

github-actions[bot] commented 2 years ago

Benchmark results

| Part | Speed | Heap Change | Samples |
| --- | --- | --- | --- |
| Count 1 big index (3rd run) | 0.45ms ± 0.2ms | 18.33 kB ± 19.81 kB | 50 |
| Create an index twice concurrently | 593.75ms ± 6.74ms | -33.13 kB ± 56.61 kB | 92 |
| Load core indexes | 1.32ms ± 0.02ms | 97.3 B ± 224.57 B | 6960 |
| Load two indexes concurrently | 730.78ms ± 14.76ms | -23.32 kB ± 266.25 kB | 17 |
| Paginate 10 results | 24.92ms ± 1.02ms | 19.38 kB ± 13.07 kB | 27 |
| Paginate 20000 msgs with pageSize=5 | 5916.4ms ± 150.89ms | 2.61 MB ± 1.91 MB | 5 |
| Paginate 20000 msgs with pageSize=500 | 556.99ms ± 29.87ms | -523.32 kB ± 796.37 kB | 20 |
| Query 1 big index (1st run) | 824.39ms ± 9.06ms | 26.18 kB ± 58.09 kB | 66 |
| Query 1 big index (2nd run) | 257.78ms ± 0.85ms | 67.36 kB ± 51.14 kB | 52 |
| Query 3 indexes (1st run) | 870.22ms ± 8.33ms | 24.72 kB ± 78.52 kB | 63 |
| Query 3 indexes (2nd run) | 210.77ms ± 1.2ms | 36.53 kB ± 168.67 kB | 54 |
| Query a prefix map (1st run) | 255.5ms ± 31.03ms | -16.3 kB ± 248.32 kB | 24 |
| Query a prefix map (2nd run) | 11.68ms ± 1.01ms | 4.39 kB ± 24.52 kB | 21 |
mixmix commented 2 years ago

I don't quite understand the need for this but trust you.

to be clear, the way I get the 5 most recently updated eg podcasts is pull for type podcast (getting updates and roots). then run this through crut.read and throw out the duds. once I have 5 good ones kill the stream.

oh, and we have a pull.unique in there too. is this what groupBy would be helping with?
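The pipeline described here can be imitated without the pull-stream dependency. This self-contained generator sketch stands in for pulling by type, deduplicating (like pull.unique), and killing the stream after 5 good results; all names and the sample records are illustrative, and crut.read's dud-filtering is elided:

```javascript
// Generator-based stand-in for the pull-stream pipeline described above.
// filterType ~ pulling msgs of one type, uniqueBy ~ pull.unique, and
// take ~ killing the stream once enough results are collected.
function* filterType(msgs, type) {
  for (const m of msgs) if (m.type === type) yield m
}

function* uniqueBy(iter, keyFn) {
  const seen = new Set()
  for (const m of iter) {
    const key = keyFn(m)
    if (seen.has(key)) continue
    seen.add(key)
    yield m
  }
}

function take(iter, n) {
  const out = []
  for (const m of iter) {
    out.push(m)
    if (out.length >= n) break // kill the stream
  }
  return out
}

const feed = [
  { type: 'podcast', root: 'p1', text: 'update' },
  { type: 'post', root: 'x', text: 'unrelated' },
  { type: 'podcast', root: 'p2', text: 'root' },
  { type: 'podcast', root: 'p1', text: 'older update' },
]
const podcasts = take(uniqueBy(filterType(feed, 'podcast'), (m) => m.root), 5)
console.log(podcasts.length) // → 2 (one record per podcast root)
```

Because generators are lazy, nothing past the n-th unique record is ever pulled, matching the early-exit behaviour of the stream version.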

arj03 commented 2 years ago

> to be clear, the way I get the 5 most recently updated eg podcasts is pull for type podcast (getting updates and roots). then run this through crut.read and throw out the duds. once I have 5 good ones kill the stream.
>
> oh, and we have a pull.unique in there too. is this what groupBy would be helping with?

Now that you explain it, it is the same algorithm. So right now this is probably only marginally faster, though it may improve the discoverability of these kinds of things.

I do have an idea to use an index for this; then it should be faster. So challenge accepted :-) Changing this to draft.