Benchmark analysis - Githubissues

staltz commented 3 years ago

Not really an issue, I just want some place where I can report my analysis on the benchmark runs.

On my computer npm run benchmark-no-create from #107 reported 8030ms for the initial indexing test.

In a profiler, the big picture looks like this:

Screenshot from 2021-01-02 22-40-00

There are roughly 4 parts there:

Setup: require everything, set up secret-stack plugins, all that stuff (looks like spiky chaos on the left)
Migrate: that intensive and constant workload up until ~3200ms is just the migrate plugin
Base indexing: after some silence, around ~3500ms, db2 begins indexing the base index
Query: around ~7900ms, the JITDB query starts, begins doing createMissingIndexes, and in this process it needs to do a log.stream, and this log.stream performs decrypt and so forth, and then the final spike before the silence is getBitsetForOperation etc

Stats

Setup took approx 350ms
Migrate took approx 2915ms
Base indexing took approx 4390ms
Query took approx 440ms

Ideas

Separate benchmark reports for migration

Migration's duration should be measured separately from base indexing.

Don't strictly require indexing base index before db.query()

If db.query() would only use JITDB, we shouldn't block it from running until base index is done. In the example in benchmark/index.js, we don't use the base index, so we should be getting the db.query results much earlier.

Allow customizing when should `decrypt` run inside `log.stream`

For the db.query in this example, we didn't need to consider private messages. There should be a way of specifying that we don't need decrypt inside log.stream when indexing something in JITDB. In the ideal situation, JITDB would automatically detect whether it needs unboxing, and pass that down as a parameter to log.stream, but a manual opt-in might also be okay.

Push-streams could/should be paused

If we're doing asynchronous work upon receiving an item from a push-stream, we should pause the push-stream until the async work is done, and then resume it. This would use the backpressure capabilities of push-stream. Probably we're not in deep trouble if we don't do this, but depending on the corner case, if we don't do backpressure, we might just accumulate or exhaust the computer's RAM of tasks and incoming data, and this can have adverse effects elsewhere.

cc @arj03

staltz commented 3 years ago

Allow customizing when should decrypt run inside log.stream

I tested this idea, running a db.query() without decrypt takes 350ms (first JITDB run, i.e. creates the core indexes and bitvector index from scratch). With decrypt it takes 1300ms. That's 3.7x heavier.

I have an idea that is basically just pass down a boolean decrypt: true in the operators, and then JITDB will call log.stream({ decrypt: true }) if it detects that any index in the query had decrypt: true. E.g. db.query(and(type('post', {decrypt: false}), etc) basically will pass the opts down to JITDB and JITDB will pass those opts back to ssb-db2/log.js, which then will not run decrypt.

This way we let the query developer choose wisely when to do decryption or not. The default (is it false or true) can be discussed.

cc @arj03

arj03 commented 3 years ago

We can do a flag, but this is a 1 time cost and its shared between the indexes in ssb-db2 and jitdb so in the overall picture I don't think it will matter that much.

arj03 commented 3 years ago

Also will respond more in full tomorrow :)

arj03 commented 3 years ago

Don't strictly require indexing base index before db.query()

Sounds like a good idea. It would still need a log.drain.

I tested this idea, running a db.query() without decrypt

If it will help manyverse I'll be happy to accept such a PR, but I have been testing ssb-browser and I wouldn't use it there.

arj03 commented 3 years ago

Push-streams could/should be paused

Hmm yeah, would be interesting to test. The main thing I have left for the port of ssb-browser to db2 is that when I do an initial sync the browser just spins for quite a while afterwards. Outside the browser things are fine, I still need to see what is going on but I'll keep this in mind. Thanks for mentioned it.

staltz commented 3 years ago

@arj03 Alright, I thought this through after our video call, and here's what I figured out:

Yes it's true that heavyweight decrypt is run only once, and for subsequent queries it should be much lighter. But from the perspective of the user, it would be good if some queries take priority over others, and the results are shown ASAP while less-priority queries are processed in the background. And all this is about initial indexing, which is our mission to speed up, obviously even ssb-db1 performs fine after initial indexing.

Concurrency

So for a query like "give me all votes for this post", the user doesn't need care about private messages, so any work that is not absolutely necessary should not be blocking the query in question. Because we put decrypt inside log.stream, it applies always, regardless of our query. So my point is mostly, how can we avoid doing work that we don't need to do now, although we certainly need to do it soonish. Those workloads are: (1) base index initial indexing, and (2) private.decrypt.

I did a quick check, and if I turn off (1) or (2), then the benchmark looks like this:

Scenario	Duration
`db.query` for 1 post, concurrently with (1) and (2)	2300ms
`db.query` for 1 post, concurrently with (1)	1060ms
`db.query` for 1 post, concurrently with (2)	1800ms
`db.query` for 1 post	430ms

I'm not sure what's the best solution for that, but two ideas I have are:

Priority system

We could assign some queries to have different priority numbers, and then schedule those that have higher number to run first. This could be too complex to build, specially when there exists workers.

Threads (workers)

I am not sure this would 100% work, because it could be our workloads are I/O bound, but because unboxing is CPU heavy, it could help to run the decrypt on log.stream inside a worker, and both browser and Node.js have it. Respectively, WebWorker and worker_thread.

arj03 commented 3 years ago

Would be interesting to check the threads idea.

If you want to persue the decrypt idea it would be natural to do that in the isPublic operator. As long as the implementation is straight forward I don't see a problem adding that.

staltz commented 3 years ago

I don't think I'll try the "decrypt only if the boolean opt is true" idea, not yet. I'll do some experiments with the threads idea, but I'll have to make a Manyverse release today and tomorrow.

staltz commented 3 years ago

I ran this benchmark again (just for my own sake, to keep these notes later for study), but this time on my production mobile Manyverse and building all sorts of indexes (see list below). The durations indicate "time until first content displays".

Scenario	Duration
Initial indexing, concurrently with (1) and (2)	2m32s
Initial indexing, concurrently with (1)	1m44s
Initial indexing, concurrently with (2)	2m09s
Initial indexing	1m20s

base
canDecrypt
encrypted
key.32prefix
meta_.index
seq.index
sequence.index
timestamp.index
value_author.32prefix
value_content_fork.32prefix
value_content_root.32prefix
value_content_type_content.index
value_content_type_post.index

staltz commented 3 years ago

By the way, to analyze memory use, it's useful to run the Chrome inspector for Node.js. There, I used a "heap snapshot" that showed all strings and objects and buffers allocated. Turns out there's approx. 330 MB of TypedArrays, and most of those are cached blocks in aligned-block-file or async-AOL. Makes sense.

arj03 commented 3 years ago

We could try fiddling with this. Have some good benchmarks now so should be easier to see.

staltz commented 3 years ago

You kind of nerd-sniped me (by the way, I work on these alternative things while Manyverse is compiling, or the app is indexing)

"1024 / 1000" means async-AOL with cache size 1024 and aligned-block-file with cache size 1000

Test	1024 / 1000	512 / 500	256 / 250	128 / 125
migration (using ssb-db)	2762ms	2734ms	2740ms	2775ms
migration (alone)	2675ms	2712ms	2674ms	2728ms
mem usage without indexes	434.34 MB = 30.14 MB + etc	400.20 MB = 24.03 MB + etc	368.01 MB = 27.51 MB + etc	355.83 MB = 31.47 MB + etc
initial indexing	2040ms	2098ms	2198ms	2097ms
initial indexing compat	443ms	405ms	640ms	366ms
Two indexes updating concurrently	764ms	786ms	763ms	875ms
key one initial	471ms	477ms	460ms	438ms
key two	1ms	0ms	0ms	1ms
key one again	1ms	1ms	0ms	1ms
latest root posts	281ms	322ms	299ms	286ms
latest posts	13ms	13ms	13ms	22ms
votes one initial	366ms	430ms	395ms	402ms
votes again	1ms	0ms	0ms	0ms
hasRoot	238ms	305ms	279ms	272ms
hasRoot again	0ms	0ms	0ms	0ms
author one posts	238ms	286ms	249ms	272ms
author two posts	61ms	80ms	110ms	120ms
dedicated author one posts	231ms	247ms	287ms	308ms
dedicated author one posts again	0ms	0ms	1ms	0ms
maximum RAM used	521.02 MB = 56.60 MB + etc	525.47 MB = 53.75 MB + etc	399.65 MB = 54.60 MB + etc	385.50 MB = 57.64 MB + etc

arj03 commented 3 years ago

When you remove everything so the old log is just used for streaming, then disabling the cache in aligned-block-file will make a ton of sense. It be both faster and use less memory.

staltz commented 3 years ago

I'm trying to measure disk I/O performance on desktop and on mobile, to compare, and I'm not sure if I did this correctly because Android and Linux Desktop don't have the same tools, but I did:

On desktop fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting (this measures sequential reads) and got 1038 MiB/s

On Android I used an app called Disk Speed / Performance Test and got Read 189 MB/s and Write 34 MB/s.

Not sure if I can trust these numbers, but it seems like mobile is 5.5x slower than desktop. If it takes ~3 sec for desktop to scan the log, it should take 17 sec for mobile to scan it.

staltz commented 3 years ago

DB2 Benchmarks running in Manyverse

Nothing else running. Total msg count: 1015957

A: Part	B: Duration	C: Column B scaled to 100k msgs	D: Latest PR with 100k msgs	C / D
Initial indexing	83529ms	8221ms	2141ms	3.83x
Initial indexing compat	21181ms	2084ms	529ms	3.93x
Two indexes updating concurrently	17155ms	1688ms	1016ms	1.66x
key one initial	14073ms	1385ms	614ms	2.25x
key two	15ms	2ms	0ms	–
key one again	8ms	1ms	0ms	–
latest root posts	11843ms	1165ms	347ms	3.35x
latest posts	519ms	51ms	9ms	5.66x
votes one initial	15168ms	1492ms	504ms	2.96x
votes again	6ms	1ms	1ms	–
hasRoot	11227ms	1105ms	298ms	3.70x
hasRoot again	2ms	0ms	0ms	–
author one posts	12736ms	1253ms	287ms	4.36x
author two posts	5066ms	498ms	55ms	9.05x
dedicated author one posts	11687ms	1150ms	279ms	4.12x
dedicated author one posts again	5ms	1ms	0ms	–

Mobile seems on average 4.07x slower than GitHub actions server. So that's quite realistically close to the "5.5x slower" disc I/O.

staltz commented 3 years ago

On mobile:

ssb:db2:aboutSelf loaded full cache in 799ms

Note a lot, but almost 1 second. I can imagine that not loading the whole thing might speed up startup.

arj03 commented 3 years ago

Hmm yeah, might be one of those things that are hard to benchmark on a computer compared to a phone. Interested to hear how it is without descriptions.

staltz commented 3 years ago

Found out that loading the key__map.32prefixmap index takes a long while (on mobile):

jitdb readFile prefix map /data/user/0/se.manyver/files/.ssb/db2/indexes/key__map.32prefixmap in 1732ms
jitdb parsed prefix map JSON /data/user/0/se.manyver/files/.ssb/db2/indexes/key__map.32prefixmap in 1576ms

3300ms in total, 1st part is loading the atomically-universal, 2nd part is the JSON.parse.

What would the performance look like if we did it in leveldb? Perhaps load time would be smaller, and query time would be larger?

arj03 commented 3 years ago

Hmm. That is a good data point, I'll have to check that to see.

arj03 commented 3 years ago

Maybe we should pair on it tomorrow?

staltz commented 3 years ago

Yes!

staltz commented 3 years ago

Well, I have a PR that adds a leveldb index for that and it seems it speeds up things rather a lot. PR incoming, it requires a jitdb change first, though.

arj03 commented 3 years ago

I wonder if it will be one of those: fs faster in browser, level faster in node ;-)

staltz commented 3 years ago

I found another thing that maybe indicates some opportunity to optimize.

This is what manyverse mobile startup looks like:

Screenshot from 2021-02-16 13-01-28

It takes about 5s in total.

Around 2s, that's when the private post jitdb paginate query ends
In the very middle of the picture, you see a pink spike, that's when a jitdb paginate query for public posts ends

What's funny is that the private posts query is quick but the public posts query is slow:

jitdb:query paginate(AND(meta_private(),value_content_type(post))), seq: 0, limit: 1: 84ms, total messages: 288 
...
...
jitdb:query paginate(AND(meta(),OR(value_content_type(post),value_content_type(contact)))), seq: 0, limit: 1: 1152ms, total messages: 391518

Same story on desktop too:

jitdb:query paginate(AND(meta_private(),value_content_type(post))), seq: 0, limit: 1: 95ms, total messages: 3117
...
...
jitdb:query paginate(AND(meta(),OR(value_content_type(post),value_content_type(contact)))), seq: 0, limit: 1: 375ms, total messages: 516669

@arj03 Any idea what may be causing this?

staltz commented 3 years ago

@arj03 Just for your information, Chrome CPU profiler also annotates the source code with the time it spent in each line of code. This is from manyverse desktop, picturing the new ssb-friends:

Screenshot from 2021-02-16 13-20-01

arj03 commented 3 years ago

I'll have to try the private query to see. That reminds me, I don't think we have a benchmark of that.

staltz commented 3 years ago

I'll have to try the private query to see

Note, it's the public one that's slow. It's even weirder to me, because I'd expect the private one to be slow because it does a decrypt.

arj03 commented 3 years ago

Can reproduce. Let the debugging begin :-)

staltz commented 3 years ago

Good news: Manyverse now with db2 1.17.1 has a startup time of 4s

staltz commented 3 years ago

I'm copy pasting this table from #191 because it's useful and didn't want it to get lost.

initial sync duration	maxCpu	UI responsiveness
305s	86 (maxPause=300)	good
299s	90 (maxPause=300)	good
288s	92 (maxPause=300)	good
222s	92 (maxPause=180)	good
158s	92 (maxPause=90)	a bit bad
281s	95 (maxPause=300)	a bit bad
280s	99 (maxPause=300)	bad
126s	Infinity	bad

arj03 commented 3 years ago

Might be good to include these numbers in some form in the README.md documentation

staltz commented 3 years ago

We can make another .md file for these

staltz commented 3 years ago

What I conclude from maxCpu is that it basically slows down the task duration by 1.7x while providing better user experience. Have to factor that into the 4x speedup that db2 provided, meaning that in the end the db2 is 2.3x faster, but it's more like:

what	speedup	UI responsiveness and device temperature
db1	1x	slow and hot
db2	4x	slow and hot
db2 with maxCpu	2.3x	fast and cold

Temperature on mobile is a bigger deal because the device is in physical contact with the user.

ssbc / ssb-db2

Benchmark analysis #108

Stats

Ideas

Separate benchmark reports for migration

Don't strictly require indexing base index before db.query()

Allow customizing when should `decrypt` run inside `log.stream`

Push-streams could/should be paused

Concurrency

Priority system

Threads (workers)

DB2 Benchmarks running in Manyverse

ssbc / ssb-db2

Benchmark analysis #108

Stats

Ideas

Separate benchmark reports for migration

Don't strictly require indexing base index before db.query()

Allow customizing when should decrypt run inside log.stream

Push-streams could/should be paused

Concurrency

Priority system

Threads (workers)

DB2 Benchmarks running in Manyverse

Allow customizing when should `decrypt` run inside `log.stream`