ssbc / ssb2-discussion-forum

not quite tiny, also not quite large

In-memory SDK #14

Open staltz opened 1 year ago

staltz commented 1 year ago

This is not minibutt-specific, but this repo is like a place to have these discussions. This issue is an SSB2 idea.

ssb-db2 is cool, but some people want SQLite. That's one problem when it comes to having an SDK to create apps.

The other problem is with aggregating indexes: you can't use jitdb, and you have to create a custom leveldb index instead. It's doable, but you have to know what you're doing, and it's not a simple API.

One of the guiding wishes for SSB2 is that log.bipf would never go above 100MB.

With such a small log, it may be realistic to just scan the whole log instead of using indexes.

On the other hand, scanning the log isn't a friendly API either.

Since we have this 100MB guarantee, WHAT IF we just load the whole log.bipf in memory? As one big array. Just not too big. Then you can have nice APIs that are in-memory and for that reason are even synchronous.

Basically, 100MB is a 10x step forward in performance compared to today's 1GB, and a naive in-memory database is a ?x step backward in performance, so the result is pretty okay performance. There are lots of things we can do more easily if we're operating on small-world datasets. Naive code suddenly becomes a lot easier to write and maintain.

This would mean we wouldn't have jitdb nor ssb-db2, just async-append-only-log and some new in-memory query layer on top of it, which loads the whole log into memory (we have to take into consideration that JSON objects in JS take marginally more space than bipf on disk, so more than 100MB in RAM). But it might not be too much RAM. For instance, already today, ssb-friends takes about 30MB of RAM, and that's just ssb-friends. On top of that, there are jitdb structures in memory, leveldb portions in memory, and async-append-only-log blocks cached. We could drop all of that and have a single log that is reflected in memory as an array.

Peers who want a log.bipf larger than 100MB can employ a database such as ssb-db2 or SQLite. We could call that the "default" mode, whereas this in-memory case would be the "lite" mode.

staltz commented 1 year ago

Oh, this is exciting:

If 1GB today contains content since August 2015, and that's 2764 days, then it means that in proportion 100MB can contain content for ~9 months. That's a lot! More than enough.

We could even aim for 64MB, which would be ~5 months of content. Also well enough for our needs.

gpicron commented 1 year ago

The most important thing for the SDK is to have a clear and stable interface definition for the current db2, with no internal shortcuts, starting from the current ssb-db2. Once there is that, one can test various approaches.

staltz commented 1 year ago

I started an experiment where I am coding a prototype of this idea, and benchmarking it against ssb-db2.

I am testing this in-mem db versus db2 on two datasets: one with a 64MB log.bipf, and another with a 100MB log.bipf. I also want to try a 1GB log.bipf, but I'm having technical problems with ssb-db2, and I'm pretty sure ssb-db2 is going to be clearly better than this in-mem db for the 1GB case. Anyway, here are some exciting results:

(UPDATED to include decryption during startup for in-mem db)

| | db2 RAM | db2 indexing | db2 startup | in-mem RAM | in-mem indexing | in-mem startup |
|---|---|---|---|---|---|---|
| 64MB log | 187 MB | 13032 ms | 1025 ms | 163 MB | N/A | 2357 ms |
| 100MB log | 236 MB | 26140 ms | 1748 ms | 220 MB | N/A | 3450 ms |

"in-mem indexing" doesn't exist, because it has no indexes! It just loads up the whole log.bipf as msgs into an array in memory.

Then, queries can be done by simply streaming through the array (for-loop or pull stream) and picking messages that match what we want.

| | db2 64MB log | db2 100MB log | in-mem 64MB log | in-mem 100MB log |
|---|---|---|---|---|
| Query all my posts | 143 ms | 387 ms | 13 ms | 18 ms |
| Search profile names | 2 ms | 1 ms | 27 ms | 33 ms |
| Collect my follow list | 1 ms | 1 ms | 21 ms | 43 ms |
| Get my profile details | 1 ms | 1 ms | 26 ms | 39 ms |
| Collect 100 posts that mention me | 7 ms | 6 ms | 16 ms | 32 ms |
| Calculate storage size of all feeds | 131 ms | 229 ms | 79 ms | 113 ms |

Yes, in-mem queries are an order of magnitude slower, but they don't seem prohibitively slower. And one can always "warm up" some queries by running them beforehand and having the results ready at hand, i.e. in-memory indexing. This will just have a small impact on the startup time, which already isn't that bad.

An example query, to get my profile details:

```javascript
const pull = require('pull-stream');

// Accumulate the latest profile fields from my own 'about' messages
const profile = {};
pull(
  ssb.db.filterBy(
    (msg) =>
      msg?.value?.content?.type === 'about' &&
      msg.value.author === ssb.id &&
      msg.value.content.about === ssb.id,
  ),
  pull.drain((msg) => {
    const {name, description, image} = msg.value.content;
    if (name) profile.name = name;
    if (description) profile.description = description;
    if (image) profile.image = image;
  }),
);
```
gpicron commented 1 year ago

Did you take into account the deserialisation to JS objects in your measurements of db query time?

staltz commented 1 year ago

@gpicron Yes, bipf=>js objects happens at startup, and in total startup takes about 1.5 seconds. After startup, all the messages are JS objects in memory, and no deserialization needs to happen at query time.

staltz commented 1 year ago

Okay, one blindspot this benchmark has is that there is no private message decryption at all for the in-memory db. Of course decryption is going to have an impact on the startup time.

staltz commented 1 year ago

Added decryption. Not too bad startup time! Startup: 2631 ms for the 64MB log. That's basically just +1300ms.

And because startup is based on scanning the log, this is the kind of thing that we can easily add a progress bar for. It's not too bad a UX to wait for a progress bar to fill up in 3 seconds before beginning. PS: on mobile this might be 5 seconds.

staltz commented 1 year ago

Actually, I brought startup down to 1783 ms (64 MB log) because I had a query mistakenly included in the startup time calculation. This is great!

gpicron commented 1 year ago

@staltz I meant: looking at the response times of ssb-db2, I was wondering if they included the deserialisation. Because intuitively it looks too good to include it in the DB case but not in the in-memory case.

What is fun is that I made this kind of in-memory-db-loaded-at-start proposal, and implemented a prototype of it, 3 years ago.

%ZDKxP1Ng/tWWokxKJh4BT8BmpJR8ZWF3UoBHBmcQiKQ=.sha256

https://github.com/gpicron/yaii

But it didn't seem to interest people so much...

staltz commented 1 year ago

@gpicron I didn't have time to explore all the conversations on SSB, so YAII is new to me. But it seems like @arj03 commented on it when you had published it:

> If only someone would write that so that it runs in the browser, seems like that is where you are going with this. (...) So I'm here to cheer you on and are very interested in what you come up with.

I can't speak for him, but based on the dates, you published YAII right before arj made jitdb, so it seems to me that you inspired him. Also, he eventually did manage to put ssb-db2 in the browser: https://github.com/arj03/ssb-browser-core/

staltz commented 1 year ago

Okay, here is some code in case anyone wants to see the evidence: https://github.com/staltz/bench-ssb64

cc @arj03 I'm dying to hear what you think about this approach.

arj03 commented 1 year ago

Looking at the numbers, I'm glad to see that we did something right with db2 :) I'm all for experimenting at this stage, before going down some route, so I cheer you on. This is also why I haven't done any updates to the document for a while. It was a good starting point for discussions and experimentation; later we can check back in and reach some kind of consensus on what kind of system we want to build, given what we have learned.

gpicron commented 1 year ago

> @gpicron I didn't have time to explore all the conversations on SSB, so YAII is new to me. But it seems like @arj03 commented on it when you had published it:
>
> > If only someone would write that so that it runs in the browser, seems like that is where you are going with this. (...) So I'm here to cheer you on and are very interested in what you come up with.
>
> I can't speak for him, but based on the dates, you published YAII right before arj made jitdb, so it seems to me that you inspired him. Also, he eventually did manage to put ssb-db2 in the browser: https://github.com/arj03/ssb-browser-core/

That was not a reproach or a complaint. I just find it fun how ideas appear, disappear, and come back. At that point in time, after implementing the prototype, I concluded that all-in-memory was feasible but that implementing it in JS was not a good idea. But then I had the opportunity to work with SQLite (and the FTS5 extension) and found that you can hack and tune it so easily, without the complexity of dealing with transactions, WAL, storage, and portability, and yet keep the performance, that I totally abandoned the idea of reimplementing a DB from scratch (in memory or not). I had completely forgotten about it.

gpicron commented 1 year ago

Just, my wish, if you are okay with it, is that you define a clear interface in something like TypeScript.