ssbc / jitdb

A database on top of a log with automatic index generation and maintenance

Reindex after or during log compaction #199

Closed staltz closed 2 years ago

staltz commented 2 years ago

async-append-only-log now supports log.compact() https://github.com/ssb-ngi-pointer/async-append-only-log/issues/48 and we should reindex the relevant portions of jitdb indexes. See also https://github.com/ssb-ngi-pointer/ssb-db2/issues/306
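
A minimal sketch of how the compact call might be driven (the constructor options and the callback shape here are assumptions for illustration, not taken from the linked issue):

```js
const Log = require('async-append-only-log')

// Hypothetical setup; the path and blockSize are illustrative only.
const log = Log('/home/user/.ssb/db2/log.bipf', { blockSize: 64 * 1024 })

// Assumed callback-style API for the new compaction feature.
log.compact((err) => {
  if (err) throw err
  // Records may now live at smaller offsets than before, so any jitdb
  // index built against the old offsets is stale and needs reindexing.
})
```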

staltz commented 2 years ago

@arj03 I'm working on this already, but I bumped into a design issue. My original plan is to rebuild all indexes while log compaction is still ongoing, based on what Dominic suggested here:

> Instead of writing a separate second log, superimpose the compacted log on top of the old one, keeping track of the min uncompacted sequence number. When you do a query from the old indexes, drop results with offsets smaller than the uncompacted sequence, and mix with results from the new indexes. Merge index responses by sequence numbers, so numbers smaller than the uncompacted sequence are from the compacted section and numbers greater are from the uncompacted log.

For instance, my plan is to create a folder .ssb/db2/indexes/.compacting where the new index files live, and once those are done catching up with the log, we delete all jitdb indexes in .ssb/db2/indexes and move the files from .ssb/db2/indexes/.compacting up to .ssb/db2/indexes.
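
A rough sketch of that folder swap (the paths follow the plan above; the fs-based implementation and error handling are only illustrative):

```js
const fs = require('fs')
const path = require('path')

// Promote the freshly built indexes once they have caught up with the log.
function promoteCompactingIndexes(indexesDir, cb) {
  const oldDir = indexesDir + '.old'
  try {
    fs.renameSync(indexesDir, oldDir) // stale indexes out of the way
    fs.renameSync(path.join(oldDir, '.compacting'), indexesDir) // move new ones up
    fs.rmSync(oldDir, { recursive: true, force: true }) // drop the stale indexes
    cb()
  } catch (err) {
    cb(err)
  }
}
```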

However, does this go against the JIT design of JITDB?

Alternatively, should we just delete all JITDB indexes once log compaction is done, and let the JIT nature take care of rebuilding them?

arj03 commented 2 years ago

Will get back to you on this tomorrow :)

arj03 commented 2 years ago

For JITDB I don't think we should allow queries while compaction is running. Instead, keep track of the offset from which you started compacting and, once done, call reindex() with that. This way, if you compacted only the end of the log, it should be quite fast. For level indexes in db2 it might make sense to allow lookups for things we know are before compaction. I do believe we should have a "high-level" compact function in db2 that takes care of doing these things, because it also needs to keep track of state while compacting, so that we can continue if the app is shut down mid-compaction.
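
Roughly, the flow being suggested could look like this (assuming jitdb exposes a reindex(offset, cb) entry point; the exact signature is an assumption here):

```js
// Sketch only: compact the log, then reindex jitdb from the first offset
// that could have moved. `firstChangedOffset` would come from the log's
// compaction bookkeeping and is hypothetical here.
function compactAndReindex(log, jitdb, firstChangedOffset, cb) {
  log.compact((err) => {
    if (err) return cb(err)
    // Everything before firstChangedOffset is untouched, so only the tail
    // of each index needs to be recomputed.
    jitdb.reindex(firstChangedOffset, cb)
  })
}
```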

This could be a starting point for the calls to jitdb & level indexes; it's a bit different because here we are running on individual messages.

staltz commented 2 years ago

@arj03 Thanks!

reindex() currently doesn't rebuild the core indexes (seq, sequence, timestamp), is there a reason for that? I think with compaction we would also have to rebuild the core indexes.

Why do you say that level indexes should be allowed to lookup while jitdb shouldn't? Is it because of the speed of rebuilding the indexes? (By the way, in my tests, I felt that some jitdb indexes do take a considerable time to build, like value_author.32prefix, which is 6 MB in a production database, and has to write 4 bytes for each record.)

arj03 commented 2 years ago

> reindex() currently doesn't rebuild the core indexes (seq, sequence, timestamp), is there a reason for that? I think with compaction we would also have to rebuild the core indexes.

Right, that is because for encrypted messages they don't change. Here they do :)

> Why do you say that level indexes should be allowed to lookup while jitdb shouldn't?

Just because it is easier. We don't need a flag for jitdb that says: please don't update the indexes, just run on what you have.

I can maybe check the value_author prefix building later to see if there is anything we can do to optimize that one.

staltz commented 2 years ago

Okay thanks, so I'm preparing for the plan to "not allow queries while compaction is running", and here are some disorganized thoughts (mostly questions to myself):

What happens when a jitdb query is ongoing and compaction starts? Do we cancel the query, or do we pause it and re-run it after compaction (and reindexing) is done? And what if it was a 4th call to paginate? The starting seq might now mean that the 4th paginate returns results that were already in the 3rd paginate, or, worse, skips some results.

This is a bit similar to questions about log.get and log.stream: do we abort it or do we wait for compaction to end and rerun it?

staltz commented 2 years ago

> I can maybe check the value_author prefix building later to see if there is anything we can do to optimize that one.

Just as a reference, here are other (leveldb and jitdb) indexes of similar size (sorted). There are plenty of big jitdb indexes, computing them all and writing to disk isn't fast.

| Index | Size | Type |
| --- | --- | --- |
| search2 | 101 MB | level |
| ebt | 21 MB | level |
| contacts | 12 MB | level |
| timestamp | 12 MB | bitvector |
| value_content_vote_link__map | 6,2 MB | prefix map |
| seq | 6,0 MB | bitvector |
| sequence | 6,0 MB | bitvector |
| value_author | 6,0 MB | prefix |
| fullMentions | 5,1 MB | level |
| hashtags | 5,0 MB | level |
| aboutSelf | 3,0 MB | level |
| value_content_root__map | 3,0 MB | prefix map |
| value_content_contact__map | 2,6 MB | prefix map |
| base | 1,3 MB | level |
| value_content_fork__map | 218 KB | prefix map |
| meta_ | 173 KB | bitvector |
| meta_private_true | 173 KB | bitvector |
| value_author_MYSSBID | 173 KB | bitvector |
| value_contentroot | 173 KB | bitvector |
| value_content_type_contact | 173 KB | bitvector |
| value_content_type_post | 173 KB | bitvector |
| value_content_type_pub | 173 KB | bitvector |
| value_content_type_roomx2Falias | 173 KB | bitvector |
| value_content_type_vote | 173 KB | bitvector |
| canDecrypt | 24 KB | bitvector |
| encrypted | 64 B | bitvector |

staltz commented 2 years ago

> What happens when a jitdb query is ongoing and compaction starts?

Answering myself: if there are ongoing queries (of any kind), postpone compaction, such that compaction always starts when everything else in the database is idle, and then from that point onwards, queue all incoming queries so that they apply only after compaction is done.
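
Very loosely, that logic could be sketched like this (none of it is actual jitdb/ssb-db2 code; the names and the polling interval are made up):

```js
// Wait for in-flight queries to drain, run compaction, and queue any
// queries that arrive in the meantime.
let compacting = false
let activeQueries = 0
const waitingQueries = []

function runQuery(queryFn, cb) {
  if (compacting) {
    // Compaction in progress: park the query until it finishes.
    waitingQueries.push(() => runQuery(queryFn, cb))
    return
  }
  activeQueries += 1
  queryFn((err, result) => {
    activeQueries -= 1
    cb(err, result)
  })
}

function compactWhenIdle(doCompact, done) {
  const tryStart = () => {
    if (activeQueries > 0) return setTimeout(tryStart, 100) // poll until idle
    compacting = true
    doCompact((err) => {
      compacting = false
      // Release the queries that arrived while compaction was running.
      while (waitingQueries.length > 0) waitingQueries.shift()()
      done(err)
    })
  }
  tryStart()
}
```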

arj03 commented 2 years ago

Yep, agree. Also, for most of these indexes you wouldn't need to do a full rebuild, only from compaction start onwards.

arj03 commented 2 years ago

> I felt that some jitdb indexes do take a considerable time to build, like value_author.32prefix, which is 6 MB in a production database, and has to write 4 bytes for each record.

I had to check this. Building author or type takes around the same amount of time (40s). Note this is from a totally empty indexes folder, so it includes keys and base. Rebuilding author after type is 6s. Rebuilding both is 45s, so a tiny bit less. I tried disabling writing the jitdb indexes and it only brought the rebuild from 45s down to 43.2s. Decrypting is still very heavy, as documented here, at around 10s of overhead. If I leave the canDecrypt file and the level indexes, the time to build both goes down to 18.7s.

staltz commented 2 years ago

Super weird bug with GitHub where I can't edit the original post TODO list, so I'm copying it down here:

staltz commented 2 years ago

@arj03 I think I hit a pretty sad issue: it seems like we have to reset all leveldb indexes after compaction happens, because most leveldb indexes hold state, and we don't know how to undo that state only for the deleted records.

Consider e.g. the about-self index which has key/value feedId => hydratedAboutObj. Suppose we deleted an about message with the name field, but we still have an about message with the image field. There's nothing in this index that tells us how to "partially" delete the hydratedAboutObj, so either we get the wrong reduced state, or we have to do a full rebuild.
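
To make the problem concrete, here is a tiny illustration of such a reduce; the reducer and message shapes are made up for the example:

```js
// Later fields overwrite earlier ones; the previous values are lost.
function reduceAbout(hydrated, aboutMsg) {
  return Object.assign({}, hydrated, aboutMsg.content)
}

let state = {}
state = reduceAbout(state, { content: { about: '@alice', name: 'Alice' } })
state = reduceAbout(state, { content: { about: '@alice', image: '&blob1' } })
// state is now { about: '@alice', name: 'Alice', image: '&blob1' }

// If compaction deletes the first message, the index still says name: 'Alice',
// and there is no record of what (if anything) to roll back to, so the only
// safe option is to rebuild the index from scratch.
```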

Any ideas about this?

arj03 commented 2 years ago

@staltz right, reduce-based indexes versus pure ones. I think we could introduce that abstraction, and then you'd only have to do a full reindex on the reduce-based ones. Most base indexes are not reduce-based.

staltz commented 2 years ago

Yeah, we could do that split. But it might still be hard to find and remove the outdated entries. Take the EBT index: I think it's [feedId, sequence] => offset, so we'd need to look up the old entries via the right-hand side, i.e. the value, not the key.
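
In other words, with offsets only on the value side, finding the stale entries means scanning every key. A sketch, assuming a modern level instance with async iterators and the [feedId, sequence] => offset layout described above (the value encoding is an assumption):

```js
// Collect keys whose stored offset points at a deleted record.
async function findStaleEntries(ebtLevel, deletedOffsets) {
  const stale = []
  for await (const [key, value] of ebtLevel.iterator()) {
    // Assumes offsets are stored as plain numbers or numeric strings.
    if (deletedOffsets.has(Number(value))) stale.push(key)
  }
  return stale
}
```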

arj03 commented 2 years ago

Oh right, you are correct. I guess there isn't any other good way than to reindex everything in this case :(