@arj03 Do you have any thoughts on how to persist the minimum uncompacted offset?

1. Inside the log file (`log.bipf`) itself?
2. In a separate `log.bipf.compaction` file? (temporary tiny file while the compaction is WIP)

Idea (2) is easier to do as a non-breaking change, but we probably need atomicity, so we probably have to write both the compaction metadata and the new log state in one go, so maybe we have to do idea (1)?
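To make idea (2) concrete, here is a minimal sketch; the 4-byte layout and the write-then-rename trick are assumptions for illustration, not anything decided yet:

```js
const fs = require('fs')

// Hypothetical layout for the temporary `log.bipf.compaction` file:
// a single little-endian uint32 holding the minimum uncompacted offset.
// The file exists only while a compaction is in progress.
function saveCompactionState(filename, minUncompactedOffset, cb) {
  const buf = Buffer.alloc(4)
  buf.writeUInt32LE(minUncompactedOffset, 0)
  // Write to a temp file and rename, so the state file is never half-written
  fs.writeFile(filename + '.tmp', buf, (err) => {
    if (err) return cb(err)
    fs.rename(filename + '.tmp', filename, cb)
  })
}

function loadCompactionState(filename, cb) {
  fs.readFile(filename, (err, buf) => {
    if (err) return cb(null, null) // no file: no compaction was in progress
    cb(null, buf.readUInt32LE(0))
  })
}
```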
Algorithm so far. FUB means "first uncompacted block (index)"; I thought it made sense to compact one block at a time, since we write an entire block at once. There is also another variable here (I don't know what the best name is yet): "first record in the uncompacted part of the log", which may be in any later block, not necessarily in the FUB.
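Roughly, a sketch of that per-block pass, in a toy model where blocks are arrays and a deleted record is `null`; `fub` is the FUB and `fur` is that not-yet-named variable, and everything else here is made up for illustration:

```js
// Toy model: the log is an array of fixed-size blocks, each block an array
// of records; a deleted record is represented as null (zeroed bytes in AAOL).
const BLOCK_SIZE = 4

// Compact a single block: fill the block at index `fub` with the next
// non-deleted records, scanning forward from record index `fur`.
function compactOneBlock(blocks, fub, fur) {
  const out = []
  while (out.length < BLOCK_SIZE && fur < blocks.length * BLOCK_SIZE) {
    const rec = blocks[Math.floor(fur / BLOCK_SIZE)][fur % BLOCK_SIZE]
    if (rec !== null && rec !== undefined) out.push(rec)
    fur += 1
  }
  blocks[fub] = out // an entire block is written at once
  return { fub: fub + 1, fur }
}

// Example: two holes in block 0 get filled by records shifted from block 1.
const blocks = [['a', null, 'b', null], ['c', 'd', 'e', 'f']]
console.log(compactOneBlock(blocks, 0, 0), blocks[0]) // { fub: 1, fur: 6 } ['a','b','c','d']
```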
I'll get back to you on this in about an hour
I like the idea of a separate file best. This is similar to a "log" in a database, so you basically use it to keep track of what you are doing. You might have to redo a block an extra time (in the case of a crash), but that should be okay, since blocks are rather small. I also like the idea of "dirty blocks" that need to be reindexed, though I'm not entirely sure how it compares to just keeping track of the uncompacted offset. With these dirty blocks you only need to do something with the compaction blocks, and maybe the tip of the log if you move a whole block back. It might also be a good idea to stop replication (writes) while you compact, to make it easier to reason about. The blocks are rather small, so my hunch is that compaction after deleting a feed should be quite fast (< 5 secs).
> I like the idea of a separate file best.
What about atomicity? Say we save the log before we save "first record in the uncompacted part of the log" (which I decided to call `firstUnshiftedRecord`, by the way) and then it crashes. When it loads up again, that record (say, "world" in the picture above) would be duplicated, but AAOL wouldn't know that "world" has already been copied, and may end up copying it a 3rd time.
An important detail here is that AAOL doesn't have a notion of msg "keys" and allows any kind of buffers, so it's perfectly acceptable for an AAOL log to be `foo, foo, ___, foo, foo` (where `___` is a deleted record, and `foo` is `<Buffer 66 6f 6f>`). So we can't check from the log whether we have copied a record already or not.
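To make that concrete, a toy illustration (not AAOL code):

```js
// Records are plain buffers with no identity attached
const log = [
  Buffer.from('foo'),
  Buffer.from('foo'),
  null, // a deleted record (zero bytes on disk)
  Buffer.from('foo'),
  Buffer.from('foo'),
]
// Byte-equality cannot tell "already copied" from "not yet copied":
console.log(log[0].equals(log[3])) // true
```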
> I also like the idea of "dirty blocks" that need to be reindexed.

Can you explain what you mean by "dirty block"? It sounds like you have an idea behind the word, but it's not clear to me.
> What about atomicity?

The log needs to be written in such a way that you can always restart it. So maybe something like: identify the blocks to be fixed (this is what I mean by "dirty blocks") and write those to the log, maybe with the lowest sequence. Then start fixing the blocks, either one at a time or all of them. At this point the indexes should be treated as "needs to be reindexed from the lowest sequence and forward". After the blocks are fixed, either write to the log that they are fixed, or write a new atomic commit with the latest state: indexes need to be rerun from the lowest seq, but no blocks need to be fixed. And after all the indexes have run (or maybe have 1 sequence for each index), write that those have been fixed.

This way, no matter where you crash, you can always restart. You might do some extra work, like reindexing an extra time or redoing some blocks an extra time. That is okay, as long as everything you do is atomic. Because compaction can involve multiple blocks, while you are in the "fix blocks" state (according to the log), you always need to finish that before you can do anything else, including reindexing.
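Sketched as a state machine, that could look like the following; the state names, fields, and helpers are hypothetical, just to show the shape:

```js
// Hypothetical WAL states. Whichever state we crash in, startup re-enters
// that state and redoes its work, which is safe because each step is idempotent.
const State = {
  FIXING_BLOCKS: 'fixing-blocks', // dirty blocks listed in the WAL must be rewritten
  REINDEXING: 'reindexing',       // blocks are fixed; indexes rerun from lowest seq
  DONE: 'done',                   // the WAL can be deleted
}

// Stubs standing in for the real work
function fixBlocks(dirtyBlocks) { /* rewrite the listed blocks */ }
function reindexFrom(seq) { /* rerun all indexes from seq onwards */ }
function commit(wal, patch) { Object.assign(wal, patch) /* one atomic write */ }

function resume(wal) {
  switch (wal.state) {
    case State.FIXING_BLOCKS:
      fixBlocks(wal.dirtyBlocks) // safe to redo a block after a crash
      commit(wal, { state: State.REINDEXING })
    // fall through: after the blocks are fixed, reindexing is always next
    case State.REINDEXING:
      reindexFrom(wal.lowestSeq)
      commit(wal, { state: State.DONE })
  }
}
```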
Oh, so you mean `log.bipf.compaction` should itself be a log of actions to take? That's interesting. Yeah, and I can see how it would help with atomicity. It should probably be a WAL, so that it's the first thing written to disk.
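For example, something like this write-ahead ordering (hypothetical helper, just to show the sequencing):

```js
const fs = require('fs')

// Persist the intent before touching the log itself. `doWrite` stands in for
// the actual block rewrite; a crash after the fsync is fine, because on
// restart the WAL replays this step.
function compactBlockWithWAL(walFd, blockIndex, doWrite) {
  fs.writeSync(walFd, JSON.stringify({ action: 'fix-block', blockIndex }) + '\n')
  fs.fsyncSync(walFd) // the intent is durable before the log is modified
  doWrite(blockIndex)
}
```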
By the way, I'm not thinking about indexes yet, because they're in JITDB, but it's good to consider already now what mechanism to use to signal JITDB.
Exactly. Might help to think of it in terms of a state machine with a log of actions as you write.
This is going to be a moving target, as I explore solutions for this.
The basic premise is: once there have been a couple of deletes (i.e. overwrites with zero bytes), we can produce a new log that has all the real records, minus the zero bytes.
The algorithm is an idea Dominic wrote here: https://github.com/ssb-ngi-pointer/ssb-db2/issues/306#issuecomment-1034479270
- `log.bipf.compaction` as WAL, existing only during compactions
- `firstUncompactedBlock`: first block that may have holes and needs compaction
- `firstUnshiftedRecord`: first record that has still not been "moved left" during compaction
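Putting the three pieces above together, a hypothetical startup check could look like this; the file handling and helper are assumptions, only the names follow the list above:

```js
const fs = require('fs')

// If `log.bipf.compaction` exists, a compaction was interrupted, so resume it
// from the persisted pointers before opening the log normally.
function openLog(logPath) {
  const walPath = logPath + '.compaction'
  if (fs.existsSync(walPath)) {
    const wal = JSON.parse(fs.readFileSync(walPath, 'utf8'))
    resumeCompaction(wal.firstUncompactedBlock, wal.firstUnshiftedRecord)
    fs.unlinkSync(walPath) // compaction done; the WAL is no longer needed
  }
  // ...then open the (now hole-free) log as usual
}

// Stub: run the per-block pass until firstUnshiftedRecord reaches the end
function resumeCompaction(firstUncompactedBlock, firstUnshiftedRecord) {}
```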