It should be possible to create a special entry that indicates "delete a range", which causes that range to simply not be read.

The special entry can be compacted away.

This will be very difficult.

Rationale

Sometimes you have a lot of records that you want to delete. Right now, you can delete them with a strategic compact --gegnum. For example, I often do compact -M --gegnum 'awk { if ($1 ~ /^prefix/) { } else { print } }' to delete all the records with keys that begin with prefix. This awk-based command can also filter records based on time and date. However, this is slow (linear time). You have to wait for the entire database to be scanned and rewritten before the deletion is complete.

This proposal instead suggests a different kind of record that marks a range to be deleted. This would be a small amount of data that instructs the reader to "ignore these records". That record is written in constant time and quickly, it essentially takes effect immediately and still conforms to all the transactional expectations.

Implementation

I propose a record with a format string of simply "\u007F", (i.e., the "DELETE" character).

The code in merge.rs would have special behavior when it sees the DELETE record; it holds it and applies its effect to each record that meets the criteria encoded in that DELETE record. When it can prove that no more following record can possibly meet the criteria of deletion, it can remove it. More than one deletions can be in effect at any given time, and so it may need to apply and hold multiple deletion operations.

To prevent complexities about having to read-ahead in a transaction file (or in that segment), a DELETE record must be the only record in both its segment and its file-of-segments.

A DELETE record has a key that represents the first key it can affect. This means that we don't need to make any change to existing code to be able to start processing DELETE commands at the right time. We want to be able to delete wildcards, so this is just the wildcard's prefix
The format is simply "\u007f"
The first 8 bytes represent the first timestamp to delete. Storing 0 there obviously means "Start at the beginning of time"
The following 8 bytes represent the last timestamp (inclusive) to delete. Of course you can store u64::MAX there.
To enable wildcard deletions, store the actual wildcard-syntax string, using varint-string encoding with the actual % symbol. An empty string indicates "don't use this field".
Also, to enable range deletions, we can store the last key that we want to affect, using the varint-string encoding like we use for everything else. This can be an empty string to indicate "don't use this field". The last key to delete can therefor never be actually empty to mean "empty", because the empty string always comes lexicographically prior to any other key; so we may never write those.

Here you see that there are 5 distinct fields that can be used to do deletions, and they must all simultaneously complied with:

The record's actual key representing "first key"
First timestamp
Last timestamp
Key wildcard
Last key

All of these fields are also possible to, somehow, mark as "we don't actually delete on this criteria".

When we are reading a database and processing a deletion, only the "last key" and "key wildcard" fields can actually have the effect of causing the Merge code to forget about the deletion.

When a major compaction is done, the DELETE command is applied, and then the actual DELETE record is removed in the final task. However when a minor compaction is performed, the DELETE record, and therefor its transaction file, must be retained.

API

The CreateTx struct shall be extended with a delete function that accepts the parameters specifying the criteria for deleted records.

All of the reading functions in DatabaseReader will transparently apply the deletions and never generate the record representing the deletion, except under the circumstance where the DatabaseReader is being used while in the process of a minor compaction (as the record must be retained). I don't know how that mode should be indicated, maybe when include_main_db==false?

All of the internal structs, such as in key_reader.rs will necessarily need to output the DELETE records.

User Interface

In the sonnerie CLI tool, the command can be called delete and has the same parameters as read. It produces no other output other than explaining why it may have failed (which can only be due to impossible constraints like --first-key came after --last-key.

Future Possibilities

I do not propose a way to delete records based on their value. That could be possible one day. Or maybe that's better to do as part of a compaction and just add a special UI for that that's a little easier than awk.

Per this proposal, a deletion must be the one and only record in a transaction file, that means that trying to perform a delete after any record has been added is invalid. Instead, we could have it rewrite the transaction on the fly (i.e., in linear time), while having the first record in the transaction be the DELETE so it can still be applied to previous, already-committed transactions.

njaard / sonnerie

Allow realtime erase #12

Rationale

Implementation

API

User Interface

Future Possibilities