palantir / atlasdb

Transactional Distributed Database Layer
https://palantir.github.io/atlasdb/
Apache License 2.0

sweep leaves sentinels and tombstones forever #2398

Open nziebart opened 6 years ago

nziebart commented 6 years ago

Currently, every table defaults to SweepStrategy.CONSERVATIVE. This strategy leaves behind at least two values for every cell swept:

  1. a sweep sentinel value (used as a marker for readOnly transactions to know that the value they want has been swept)
  2. the latest committed value (even if it's a tombstone)

Note that leaving the latest value is essentially necessary if we leave sentinels; otherwise, every query for the cell would return the sentinel and be forced to abort/retry.
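
To make the cost concrete, here's a toy sketch (plain Java, not AtlasDB code; the map-of-versions representation and the timestamps are purely illustrative) of what sweeping a single deleted cell leaves behind under CONSERVATIVE:

import java.util.TreeMap;

// Toy model only: versions of one cell, keyed by start timestamp.
public class ConservativeSweepSketch {
    static final long SENTINEL_TS = -1L;           // sweep sentinels live at timestamp -1
    static final String SENTINEL = "<sentinel>";
    static final String TOMBSTONE = "";            // a delete writes an empty value

    public static void main(String[] args) {
        TreeMap<Long, String> versions = new TreeMap<>();
        versions.put(10L, "a");
        versions.put(20L, "b");
        versions.put(40L, TOMBSTONE);              // the cell was deleted at ts 40

        long sweepTs = 50L;                        // sweep everything strictly below this
        long latestCommitted = versions.headMap(sweepTs).lastKey();

        versions.headMap(latestCommitted).clear(); // older versions are removed...
        versions.put(SENTINEL_TS, SENTINEL);       // ...but a sentinel is written in their place

        // Prints {-1=<sentinel>, 40=}: the sentinel and the tombstone stay around forever.
        System.out.println(versions);
    }
}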

This behavior is "correct" because it will never violate snapshot isolation guarantees (a read transaction that started at time t will always either get a fully consistent snapshot as of t, or read a sweep sentinel and throw an error). If we were to remove the sentinel, it's possible that some super old read transaction could try to read a cell and mistakenly conclude that the value is empty, when in reality the value visible at its start timestamp has been swept.

I think we need to strike a balance here between infallible correctness and actually being able to clean up old tombstones. For example, a well-behaved app client that very rarely deletes cells but still uses range scans would see those scans get slower and slower over time, without bound, due to the number of tombstones/sentinels they have to read.

One idea would be to enforce a time limit on readOnly transactions (i.e., throw an error if a transaction has been open for longer than the limit), and allow sentinels and tombstones to be deleted after that limit.

(On another note, I'm not sure if we currently guarantee correctness when using SweepStrategy.THOROUGH. Using it disallows readOnly transactions, but still allows read/write transactions which might not actually do any writes, and those are effectively the same as readOnly transactions if their immutableTs lock is lost.)

nziebart commented 6 years ago

@fsamuel-bs @gbonik do you guys have insights on the last paragraph in the issue? (re SweepStrategy.THOROUGH)

gbonik commented 6 years ago

Seems like we don't actually guarantee correctness for THOROUGH (SnapshotTransaction.java):

private void commitWrites(TransactionService transactionService) {
        if (!hasWrites()) {
            // no writes: return before the commit protocol (and its immutableTs lock check) runs
            return;
        }

We could probably mitigate it by checking the immutableTs lock even when we have no writes, so that we throw at the end of the transaction block. This, of course, still doesn't save us from application code that modifies some non-local mutable data structure within a transaction (we'd need to check the lock after every read, I guess).
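
For illustration, the mitigation could look roughly like this (throwIfImmutableTsLockIsLost() is a made-up name standing in for whatever lock validation we already do on the write path):

private void commitWrites(TransactionService transactionService) {
        if (!hasWrites()) {
            // Even with no writes, fail loudly if our immutableTs lock was lost, so the
            // transaction block throws at the end instead of silently returning stale reads.
            throwIfImmutableTsLockIsLost();
            return;
        }
        // ... existing commit path for transactions that do have writes ...
}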

nziebart commented 6 years ago

cool so some options here might be:

  1. Allow setting a "gc grace period" on a table for CONSERVATIVE sweep. If a transaction remains open for longer than this and has read from the table, throw an exception (technically not correct as it depends on clocks being accurate, though we could probably invent a way to rely on nanoTime rather than wallclock time)

  2. Force write transactions to re-check the immutableTs lock at the end even if they haven't done writes. Essentially just a minor fix to the current way we use the THOROUGH strategy. Not the greatest as it requires clients to have special knowledge about this, and forces use of write transactions, which have a small perf overhead from locking the immutableTs (though the overhead could be minimized by having multiple transactions share the same lock)

I'm not as concerned about users modifying data in the middle of a write transaction - that is bad practice and there's not a whole lot we can do to prevent it.

gbonik commented 6 years ago

I'm not as concerned about users modifying data in the middle of a write transaction - that is bad practice and there's not a whole lot we can do to prevent it.

I agree we can't do much to prevent it (other than checking the lock after every single read), but a user might not realize the problem exists. They could start with a read-only transaction, then realize that THOROUGH forces them to use a "regular" transaction, so they switch from runTaskReadOnly() to runTaskThrowOnConflict(), but still treat it as a read-only transaction. It's just counter-intuitive that we throw on commit rather than on read. You get some data back, but only later does it turn out it wasn't valid.

nziebart commented 6 years ago

Maybe @carrino has some thoughts here as well.

Maybe it actually isn't the worst thing to make every transaction lock an immutable timestamp. The overhead could be reduced to one call to check that it's still valid, as we could share the lock across multiple short transactions.

This would allow sweep/scrub to clean up tombstones very efficiently.

nziebart commented 6 years ago

After thinking about this more, I think we need to have some grace period after which we GC sentinels and tombstones.

Read transactions can check how much time has elapsed using nanoTime after every read, and throw if they exceed the grace period.

I'd really like to use wall-clock time here (i.e. the punch table) and just make the grace period sufficiently large. But, if we really want to protect against wall-clock skew, we can implement a time service which writes time intervals to the KVS using nanoTime. This will result in a measurement of time that almost certainly underestimates the "real" elapsed time, but does not overestimate it.
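
As a rough sketch of the per-read check (names here are hypothetical, not a proposed API):

import java.time.Duration;

// Tracks elapsed time with nanoTime and aborts a read transaction once the GC
// grace period is exceeded, since sentinels/tombstones it needs may be gone.
final class GracePeriodGuard {
    private final long startNanos = System.nanoTime();
    private final Duration gracePeriod;

    GracePeriodGuard(Duration gracePeriod) {
        this.gracePeriod = gracePeriod;
    }

    // Call after every read.
    void checkAfterRead() {
        long elapsedNanos = System.nanoTime() - startNanos;
        if (elapsedNanos > gracePeriod.toNanos()) {
            throw new IllegalStateException(
                    "Read transaction exceeded the GC grace period; values it needs may have been swept");
        }
    }
}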

nziebart commented 6 years ago

An alternative here would be to have a lock, similar to the immutable timestamp lock, but likely distinct, which all read transactions acquire. We can then know for sure the age of the oldest read transaction.

This allows us not only to sweep thoroughly (i.e., remove sentinels and tombstones), but also to sweep very quickly after a value has been written. For example, if there are no other active transactions, we could actually sweep immediately after writing.

This has two drawbacks, which I think can be solved:

  1. It adds extra overhead to read transactions. However, we can actually share the lock across multiple transactions, bringing the average overhead down to just 1 RPC to timelock to check that the lock is valid.

  2. It will result in read transactions retrying if the lock is expired at the end. This means that long-running reads have a higher chance of having to retry, and requires that your read transaction be idempotent.

We can make this feature opt-in due to the above drawbacks, but for most users I think these aren't very impactful.
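
Roughly, the opt-in version might be shaped like this (entirely hypothetical interfaces standing in for the timelock interaction):

import java.util.function.Supplier;

// A lease shared by many short read transactions: the per-transaction cost is a
// single validity check rather than a full acquire/release of a lock.
interface ReadLockLease {
    boolean refreshStillValid();   // one RPC to timelock
    long lockedTimestamp();        // sweep must not clean up past the oldest of these
}

final class SharedReadLockRunner {
    private final ReadLockLease lease;

    SharedReadLockRunner(ReadLockLease lease) {
        this.lease = lease;
    }

    <T> T runReadOnly(Supplier<T> task) {
        T result = task.get();
        // If the lease expired, values we read may already have been swept, so we throw
        // instead of returning a possibly-invalid result; the task must be idempotent
        // because the caller is expected to retry it.
        if (!lease.refreshStillValid()) {
            throw new IllegalStateException("Shared read lock expired; retry the read-only task");
        }
        return result;
    }
}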

gbonik commented 6 years ago

I think one problem here is aggressive hard delete - it needs to wipe the values "right now" regardless of existing read transactions. Currently, for each wiped cell it writes a sentinel that will invalidate any read transaction that encounters it. If we don't write sentinels, we lose that mechanism. I guess we can still write sentinels when hard-deleting.

nziebart commented 6 years ago

Yeah we could continue to write the sentinels in cases where we don't want to wait for transactions to finish. The proposal here would allow us to sweep these sentinels async, once the transactions did finish.

gbonik commented 6 years ago

How would you tell if a sentinel can be removed safely? (A sentinel is a value written at transaction timestamp -1.)

nziebart commented 6 years ago

Hmm, I guess in the current conservative sweep world, you always have a sentinel + a latest committed value. In that case you can remove the sentinel when the oldest reader is past the timestamp of the latest committed value.
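
In other words, something like this (hypothetical names, just restating the condition above):

// A sentinel (written at ts -1) can be GC'd once no open reader could still need the
// versions sweep removed, i.e. once the oldest open reader started after the latest
// committed value for that cell was written.
boolean canRemoveSentinel(long oldestOpenReadTimestamp, long latestCommittedValueTimestamp) {
    return oldestOpenReadTimestamp > latestCommittedValueTimestamp;
}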

But if hard-delete ONLY writes a sentinel, we'd need to ensure that all transactions are past the timestamp of the value that was deleted and necessitated the sentinel. And we don't know that timestamp.

So we'd either need to write a sentinel + a tombstone (do we do this already?), or sweep would have to do some weird logic where it gets a fresh timestamp and waits for the oldest reader to progress past it.