martinsumner / leveled

A pure Erlang Key/Value store - based on an LSM-tree, optimised for HEAD requests
Apache License 2.0

Slow queries when index entries deleted #311

Closed: martinsumner closed this issue 4 years ago

martinsumner commented 4 years ago

If we have a large leveled store, add a lot of index entries, and then delete those index entries, the time taken to conclude an index query will be proportional to the number of removed entries, not the number of active entries.

This is because, in leveled, tombstones in the ledger are maintained as the ledger is compacted, until the tombstone reaches the basement. When it reaches the basement, the tombstone will be reaped, as it is by then clear that there are no further objects to be erased.

So suppose we add {Key1, V1}, then alter it by adding {Key1, V2}, and then delete it with {Key1, tomb}.

The tomb must be maintained in the store even after it has merged through {Key1, V2}, so as not to resurrect {Key1, V1}. Only when a tomb is merged into the basement level do we know that it has no more victims to kill and can be erased, i.e. the tomb will not be written to the basement level.
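As a rough sketch of that rule (hypothetical module and function names, not the actual leveled merge code):

```erlang
%% Sketch only: hypothetical names, not the actual leveled merge code.
-module(merge_sketch).
-export([merge_key/3]).

%% Decide what to emit when a newer and an older entry for the same
%% key meet during a merge.  AtBasement is true when the merge output
%% is being written at the basement level.
merge_key({Key, tomb}, {Key, _Older}, true) ->
    %% At the basement nothing older can remain below, so the
    %% tombstone has no more victims and need not be written out.
    [];
merge_key({Key, tomb}, {Key, _Older}, false) ->
    %% Above the basement the tombstone must be retained: an even
    %% older version (e.g. {Key1, V1}) may still exist further down.
    [{Key, tomb}];
merge_key({Key, Newer}, {Key, _Older}, _AtBasement) ->
    %% Ordinary update: the newer value supersedes the older one.
    [{Key, Newer}].
```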

There may be certain workloads that result in very infrequent merges to the basement, and hence in large files full of tombstones. Queries over the affected ranges must repeatedly scan all of those tombstones only to discover that they contribute no results, so a query may take a long time to return even when there are very few actual results.
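A toy illustration of why the scan cost follows the tombstone count rather than the result count (assumed representation, not leveled's actual fold):

```erlang
%% Sketch: every entry in the range must be visited, tombstone or not;
%% the tombstones are scanned and then discarded, contributing latency
%% but no results.
-module(scan_sketch).
-export([range_query/1]).

range_query(EntriesInRange) ->
    lists:foldr(
      fun({_Key, tomb}, Acc) -> Acc;                 %% scanned, skipped
         ({Key, Value}, Acc) -> [{Key, Value} | Acc] %% real result
      end,
      [], EntriesInRange).
```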

martinsumner commented 4 years ago

Within basho leveldb this situation could also arise, but it is more than likely resolved via grooming compactions, which proactively push files with large numbers of tombstones towards the basement.

https://github.com/basho/leveldb/wiki/Mv-aggressive-delete

martinsumner commented 4 years ago

There may be an additional problem related to the switching of files. When a file in Level Basement + 1 is picked to be merged, if it doesn't overlap with any file in Level Basement, then it will be switched to the Basement rather than undergoing a merge.

This switching process would mean that tombstones are not reaped (unless a file in the future is merged from Basement + 1 which does overlap).
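A sketch of the switch-versus-merge decision being described (the names and the file representation are assumptions):

```erlang
%% Sketch: promoting a file from Basement + 1.  Files are represented
%% here simply as their {StartKey, EndKey} ranges.
-module(switch_sketch).
-export([promote/2]).

promote(File, BasementFiles) ->
    case [BF || BF <- BasementFiles, overlaps(File, BF)] of
        [] ->
            %% No overlap: the file is switched, i.e. relabelled as a
            %% basement file.  No merge runs, so any tombstones inside
            %% it are never reaped.
            {switch, File};
        Overlapping ->
            %% Overlap: a real merge runs, and tombstones are dropped
            %% as the output is written at the basement.
            {merge, File, Overlapping}
    end.

%% Two files overlap if their key ranges intersect.
overlaps({StartA, EndA}, {StartB, EndB}) ->
    (StartA =< EndB) andalso (StartB =< EndA).
```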

This is a problem that we have not seen in production, and would require some peculiar circumstances to create.

The more general problem of tombstones at Basement + 1 has been observed in production.

martinsumner commented 4 years ago

There are a number of alternatives here:

  1. Treat index keys differently, as they are never modified - and drop them on the first merge with an older entry for the same key (as they have no value).

  2. Force a proportion of compactions to be grooming compactions (ones that favour the SST file with the most tombstones).

  3. Add additional grooming compactions (potentially triggered only when the clerk has had no work for a period).

Option 1 may encounter issues with TTL and index entries. For an index entry the TTL is the value, so potentially you could have updated index entries where the TTL changes. This is not part of the Riak implementation, but exists as a theoretical possibility in non-Riak implementations. It is also a more involved code change than the alternatives.

Option 2 will make things better without necessarily solving the problem.
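A sketch of what the file selection for Option 2 might look like (the proportion, the names and the default strategy are all assumptions, not leveled's actual clerk code):

```erlang
-module(groom_sketch).
-export([select_file/2]).

%% Assumed proportion: roughly 1 in 4 compactions are grooming ones.
-define(GROOM_PROPORTION, 0.25).

%% TombCountFun(File) is assumed to return the number of tombstones
%% recorded for an SST file.
select_file(CandidateFiles, TombCountFun) ->
    case rand:uniform() < ?GROOM_PROPORTION of
        true ->
            %% Grooming compaction: favour the file with the most
            %% tombstones, pushing them towards the basement sooner.
            hd(lists:sort(
                 fun(A, B) -> TombCountFun(A) >= TombCountFun(B) end,
                 CandidateFiles));
        false ->
            %% Default compaction: pick any candidate (placeholder).
            lists:nth(rand:uniform(length(CandidateFiles)), CandidateFiles)
    end.
```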

Option 3 may lead to fluctuations in performance. In implementing Option 3, the shape of the LSM tree would change out of hours (e.g. gaps would open above the basement). This means that as load then increases, initial performance will improve, as new merges into the top of the tree won't force a cascade while there is space in the tree. This is an improvement, but also a disadvantage, as the supportable load may be less predictable and hence harder to measure.

martinsumner commented 4 years ago

Option 2 progressed for now