mirage / index

A platform-agnostic multi-level index

[RFC] Binomial data files #299

Open craigfe opened 3 years ago

craigfe commented 3 years ago

This describes a potentially significant improvement to the write amplification of large indices, at the cost of a bounded degradation in read performance.

Currently, Index uses a very simple scheme for maintaining a sorted sequence of entries: all updates first go to the log, and once the log exceeds a certain size we merge it with the data file. This has the unfortunate property that each merge is O(n) in the total number of bindings written so far, and merges happen every O(1) writes, leading to O(n²) total write amplification. In an index with 700 million bindings and log_size = 10 million * len(entry), we rewrite 690 million redundant entries per 10 million writes (and this gets worse at a linear rate).
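
To make the quadratic behaviour concrete, here is a rough back-of-the-envelope model (pure illustration, not Index code; names are made up and it counts entries rather than bytes):

```ocaml
(* Back-of-the-envelope model of the current scheme: one merge per [log_size]
   additions, and the k-th merge rewrites roughly [k * log_size] entries, so
   the total number of entries written to data files is ~ n^2 / (2 * log_size). *)
let total_entries_rewritten ~n ~log_size =
  let merges = n / log_size in
  let rec sum acc k =
    if k > merges then acc else sum (acc + (k * log_size)) (k + 1)
  in
  sum 0 1

let () =
  (* 700 million bindings with a 10-million-entry log: ~25 billion entries
     rewritten in total, the last merge alone rewriting ~690 million
     already-sorted ones. *)
  Printf.printf "%d\n" (total_entries_rewritten ~n:700_000_000 ~log_size:10_000_000)
```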

A different scheme might use more than one data file, allowing the sorted property to be only partially restored on each merge. We could arrange these data files in a sequence and size them in powers of two. Each merge then reconstructs the smallest missing data slot by doing a k-way merge on the log + the (k-1) data files below it:
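
For illustration, here is a minimal in-memory sketch of that cascade (the slot representation and helper are hypothetical, not the actual Index API; a real merge would stream from disk):

```ocaml
(* Slots model the data files: [slots.(i) = Some entries] means the (i+1)-th
   data file exists and holds a sorted list of entries.  Assumes the array is
   large enough to always have a free top slot. *)
let merge_log slots log =
  (* Find the smallest empty slot; every slot below it is occupied. *)
  let rec first_empty i =
    if i >= Array.length slots || slots.(i) = None then i else first_empty (i + 1)
  in
  let k = first_empty 0 in
  (* Merge the sorted log with the k occupied slots below the empty one.
     Here we simply concatenate and re-sort; a real implementation would do a
     streaming k-way merge with a min-heap over the k+1 sorted inputs. *)
  let inputs = List.sort compare log :: List.init k (fun i -> Option.get slots.(i)) in
  let merged = List.sort compare (List.concat inputs) in
  (* The result fills the empty slot; everything below it is empty again. *)
  for i = 0 to k - 1 do slots.(i) <- None done;
  slots.(k) <- Some merged
```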

With this scheme, the amortized time complexity of merges is:

(           log_size  ;; we always have to sort the log file and add it to data
  + 1/2 * 2 log_size  ;; ... half the time, the lowest slot is filled, so we merge with it
  + 1/4 * 4 log_size  ;; ... one quarter of the time, this cascades into merging with the next slot up etc.
  + ...
  + 1/n * n log_size ) * log log n  ;; extra factor due to k-way merge over log(n) many data files w/ min-heap
                                    ;; (happens in memory, so likely not the bottleneck)
= O(log n * log log n)

... with the worst case still being O(n) (or more precisely O(n log log n)), for a merge that happens each time the index doubles in size.

Advantages

Disadvantages

Ngoguey42 commented 3 years ago

Nice design! :)

What happens exactly when log_async is full?

Say that:

  1. we launch a merge of log + data_1 + data_2 + data_3 into data_4;
  2. the main thread fills log_async before the merge ends.

Is the main thread then stuck until the end of the merge, at which point a new merge of log into data_1 can be launched?

craigfe commented 3 years ago

The sequence you describe feels like the "default" result of adapting the current merge stacking semantics to this scheme, and it sounds reasonable to me.

It's possible that there are cleverer merge strategies that try to merge smaller data segments first in order to unblock the log_async. Unless there's an obvious win here, I'm tempted to say that effort spent in that direction would be better spent just making the merges faster so that they stack less often.

Ngoguey42 commented 3 years ago

This is exactly what I wanted to point out: there may be clever merge strategies.

My point is that there may also be clever parallel merge strategies, where several merges could take place at the same time on separate domains. I don't have a specific design in mind; I just thought it was worth mentioning.

pascutto commented 3 years ago

That would indeed be possible. Note, however, that this also leads to more variance, in other words less predictability (particularly in worst-case scenarios), and we would also have to consider integrity and availability issues in our concurrent setting. I agree with @CraigFe that making the (batch?) merges faster would be an easier alternative.

mattiasdrp commented 3 years ago

Since I'm not sure I understand the second solution, I'll try to explain what I understood.

The current version was straight-forward:

In the following image describing the new version:

(image)

Thanks for the RFC ;-)

pascutto commented 3 years ago

* when we want to merge `log` into data_k, we k-merge it with all the data_j with `j < k`

* `log` is always merged into the smallest empty `data` file

log always tries to be merged into data_1. If data_1 is already present, they are both merged into data_2, etc. until an empty data_k slot is found. Hence the k-way merge happens on log + data_{1..k-1} into data_k, yes :+1:

* by design an entry is present in at most one file

This is only true if bindings cannot be overwritten. The current semantics is `replace`, so a key may be present in both the log and a data file with different values, the data binding being discarded on the next merge. Similarly, the same key could occur in multiple data files due to replacement, and only the most recent binding should be kept when merging. That being said, I don't think this replacement feature is used in production anywhere, so we could also drop it if necessary.
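
A tiny sketch of that rule (hypothetical types, not the actual merge code): the inputs are given newest-first, and for a duplicated key only the binding from the newest input survives.

```ocaml
(* Duplicate handling during a merge, with the inputs ordered newest-first
   (log, then data_1, data_2, ...): the first binding seen for a key wins. *)
let merge_keep_newest (inputs : (string * string) list list) =
  let seen = Hashtbl.create 16 in
  let keep_first acc (key, value) =
    if Hashtbl.mem seen key then acc           (* older duplicate: drop it *)
    else (Hashtbl.add seen key (); (key, value) :: acc)
  in
  List.concat inputs
  |> List.fold_left keep_first []
  |> List.sort (fun (k1, _) (k2, _) -> String.compare k1 k2)
```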

* are we sure the complexity of this operation is only `O(log n)`? This is the same complexity as the first version, no? (binary search in a sorted file)

Actually, searching in a data file is O(1) thanks to the fanout, which lets us do a single read per find (the interpolation search happens in a region of ~constant size); since the new layout has log n data files, the find complexity is O(log n).
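
For illustration, a find over the new layout could look like the following sketch (the helpers are hypothetical stand-ins; `find_in_data` represents the existing fanout-plus-interpolation lookup that costs a single read):

```ocaml
(* Sketch of find across the new layout (hypothetical signatures).  Each call
   to [find_in_data] is O(1): the fanout maps the key to a ~constant-size
   region, which is read once and searched by interpolation.  With at most
   log n data files, the whole find is O(log n) reads. *)
let find ~find_in_log ~find_in_data ~log ~data_files key =
  match find_in_log log key with
  | Some v -> Some v
  | None ->
      (* [data_files] is assumed to be ordered newest-first, so the most
         recent binding for a duplicated key is returned. *)
      List.find_map (fun data -> find_in_data data key) data_files
```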

mattiasdrp commented 3 years ago

This is only true if bindings cannot be overwritten. The current semantics is `replace`, so a key may be present in both the log and a data file with different values, the data binding being discarded on the next merge.

Ah yes, of course!

Actually, searching in a data file is O(1) thanks to the fanout, which lets us do a single read per find (the interpolation search happens in a region of ~constant size); since the new layout has log n data files, the find complexity is O(log n).

Oh right, I forgot about the fanout, thanks for the clarification.

pascutto commented 3 years ago

A note about these binomial data files: in practice, this scheme also degrades the performance of additions to the index by irmin-pack, since we always ensure an entry is absent before adding it (see here). So effectively each replace is preceded by a call to find, which would now be slower. (Perhaps it's time for Bloom filters to shine again!)
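
If that find-before-replace became the bottleneck, a Bloom filter in front of the data files could answer most absence checks without touching disk. A minimal self-contained sketch (illustrative only; the sizes, hash scheme and names are made up, not an existing library's API):

```ocaml
(* Minimal Bloom filter: [maybe_mem] returning [false] means the key is
   definitely absent, so the pre-replace find can be skipped entirely. *)
type bloom = { bits : Bytes.t; nbits : int; nhash : int }

let create ~nbits ~nhash =
  { bits = Bytes.make ((nbits + 7) / 8) '\000'; nbits; nhash }

(* Derive [nhash] bit positions from two hashes of the key. *)
let indices t key =
  let h1 = Hashtbl.hash key and h2 = Hashtbl.hash (key, 0x9e3779b9) in
  List.init t.nhash (fun i -> abs (h1 + (i * h2)) mod t.nbits)

let set_bit t i =
  let byte = Char.code (Bytes.get t.bits (i / 8)) in
  Bytes.set t.bits (i / 8) (Char.chr (byte lor (1 lsl (i mod 8))))

let get_bit t i =
  Char.code (Bytes.get t.bits (i / 8)) land (1 lsl (i mod 8)) <> 0

let add t key = List.iter (set_bit t) (indices t key)
let maybe_mem t key = List.for_all (get_bit t) (indices t key)
```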

Ngoguey42 commented 3 years ago

Could the reads be improved by some kind of disk pre-fetching?

In practice, it would mean querying data_{k-1} before knowing whether the desired entry is missing from data_k.

samoht commented 3 years ago

As a side note: @let-def cleverly noticed that the shape of the data files is exactly the binary representation of the number of merges, with a 1 in position n iff data_n exists. This means that a merge of all the existing data files only happens when the number of merges reaches a power of 2.
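
Concretely (a small illustrative helper, not part of Index): the occupied slots after m merges are the set bits of m, and a full cascade into a fresh top slot happens exactly when m reaches a power of two.

```ocaml
(* Positions of the set bits of [m] = the data files that exist after [m]
   merges (slot 1 is the least significant bit). *)
let occupied_slots m =
  let rec go i acc m =
    if m = 0 then List.rev acc
    else go (i + 1) (if m land 1 = 1 then i :: acc else acc) (m lsr 1)
  in
  go 1 [] m

(* occupied_slots 6 = [2; 3]       data_2 and data_3 exist
   occupied_slots 7 = [1; 2; 3]
   occupied_slots 8 = [4]          the 8th merge cascaded everything into data_4 *)
```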

mattiasdrp commented 3 years ago

In practice, it would mean querying data_{k-1} before knowing whether the desired entry is missing from data_k.

Aren't we trying to read from data_{k-1} before data_k, since an entry can be present in multiple data files and we have to return the most recent one? (It doesn't change anything about pre-fetching, but you would pre-fetch the bigger files while reading the smaller ones.)

let-def commented 3 years ago

As a side note: @let-def cleverly noticed that the shape of the data files is exactly the binary representation of the number of merges, with a 1 in position n iff data_n exists. This means that a merge of all the existing data files only happens when the number of merges reaches a power of 2.

Pursuing the analogy with binary numbering, it is possible to control the number of merges by using a suitable numbering system (e.g. https://en.wikipedia.org/wiki/Balanced_ternary). This allows merging lazily, so that we never have to suddenly merge e.g. 30 files. The downside is that the size of individual files will grow (although in my benchmarks with a custom repr, the more files to merge, the faster the process, at least up to an 8-way merge).

Ngoguey42 commented 3 years ago

Aren't we trying to read from data_{k-1} before data_k

If half of the entries are contained in data_k, I would read that one first, just because there is a higher probability that the value resides in it (ignoring the problem of duplicate entries).