Open craigfe opened 3 years ago
Nice design! :)

What happens exactly when `log_async` is full? Say that a merge of `log + data_1 + data_2 + data_3` into `data_4` is in progress: in the meantime, can a new merge of `log` into `data_1` be launched?

The sequence you describe feels like the "default" result of adapting the current merge stacking semantics to this scheme, and it sounds reasonable to me.
It's possible that there are cleverer merge strategies that try to merge smaller data segments first in order to unblock the log_async. Unless there's an obvious win here, I'm tempted to say that effort spent in that direction would be better spent just making the merges faster so that they stack less often.
This is exactly what I wanted to point out: there may be clever merge strategies.
My insight is there may also be clever parallel merge strategies, where several merges could take place at the same time, on separate domains. I don't have a specific design in mind, I just thought it was worth mentioning.
That would indeed be possible. Please note that this also leads to more variance, in other words, less predictability (particularly for worst case scenarios), and we also have to consider integrity and availability issues under our concurrent context. I agree with @CraigFe that making the (batch?) merges faster would be an easier alternative.
Since I'm not sure I understand the second solution, I'll try to explain what I understood.

The current version was straightforward:

* `data` file containing almost all the entries (sorted)
* `log` file containing the most recent entries (sorted)
* when `log` is big enough, its content is merged (k-merge since both are sorted) into `data`, and `log` is emptied

In the following image describing the new version: `sizeof(data_i) = 2 * sizeof(data_{i-1})`

* when we want to merge `log` in `data_k`, we k-merge it with all the `data_j` with `j < k`
* `log` is always merged in the smallest empty `data` file

So `log` could be merged into the last `data` file in the worst case. Also, are we sure the complexity of this operation is only `O(log n)`? Because this is the same complexity as the first version, no? (binary search in a sorted file) (I thought it would be `log(n)²`)

Thanks for the RFC ;-)
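For concreteness, the "current version" summarised above can be modelled in a few lines of Python. This is a hypothetical toy model, not the real `index` implementation: `LOG_SIZE`, `replace` and `merge` are illustrative names, and the real merge is a streaming k-merge of two sorted files rather than an in-memory sort.

```python
# Toy model of the current single-data-file scheme (hypothetical sketch).
LOG_SIZE = 4  # assumed threshold; the real log_size is configurable

log = {}    # most recent bindings
data = []   # sorted list of (key, value) pairs, "almost all the entries"

def merge():
    """Merge log into data, newest bindings winning, then empty the log."""
    global data
    merged = dict(data)      # older bindings first...
    merged.update(log)       # ...so log entries overwrite them
    data = sorted(merged.items())  # a streaming k-merge in the real thing
    log.clear()

def replace(key, value):
    log[key] = value
    if len(log) >= LOG_SIZE:
        merge()

for i in range(10):
    replace(f"k{i}", i)
print(len(data), len(log))   # two merges happened: 8 entries in data, 2 in log
```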
> * when we want to merge `log` in data_k we k-merge it with all the data_j with `j < k`
> * `log` is always merged in the smallest empty `data` file
`log` always tries to be merged into `data_1`. If `data_1` is already present, they are both merged into `data_2`, etc. until an empty `data_k` slot is found. Hence the k-way merge happens on `log + data_{1..k-1}` into `data_k`, yes :+1:
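The cascade described here can be sketched as follows. This is a hypothetical model (slot `i` stands for `data_{i+1}`; each slot is a dict rather than a sorted file), meant only to show how `log` cascades into the smallest empty slot:

```python
# Toy model of the cascading merge: log + data_{1..k-1} -> data_k,
# where data_k is the smallest empty slot (hypothetical sketch).
log_entries = []   # pending log bindings
slots = []         # slots[i] models data_{i+1}; None = empty

def merge_log():
    # find the smallest empty slot k
    k = 0
    while k < len(slots) and slots[k] is not None:
        k += 1
    if k == len(slots):
        slots.append(None)
    merged = {}
    # slots below k are all occupied; apply the oldest (largest index) first
    for i in range(k - 1, -1, -1):
        merged.update(slots[i])
        slots[i] = None
    merged.update(dict(log_entries))  # log is most recent, so it wins
    slots[k] = merged
    log_entries.clear()

# simulate five merges of a small log
for n in range(1, 6):
    log_entries[:] = [(f"k{n}a", n), (f"k{n}b", n)]
    merge_log()
occupancy = [s is not None for s in slots]
print(occupancy)   # after 5 merges: data_1 and data_3 occupied
```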
> * by design an entry is present in at most one file
This is only true if bindings cannot be overwritten. The current semantics is `replace`, so a key may be in the `log` and the `data` file with a different value, the `data` binding being erased on the next merge.
Similarly, there could be multiple occurrences of the same key in multiple data files due to replacement, and only the most recent binding should be kept when merging.
That being said, I don't think this replacement feature is used in production anywhere, so we could also drop it if necessary.
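A minimal sketch of this `replace` semantics during a merge (hypothetical helper name; inputs are modelled as dicts ordered from oldest to newest): only the most recent binding for each key survives, which is also why `len (merge a b)` can be smaller than `len a + len b`.

```python
# Merging with replace semantics: the most recent binding wins
# (hypothetical sketch, not the real index merge).
def merge_most_recent(*files_oldest_first):
    out = {}
    for f in files_oldest_first:   # later (newer) files overwrite earlier ones
        out.update(f)
    return sorted(out.items())

data = {"a": 1, "b": 2, "c": 3}   # older bindings
log = {"b": 20, "d": 4}           # newer: "b" was replaced
merged = merge_most_recent(data, log)
print(merged)                          # "b" keeps the newer value 20
print(len(merged), "<", len(data) + len(log))  # duplicates collapse
```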
> * are we sure the complexity of this operation is only `O(log n)` because this is the same complexity as the first version, no? (binary search in a sorted file)
Actually, searching in a `data` file is `O(1)` thanks to the fanout, which lets us do a single read per `find` (the interpolation search happens in a region of ~constant size); since the new layout has `log n` data files, the `find` complexity is `O(log n)`.
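In other words, `find` under the new layout probes the log and then each data file in turn. A hypothetical sketch (each per-file probe is a dict lookup here, standing in for the ~`O(1)` fanout-guided read; probing newest-first so the most recent binding is returned):

```python
# Toy model of find under the multi-file layout (hypothetical sketch):
# probe the log, then data_1, data_2, ... returning the first hit.
def find(key, log, data_files_newest_first):
    if key in log:
        return log[key]
    for data in data_files_newest_first:   # O(log n) files in the worst case
        if data is not None and key in data:   # ~O(1) per file via the fanout
            return data[key]
    raise KeyError(key)

log = {"x": 99}
data_files = [{"a": 0}, None, {"a": -1, "y": 7}]  # data_1, data_2, data_3
print(find("x", log, data_files))   # found in the log: 99
print(find("a", log, data_files))   # newest binding of "a" wins: 0
print(find("y", log, data_files))   # found only in the oldest file: 7
```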
> This is only true if bindings cannot be overwritten. The current semantics is `replace`, so a key may be in the `log` and the `data` file with a different value, the `data` binding being erased on the next merge.
Ah yes, of course!
> Actually, searching in a data file is O(1) thanks to the fanout, that lets us do a single read per find (the interpolation search is happening in a region of ~constant size); since the new layout has log n data files, the find complexity is O(log n).
Oh right, I forgot about the fan-out, thanks for the clarification
A note about these binomial data files: in practice this also degrades the performance of additions to the index by `irmin-pack`, since we always ensure an entry is absent before adding it (see here). So effectively each `replace` is preceded by a call to `find`, which would now be degraded.

(perhaps it's time for bloom filters to shine again!)
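The bloom-filter idea could look something like this purely illustrative sketch (none of this exists in `index`; the class, sizes and hash choice are all assumptions): a small filter per data file answers "definitely absent" without touching the disk, so the absent-check before each `replace` can skip most file probes.

```python
# Minimal bloom filter sketch (hypothetical, not part of index):
# no false negatives, occasional false positives.
import hashlib

class Bloom:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _hashes(self, key):
        # derive k bit positions from sha256 of the salted key
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for h in self._hashes(key):
            self.bits |= 1 << h

    def maybe_contains(self, key):
        # False => key is certainly absent, so the disk read can be skipped
        return all(self.bits >> h & 1 for h in self._hashes(key))

bf = Bloom()
bf.add("present")
print(bf.maybe_contains("present"))   # True: bloom filters never miss members
```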
Could the reads be improved by some kind of disk pre-fetching? In practice, it would mean querying the `data_{k-1}` file before knowing whether the desired entry is missing from `data_k`.
As a side note: @let-def cleverly noticed that the shape of the data files is exactly the binary representation of the number of merges, where 1 is in position `n` of the binary representation iff `data_n` exists. This means that a merge between multiple data files only happens when the number of merges is a power of 2.
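This observation can be checked in a toy model (hypothetical code: slots are booleans standing for "data file exists", and each merge clears the slots below the smallest empty one, exactly like a binary increment):

```python
# Check that after m merges, slot n is occupied iff bit n of m is set
# (toy model of @let-def's observation; hypothetical sketch).
def occupancy_after(m):
    slots = []
    for _ in range(m):
        k = 0
        while k < len(slots) and slots[k]:
            k += 1                     # find smallest empty slot...
        if k == len(slots):
            slots.append(False)
        for i in range(k):
            slots[i] = False           # ...its lower slots are consumed
        slots[k] = True                # carry propagation = binary increment
    return slots

for m in range(1, 9):
    bits = [bool(m >> n & 1) for n in range(m.bit_length())]
    assert occupancy_after(m) == bits
print("occupancy matches the binary representation for m = 1..8")
```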
> In practice, it would mean querying the data_{k-1} file before knowing whether the desired entry is missing from data_k.
Aren't we trying to read from `data_{k-1}` before `data_k`, since an entry can be present in multiple data files and we have to return the most recent one? (it doesn't change anything about pre-fetching, but you would pre-fetch bigger files while reading smaller ones)
> As a side note: @let-def cleverly noticed that the shape of the data files is exactly the binary representation of the number of merges, where 1 is in position `n` of the binary representation iff `data_n` exists. Which means that the merge between multiple data files only happens when the number of merges is a power of 2.
Pursuing the analogy with binary numbering, it is possible to control the number of merges by using a suitable numbering system (e.g. https://en.wikipedia.org/wiki/Balanced_ternary). This allows merging lazily, so that we never have to suddenly merge e.g. 30 files. However, the size of individual files will grow (though in my benchmarks with a custom `repr`, the more files to merge the faster the process, at least up to an 8-way merge).
> Aren't we trying to read from data_{k-1} before data_k
If half of the entries are contained in `data_k`, I would read this one first, just because there is a higher probability that the value resides in this one. (Ignoring the problem of duplicate entries)
This describes a potential significant improvement to the write amplification properties of large indices, at the cost of bounded read performance.

Currently, Index uses a very simple scheme for maintaining a sorted sequence of entries: all updates first go to the log, and then once the log exceeds a certain size we merge it with the data file. This has the unfortunate property that each merge is `O(n)` in the total number of bindings written so far (and each merge is `O(1)` apart, leading to `O(n²)` write amplification). In an index with 700 million bindings and `log_size = 10 million * len(entry)`, we rewrite 690 million redundant entries per 10 million writes (and this gets worse at a linear rate).

A different scheme might use more than one data file, allowing the sorted property to be only partially restored on each merge. We could arrange these data files in a sequence and size them in powers of two. Each merge then reconstructs the smallest missing data slot by doing a k-way merge on the log + the (k-1) data files below it:

With this scheme, the amortized time complexity of merges is:

...

with the worst case still being `O(n)` (or more precisely `O(n log log n)`), for a merge that happens each time the index doubles in size.

**Advantages**

* Write amplification is reduced from `O(n²)` to `O(n log n)`. On a Tezos-sized index, we can expect the average duration of merges to decrease by a factor of ~ 30. (!)

**Disadvantages**

* Worst-case read complexity becomes `O(log n)`, since the data files must be read in sequence.
* `len (merge a b)` can be less than `len a + len b` due to duplicate bindings, so data file sizes will not be nice powers of two.
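A quick back-of-the-envelope check of the figure quoted above, using the stated parameters (n = 700 million bindings, a 10-million-entry log): under the single-file scheme, every merge recopies the whole data file, so each batch of 10 million writes rewrites the other 690 million entries.

```python
# Arithmetic check of the write amplification example above
# (parameters taken from the text; nothing else is assumed).
n = 700_000_000          # total bindings in the index
log_size = 10_000_000    # entries accumulated in the log before a merge

# single-file scheme: a merge rewrites the entire data file,
# i.e. everything except the fresh log entries is redundant work
entries_rewritten_per_merge = n - log_size
print(entries_rewritten_per_merge)   # 690 million, as stated
```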