Took me a while to grasp the idea. From our conversation on the GapSet PR, I think the only case that we are trying to solve is the one where a gap comes in after the invalidation of a key. That's the only race condition we can have, I think. If that's the case, we could simply buffer all of the invalidations for a certain amount of time, and use that to refuse incoming gaps that may have been illegally obtained. That would be a lot simpler, but maybe I'm missing the bigger picture here. Am I missing something?
> From our conversation on the GapSet PR, I think the only case that we are trying to solve is the one where a gap comes in after the invalidation of a key. That's the only race condition we can have, I think.
I agree; I also don't see other cases that could cause a race condition. If a gap comes in before the invalidation of a key, we can remove it immediately from the exposed set.
> If that's the case, we could simply buffer all of the invalidations for a certain amount of time, and use that to refuse incoming gaps that may have been illegally obtained. That would be a lot simpler, but maybe I'm missing the bigger picture here. Am I missing something?
That would work, I think, and would be conceptually a bit simpler to understand. It might also mean a bit less code to write, but I'm not sure, as the current line count is already rather low (excluding tests, obviously!).
I don't remember exactly the thought process that led me to the solution I implemented, but one thing I had in mind is that, to cope with high-traffic platforms (both in PUT rate and in listing rate), it would be more efficient to have a staged approach. If I'm not mistaken, if we keep the operations around for a certain amount of time and use them to invalidate incoming gaps immediately (in a naive, unoptimized implementation at least), we potentially have to check each operation's key against each gap that comes in, so a complexity of "number of candidate listing gaps" x "total number of operations" over the chosen time frame. The real complexity obviously depends on the exact parameter values and on the amount of traffic in both listings and operations, but my feeling is that it could be a CPU and performance eater in the worst-case scenarios, i.e. exactly when we want to save CPU as much as possible. Our largest platforms can see on the order of 4K op/s on versioned buckets per raft session (so about 8K DB keys per second, usually stored in batches of up to 50 entries). They can also run multiple bucket listings in parallel, leading to potentially tens of gaps created per second if the gap size is relatively short (say, 1000 keys).
The staged approach can optimize this because, in the worst case, we do this processing only once per period. Sure, we potentially have multiple gaps to invalidate at once, but since they are stored in a temporary GapSet, they scale better: the GapSet uses a data structure with expected O(log n) cost, and there are simple optimizations in removeOverlappingGaps() that become more efficient with more gaps (especially for chained gaps that lie entirely outside the updated range, which should be a very common case in practice, with a basic listing progressing and creating chained gaps as it goes).
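To make the orders of magnitude concrete, here is a rough back-of-the-envelope comparison using the traffic figures above; the 1-second window and the per-check cost model are illustrative assumptions, not measurements:

```ts
// Rough illustration of why pairwise checking can get expensive, using the
// figures quoted above (assumptions, not measurements).
const keysPerSecond = 8000;   // ~4K op/s on a versioned bucket -> ~8K DB keys/s
const gapsPerSecond = 20;     // "tens of gaps to create per second"
const windowSeconds = 1;      // hypothetical buffering period

// Naive approach: each incoming gap is checked against every buffered key.
const naiveChecks =
    (keysPerSecond * windowSeconds) * (gapsPerSecond * windowSeconds);

// Staged approach: each buffered key is looked up once in a temporary GapSet
// with O(log n) lookups (n = number of staged gaps).
const stagedChecks = keysPerSecond * windowSeconds
    * Math.ceil(Math.log2(gapsPerSecond * windowSeconds));

console.log({ naiveChecks, stagedChecks }); // ~160000 vs ~40000 per window
```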
On the solution you suggested, I think node.js performance with thousands of pending timers should be good enough, so that's probably not an issue. As a side note, we would probably need to track the timers, e.g. to avoid having to wait for the remaining ones when shutting down the process; this is a minor implementation detail, but it adds a bit of extra code.
I think I will keep the current implementation as is, but if you have other concerns or otherwise disagree with some points in the analysis, your comments are much appreciated and I can revisit this if need be!
Yep, very good point.
I wrote this comment before seeing the amount of code and I agree that it's compact/simple enough not to warrant a rewrite.
On complexity, I hadn't thought about it yet. I think that we could also keep the timed-out keys (those that had a recent write and are forbidding any gap from containing them) in a splay tree: this means that checking whether a gap `{firstKey, lastKey}` is eligible to be in the gapSet is as quick as `gapSet.leastGreaterThanOrEqualTo(firstKey) > lastKey`.
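To illustrate the check, here is a minimal sketch using a plain sorted array as a stand-in for the splay tree / SortedSet; `InvalidatingKeys` and `isGapEligible` are hypothetical names for illustration, not Arsenal APIs:

```ts
// Sorted collection of keys that recently received a write.
class InvalidatingKeys {
    private keys: string[] = [];

    // Insert a key (O(n) with an array; a splay tree or SortedSet is O(log n)).
    add(key: string): void {
        const i = this.lowerBound(key);
        if (this.keys[i] !== key) {
            this.keys.splice(i, 0, key);
        }
    }

    // Smallest stored key >= `key`, or null if none.
    leastGreaterThanOrEqualTo(key: string): string | null {
        const i = this.lowerBound(key);
        return i < this.keys.length ? this.keys[i] : null;
    }

    private lowerBound(key: string): number {
        let lo = 0, hi = this.keys.length;
        while (lo < hi) {
            const mid = (lo + hi) >> 1;
            if (this.keys[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}

// A gap [firstKey, lastKey] is eligible if no invalidating key falls inside it.
function isGapEligible(keys: InvalidatingKeys,
                       firstKey: string, lastKey: string): boolean {
    const k = keys.leastGreaterThanOrEqualTo(firstKey);
    return k === null || k > lastKey;
}
```

With K buffered keys, the eligibility check for one gap is then O(log K) instead of a scan over all keys, which is the trade-off analyzed below.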
This means that, for K invalidating operations and G gaps inserted (in the considered time frame):

- checking each incoming key against a splay tree of gaps costs `O(K*log(G))`, plus, for each gap, inserting it into the splay tree: `O(G*log(G))`
- checking each incoming gap against a splay tree of keys costs `O(G*log(K))`, plus, for each key, inserting it into the splay tree: `O(K*log(K))`

Assuming the O constants are similar, it's `O((K+G)*log(G))` against `O((K+G)*log(K))`, so a hedge towards your approach, since K >> G probably. In terms of memory, the comparison is the same: `O(G)` space against `O(K)` space, but I'm not sure this is an issue since we are mostly CPU-bound and `O(K)` will not be that large (<10k even with a very conservative time frame).

So all in all, let's go for your approach :+1:
@fredmnl thanks for your thorough analysis. Indeed, we could have a splay tree for the incoming keys to speed up the intersection checks before inserting a gap; it should be an easy change.
@fredmnl I ended up preferring the idea you suggested of using a SortedSet for storing the keys used for invalidation, because it provides stronger performance guarantees (the previous approach would slow down significantly as the number of keys to invalidate at once increased). So I made the changes in removeOverlappingGaps() in a new commit on the GapSet PR, and adapted this one as well.
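For illustration, here is a rough sketch of the kind of batch invalidation this enables; the `Gap` type, the sorted-array gap store, and this standalone `removeOverlappingGaps` signature are simplified stand-ins, not the real Arsenal function:

```ts
interface Gap { firstKey: string; lastKey: string; }

// Remove every gap that contains at least one of the given keys.
// Both `gaps` (sorted by firstKey, non-overlapping) and `keys` are assumed
// sorted, so a single merge-like pass over the two sequences is enough.
function removeOverlappingGaps(gaps: Gap[], keys: string[]): Gap[] {
    const kept: Gap[] = [];
    let ki = 0;
    for (const gap of gaps) {
        // Skip keys that fall before this gap starts.
        while (ki < keys.length && keys[ki] < gap.firstKey) {
            ki += 1;
        }
        // Keep the gap only if the next key (if any) lies beyond it.
        if (ki >= keys.length || keys[ki] > gap.lastKey) {
            kept.push(gap);
        }
    }
    return kept;
}
```

Because both the keys and the gaps are kept sorted, the whole batch is processed in one pass rather than checking every key against every gap.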
Great, glad I could help! Approving the commit :white_check_mark:
My role is to assist you with the merge of this pull request. Please type `@bert-e help` to get information on this process, or consult the user documentation.
Status report is not available.
The `Fix Version/s` in issue ARSN-391 contains:

- 7.70.24

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

- 7.70.24
- 8.1.123

Please check the `Fix Version/s` of ARSN-391, or the target branch of this pull request.
/approve
A conflict has been raised during the creation of integration branch `w/8.1/feature/ARSN-391-gapCache` with contents from `feature/ARSN-391-gapCache` and `development/8.1`.

I have not created the integration branch.

Here are the steps to resolve this conflict:
$ git fetch
$ git checkout -B w/8.1/feature/ARSN-391-gapCache origin/development/8.1
$ git merge origin/feature/ARSN-391-gapCache
$ # <intense conflict resolution>
$ git commit
$ git push -u origin w/8.1/feature/ARSN-391-gapCache
The following options are set: approve
The build for commit did not succeed in branch w/8.1/feature/ARSN-391-gapCache.
The following options are set: approve
The changeset has received all authorizations and has been added to the relevant queue(s). The queue(s) will be merged in the target development branch(es) as soon as builds have passed.
The changeset will be merged in:
:heavy_check_mark: development/7.70
:heavy_check_mark: development/8.1
The following branches will NOT be impacted:
development/6.4
development/7.10
development/7.4
There is no action required on your side. You will be notified here once the changeset has been merged. In the unlikely event that the changeset fails permanently on the queue, a member of the admin team will contact you to help resolve the matter.
IMPORTANT
Please do not attempt to modify this pull request.
If you need this pull request to be removed from the queue, please contact a member of the admin team now.
The following options are set: approve
I have successfully merged the changeset of this pull request into targeted development branches:
:heavy_check_mark: development/7.70
:heavy_check_mark: development/8.1
The following branches have NOT changed:
development/6.4
development/7.10
development/7.4
Please check the status of the associated issue ARSN-391.
Goodbye jonathan-gramain.
Introduce a new helper class `GapCache` that sits on top of a set of `GapSet` instances and delays the exposure of gaps by a configurable time, to guarantee atomicity with respect to invalidation from overlapping PUT/DELETE operations.

The way it is implemented is the following:

- three update sets are used, each containing a `GapSet` instance and a series of key update batches: `staging`, `frozen`, and `exposed`
- `staging` receives the new gaps from `setGap()` calls and the updates from `removeOverlappingGaps()`
- `lookupGap()` only returns gaps present in `exposed`
- every `exposureDelayMs` milliseconds, the following happens (see the sketch below):
  - the `frozen` gaps get invalidated by all key updates buffered in either the `staging` or the `frozen` update set
  - the remainder of the `frozen` gaps is merged into `exposed` (via internal calls to `exposed.setGap()`)
  - the `staging` update set becomes the new `frozen` update set (both the gaps and the key updates)
  - a new, empty `staging` update set is instantiated

This guarantees that any gap set via `setGap()` is only exposed after a minimum of `exposureDelayMs`, and a maximum of twice that time (plus extra needed processing time). Also, keys passed to `removeOverlappingGaps()` are kept in memory for at least `exposureDelayMs`, so they can invalidate new gaps that are created in this time frame.

Combined with the assurance that `setGap()` is never called more than `exposureDelayMs` after the listing process started (from a DB snapshot), this guarantees that all gaps not yet exposed have been invalidated by any overlapping PUT/DELETE operation, hence exposed gaps are still valid at the time they are exposed. They may still be invalidated thereafter by future calls to `removeOverlappingGaps()`.

The number of gaps that can be cached is bounded by the `maxGaps` attribute. The current strategy consists of simply not adding new gaps when this limit is reached, relying solely on `removeOverlappingGaps()` to make room for new gaps. In the future we could consider implementing an eviction mechanism to remove less-used gaps and/or gaps with smaller weights, but today the cost vs. benefit of doing this is unclear.
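For readers skimming the description above, here is a condensed TypeScript sketch of the staging/frozen/exposed rotation; it uses a simplified stand-in for `GapSet` (plain arrays, naive overlap checks) and omits `maxGaps` handling and error paths, so it only models the mechanics and is not the actual Arsenal implementation:

```ts
interface Gap { firstKey: string; lastKey: string; }

// Simplified stand-in for a GapSet: just an array with naive scans.
class SimpleGapSet {
    items: Gap[] = [];
    setGap(gap: Gap): void { this.items.push(gap); }
    lookupGap(key: string): Gap | null {
        return this.items.find(g => g.firstKey <= key && key <= g.lastKey) ?? null;
    }
    removeOverlappingGaps(keys: string[]): void {
        this.items = this.items.filter(
            g => !keys.some(k => g.firstKey <= k && k <= g.lastKey));
    }
}

// Sketch of the staging/frozen/exposed rotation described above.
class GapCacheSketch {
    private staging = { gaps: new SimpleGapSet(), updates: [] as string[] };
    private frozen = { gaps: new SimpleGapSet(), updates: [] as string[] };
    private exposed = new SimpleGapSet();
    private timer: ReturnType<typeof setInterval>;

    constructor(exposureDelayMs: number) {
        this.timer = setInterval(() => this.rotate(), exposureDelayMs);
    }

    // Stop the rotation timer, e.g. when shutting down the process.
    stop(): void { clearInterval(this.timer); }

    // New gaps land in `staging` and are not visible to lookupGap() yet.
    setGap(gap: Gap): void { this.staging.gaps.setGap(gap); }

    // Key updates are buffered in `staging`; gaps already exposed are
    // invalidated right away, as discussed in the review thread.
    removeOverlappingGaps(keys: string[]): void {
        this.staging.updates.push(...keys);
        this.exposed.removeOverlappingGaps(keys);
    }

    // Only `exposed` gaps are ever returned.
    lookupGap(key: string): Gap | null { return this.exposed.lookupGap(key); }

    private rotate(): void {
        // 1. Invalidate frozen gaps with all buffered updates (staging + frozen).
        const updates = [...this.staging.updates, ...this.frozen.updates];
        this.frozen.gaps.removeOverlappingGaps(updates);
        // 2. Merge the surviving frozen gaps into the exposed set.
        this.frozen.gaps.items.forEach(gap => this.exposed.setGap(gap));
        // 3. staging becomes the new frozen; start a fresh, empty staging set.
        this.frozen = this.staging;
        this.staging = { gaps: new SimpleGapSet(), updates: [] };
    }
}
```

In the real class, the temporary `GapSet` structures and the sorted set of invalidation keys discussed in the review above make the invalidation and merge steps cheaper than these naive scans.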