Took me a while to grasp the idea. From our conversation on the GapSet PR, I think the only case that we are trying to solve is the one where a gap comes in after the invalidation of a key. That's the only race condition we can have, I think. If that's the case, we could simply buffer all of the invalidations for a certain amount of time, and use that to refuse incoming gaps that may have been illegally obtained. That would be a lot simpler, but maybe I'm missing the bigger picture here. Am I missing something?
> From our conversation on the GapSet PR, I think the only case that we are trying to solve is the one where a gap comes in after the invalidation of a key. That's the only race condition we can have, I think.
I agree; I also don't see other cases that could cause a race condition. If a gap comes in before the invalidation of a key, we can remove it immediately from the exposed set.
> If that's the case, we could simply buffer all of the invalidations for a certain amount of time, and use that to refuse incoming gaps that may have been illegally obtained. That would be a lot simpler, but maybe I'm missing the bigger picture here. Am I missing something?
That would work, I think, and would be conceptually a bit simpler to understand. It might also mean a bit less code to write, but I'm not sure, as the current line count is already rather low (excluding tests, obviously!).
I don't remember exactly the thought process that led me to the solution I implemented, but one thing I had in mind is that, to cope with high-traffic platforms (both in PUT rate and in listing rate), it would be more efficient to have a staged approach. If I'm not mistaken, if we keep the operations around for a certain amount of time and use them to invalidate incoming gaps immediately (in a naive, unoptimized implementation at least), we potentially have to check each operation's key against each gap that comes in, so a complexity of "number of candidate listing gaps" x "total number of operations" over the chosen time frame. The real complexity obviously depends on the exact parameter values and on the amount of traffic in both listings and operations, but my feeling is that it could be a CPU and performance eater in the worst-case scenarios, i.e. exactly when we want to save CPU as much as possible. Our largest platforms can see on the order of 4K op/s on versioned buckets per raft session (so about 8K DB keys per second, usually stored in batches of up to 50 entries). They can also run multiple bucket listings in parallel, leading to potentially tens of gaps created per second if the gap size is relatively short (say, 1000 keys).
The staged approach can optimize this because, in the worst case, we do this processing only once per period. Sure, we potentially have multiple gaps to invalidate at once, but since they are stored in a temporary GapSet, they scale better: the GapSet uses a data structure with expected O(log n) cost, and there are simple optimizations in removeOverlappingGaps() that become more efficient with more gaps (especially for chained gaps that lie entirely outside the updated range, which should be a very common case in practice, with a basic listing progressing and creating chained gaps as it goes).
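To make the orders of magnitude concrete, here is a rough back-of-the-envelope comparison using the traffic figures above; the 1-second window and the per-check cost model are illustrative assumptions, not measurements:

```ts
// Rough illustration of why pairwise checking can get expensive, using the
// figures quoted above (assumptions, not measurements).
const keysPerSecond = 8000;   // ~4K op/s on a versioned bucket -> ~8K DB keys/s
const gapsPerSecond = 20;     // "tens of gaps to create per second"
const windowSeconds = 1;      // hypothetical buffering period

// Naive approach: each incoming gap is checked against every buffered key.
const naiveChecks =
    (keysPerSecond * windowSeconds) * (gapsPerSecond * windowSeconds);

// Staged approach: each buffered key is looked up once in a temporary GapSet
// with O(log n) lookups (n = number of staged gaps).
const stagedChecks = keysPerSecond * windowSeconds
    * Math.ceil(Math.log2(gapsPerSecond * windowSeconds));

console.log({ naiveChecks, stagedChecks }); // ~160000 vs ~40000 per window
```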
On the solution you suggested, I think node.js performance with thousands of pending timers should be good enough, so that's probably not an issue. As a side note, we would probably need to track the timers, e.g. to avoid having to wait for the remaining ones when shutting down the process; this is a minor implementation detail, but it adds a bit of extra code.
I think I will keep the current implementation as is, but if you have other concerns or otherwise disagree with some points in the analysis, your comments are much appreciated and I can revisit this if need be!
Yep, very good point.
I wrote this comment before seeing the amount of code and I agree that it's compact/simple enough not to warrant a rewrite.
On complexity, I hadn't thought about it yet. I think that we could also keep the timed-out keys (those that had a recent write and are forbidding any gap from containing them) in a splay tree: this means that checking whether a gap `{firstKey, lastKey}` is eligible to be in the gapSet is as quick as `gapSet.leastGreaterThanOrEqualTo(firstKey) > lastKey`.
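To illustrate the check, here is a minimal sketch using a plain sorted array as a stand-in for the splay tree / SortedSet; `InvalidatingKeys` and `isGapEligible` are hypothetical names for illustration, not Arsenal APIs:

```ts
// Sorted collection of keys that recently received a write.
class InvalidatingKeys {
    private keys: string[] = [];

    // Insert a key (O(n) with an array; a splay tree or SortedSet is O(log n)).
    add(key: string): void {
        const i = this.lowerBound(key);
        if (this.keys[i] !== key) {
            this.keys.splice(i, 0, key);
        }
    }

    // Smallest stored key >= `key`, or null if none.
    leastGreaterThanOrEqualTo(key: string): string | null {
        const i = this.lowerBound(key);
        return i < this.keys.length ? this.keys[i] : null;
    }

    private lowerBound(key: string): number {
        let lo = 0, hi = this.keys.length;
        while (lo < hi) {
            const mid = (lo + hi) >> 1;
            if (this.keys[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}

// A gap [firstKey, lastKey] is eligible if no invalidating key falls inside it.
function isGapEligible(keys: InvalidatingKeys,
                       firstKey: string, lastKey: string): boolean {
    const k = keys.leastGreaterThanOrEqualTo(firstKey);
    return k === null || k > lastKey;
}
```

With K buffered keys, the eligibility check for one gap is then O(log K) instead of a scan over all keys, which is the trade-off analyzed below.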
This means that, for K invalidating operations and G gaps inserted (in the considered time frame):

- checking each incoming key against a splay tree of gaps costs `O(K*log(G))`, plus, for each gap, inserting it into the splay tree: `O(G*log(G))`
- checking each incoming gap against a splay tree of keys costs `O(G*log(K))`, plus, for each key, inserting it into the splay tree: `O(K*log(K))`

Assuming the O constants are similar, it's `O((K+G)*log(G))` against `O((K+G)*log(K))`, so a hedge towards your approach, since K >> G probably. In terms of memory, the comparison is the same: `O(G)` space against `O(K)` space, but I'm not sure this is an issue since we are mostly CPU-bound and `O(K)` will not be that large (<10k even with a very conservative time frame).

So all in all, let's go for your approach :+1:
@fredmnl thanks for your thorough analysis. Indeed, we could have a splay tree for the incoming keys to speed up the intersection checks before inserting a gap; it should be an easy change.
@fredmnl I ended up preferring the idea you suggested of using a SortedSet for storing the keys used for invalidation, because it provides stronger performance guarantees (the previous approach would slow down significantly as the number of keys to invalidate at once increased). So I made the changes in removeOverlappingGaps() in a new commit on the GapSet PR, and adapted this one as well.
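For illustration, here is a rough sketch of the kind of batch invalidation this enables; the `Gap` type, the sorted-array gap store, and this standalone `removeOverlappingGaps` signature are simplified stand-ins, not the real Arsenal function:

```ts
interface Gap { firstKey: string; lastKey: string; }

// Remove every gap that contains at least one of the given keys.
// Both `gaps` (sorted by firstKey, non-overlapping) and `keys` are assumed
// sorted, so a single merge-like pass over the two sequences is enough.
function removeOverlappingGaps(gaps: Gap[], keys: string[]): Gap[] {
    const kept: Gap[] = [];
    let ki = 0;
    for (const gap of gaps) {
        // Skip keys that fall before this gap starts.
        while (ki < keys.length && keys[ki] < gap.firstKey) {
            ki += 1;
        }
        // Keep the gap only if the next key (if any) lies beyond it.
        if (ki >= keys.length || keys[ki] > gap.lastKey) {
            kept.push(gap);
        }
    }
    return kept;
}
```

Because both the keys and the gaps are kept sorted, the whole batch is processed in one pass rather than checking every key against every gap.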
Great, glad I could help! Approving the commit :white_check_mark:
My role is to assist you with the merge of this pull request. Please type `@bert-e help` to get information on this process, or consult the user documentation.
Status report is not available.
The `Fix Version/s` in issue ARSN-391 contains:

- 7.70.24

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

- 7.70.24
- 8.1.123

Please check the `Fix Version/s` of ARSN-391, or the target branch of this pull request.
/approve
A conflict has been raised during the creation of integration branch `w/8.1/feature/ARSN-391-gapCache` with contents from `feature/ARSN-391-gapCache` and `development/8.1`.

I have not created the integration branch.

Here are the steps to resolve this conflict:
$ git fetch
$ git checkout -B w/8.1/feature/ARSN-391-gapCache origin/development/8.1
$ git merge origin/feature/ARSN-391-gapCache
$ # <intense conflict resolution>
$ git commit
$ git push -u origin w/8.1/feature/ARSN-391-gapCache
The following options are set: approve
The build for commit did not succeed in branch w/8.1/feature/ARSN-391-gapCache.
The following options are set: approve
The changeset has received all authorizations and has been added to the relevant queue(s). The queue(s) will be merged in the target development branch(es) as soon as builds have passed.
The changeset will be merged in:
:heavy_check_mark: development/7.70
:heavy_check_mark: development/8.1
The following branches will NOT be impacted:
development/6.4
development/7.10
development/7.4
There is no action required on your side. You will be notified here once the changeset has been merged. In the unlikely event that the changeset fails permanently on the queue, a member of the admin team will contact you to help resolve the matter.
IMPORTANT
Please do not attempt to modify this pull request.
If you need this pull request to be removed from the queue, please contact a member of the admin team now.
The following options are set: approve
I have successfully merged the changeset of this pull request into targeted development branches:
:heavy_check_mark: development/7.70
:heavy_check_mark: development/8.1
The following branches have NOT changed:
development/6.4
development/7.10
development/7.4
Please check the status of the associated issue ARSN-391.
Goodbye jonathan-gramain.
Introduce a new helper class `GapCache` that sits on top of a set of `GapSet` instances and delays the exposure of gaps by a configurable time, to guarantee atomicity with respect to invalidation from overlapping PUT/DELETE operations.

The way it is implemented is the following:

- three update sets are used, each containing a `GapSet` instance and a series of key update batches: `staging`, `frozen`, and `exposed`
- `staging` receives the new gaps from `setGap()` calls and the updates from `removeOverlappingGaps()`
- `lookupGap()` only returns gaps present in `exposed`
- every `exposureDelayMs` milliseconds, the following happens (see the sketch below):
  - the `frozen` gaps get invalidated by all key updates buffered in either the `staging` or the `frozen` update set
  - the remainder of the `frozen` gaps is merged into `exposed` (via internal calls to `exposed.setGap()`)
  - the `staging` update set becomes the new `frozen` update set (both the gaps and the key updates)
  - a new, empty `staging` update set is instantiated

This guarantees that any gap set via `setGap()` is only exposed after a minimum of `exposureDelayMs`, and a maximum of twice that time (plus extra needed processing time). Also, keys passed to `removeOverlappingGaps()` are kept in memory for at least `exposureDelayMs`, so they can invalidate new gaps that are created in this time frame.

Combined with the assurance that `setGap()` is never called more than `exposureDelayMs` after the listing process started (from a DB snapshot), this guarantees that all gaps not yet exposed have been invalidated by any overlapping PUT/DELETE operation, hence exposed gaps are still valid at the time they are exposed. They may still be invalidated thereafter by future calls to `removeOverlappingGaps()`.

The number of gaps that can be cached is bounded by the `maxGaps` attribute. The current strategy consists of simply not adding new gaps when this limit is reached, relying solely on `removeOverlappingGaps()` to make room for new gaps. In the future we could consider implementing an eviction mechanism to remove less-used gaps and/or gaps with smaller weights, but today the cost vs. benefit of doing this is unclear.
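For readers skimming the description above, here is a condensed TypeScript sketch of the staging/frozen/exposed rotation; it uses a simplified stand-in for `GapSet` (plain arrays, naive overlap checks) and omits `maxGaps` handling and error paths, so it only models the mechanics and is not the actual Arsenal implementation:

```ts
interface Gap { firstKey: string; lastKey: string; }

// Simplified stand-in for a GapSet: just an array with naive scans.
class SimpleGapSet {
    items: Gap[] = [];
    setGap(gap: Gap): void { this.items.push(gap); }
    lookupGap(key: string): Gap | null {
        return this.items.find(g => g.firstKey <= key && key <= g.lastKey) ?? null;
    }
    removeOverlappingGaps(keys: string[]): void {
        this.items = this.items.filter(
            g => !keys.some(k => g.firstKey <= k && k <= g.lastKey));
    }
}

// Sketch of the staging/frozen/exposed rotation described above.
class GapCacheSketch {
    private staging = { gaps: new SimpleGapSet(), updates: [] as string[] };
    private frozen = { gaps: new SimpleGapSet(), updates: [] as string[] };
    private exposed = new SimpleGapSet();
    private timer: ReturnType<typeof setInterval>;

    constructor(exposureDelayMs: number) {
        this.timer = setInterval(() => this.rotate(), exposureDelayMs);
    }

    // Stop the rotation timer, e.g. when shutting down the process.
    stop(): void { clearInterval(this.timer); }

    // New gaps land in `staging` and are not visible to lookupGap() yet.
    setGap(gap: Gap): void { this.staging.gaps.setGap(gap); }

    // Key updates are buffered in `staging`; gaps already exposed are
    // invalidated right away, as discussed in the review thread.
    removeOverlappingGaps(keys: string[]): void {
        this.staging.updates.push(...keys);
        this.exposed.removeOverlappingGaps(keys);
    }

    // Only `exposed` gaps are ever returned.
    lookupGap(key: string): Gap | null { return this.exposed.lookupGap(key); }

    private rotate(): void {
        // 1. Invalidate frozen gaps with all buffered updates (staging + frozen).
        const updates = [...this.staging.updates, ...this.frozen.updates];
        this.frozen.gaps.removeOverlappingGaps(updates);
        // 2. Merge the surviving frozen gaps into the exposed set.
        this.frozen.gaps.items.forEach(gap => this.exposed.setGap(gap));
        // 3. staging becomes the new frozen; start a fresh, empty staging set.
        this.frozen = this.staging;
        this.staging = { gaps: new SimpleGapSet(), updates: [] };
    }
}
```

In the real class, the temporary `GapSet` structures and the sorted set of invalidation keys discussed in the review above make the invalidation and merge steps cheaper than these naive scans.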