scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

Pre-warm cache on boot #14953

Open fee-mendes opened 1 year ago

fee-mendes commented 1 year ago

This is orthogonal to https://github.com/scylladb/scylladb/issues/14927.

With HWLB, we already propagate metrics via gossip to prevent the coordinator from hammering nodes with a cold cache and dragging down latencies. However, this is not effective if the workload in question uses a CL of ONE (as there is nothing to load balance, and clients are unaware of HWLB metrics), and for some workloads (particularly on bigger machines) warming up the cache can take a considerable amount of time. Node and access imbalances, for example, add further uncertainty over the time it may take for a cold node cache to become "warm".

I suggest that we consider improving HWLB (2.0?) in such a way that we readily propagate hot cache content from existing replicas on boot. The flow would look roughly like this:

  1. Node boots (cold cache)
  2. The other replicas of the data that node owns push a given % of their recently used (RU) cache entries to it
  3. Node starts serving now with a warmer (though not necessarily full) cache
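To illustrate, the push in step 2 could look something like this toy Python sketch (the function name, the LRU-as-OrderedDict model, and the fraction parameter are all hypothetical illustrations, not Scylla internals):

```python
from collections import OrderedDict

def select_entries_to_push(cache, fraction):
    """Pick the hottest `fraction` of cache entries to push to a booting peer.

    `cache` is an OrderedDict used as an LRU: the last item is the most
    recently used (as with OrderedDict.move_to_end on access). The coldest
    (LRU) entries are skipped, since they would likely be evicted on the
    new node anyway.
    """
    n = int(len(cache) * fraction)
    if n == 0:
        return []
    hottest = list(cache.items())[-n:]   # last n items = most recently used
    return list(reversed(hottest))       # MRU first

# toy usage: a 10-entry cache where pk9 was accessed most recently
cache = OrderedDict((f"pk{i}", b"row") for i in range(10))
to_push = select_entries_to_push(cache, 0.3)   # hottest 30%, MRU first
```

A warm replica would then ship the selected entries to the booting node over RPC.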

This approach also prevents the following situations:

  1. One runs a rolling upgrade too fast and effectively destroys the entire cluster's cache contents
  2. One runs a rolling upgrade too slow, having to "wait" and babysit each node (watching monitoring) until its cache gets warmer

As a bonus, everything happens under the hood; there's no need for any API/nodetool/whatever.
mykaul commented 1 year ago

Since it's usually an issue on rolling restarts, @eliransin suggested something like hibernating the cache to disk and resuming it when restarting.

fee-mendes commented 1 year ago

I am not against a save-to-disk approach, but it would depend on the shutdown (or hibernate-to-disk) flow never hanging; otherwise we would just start the node back up with a cold cache and hit the situation again. It also assumes that, by the moment we boot the node back up, the cache content is still relevant, when it might not be (the assumption being that this is only effective for the duration of a rolling restart, i.e. 10 minutes at worst?).

That said, pulling from other replicas does seem more effective. And in any case, either the shutdown (save-to-disk) or the startup (pull) approach would delay shutdown/start-up anyway. :)

mykaul commented 1 year ago

Pulling from other replicas will impact those replicas, which are already trying to handle the extra load... The assumption is that for a regular rolling restart due to a Scylla upgrade, the cache is still mostly relevant. If not, we are back to the regular HWLB behavior.

fee-mendes commented 1 year ago

I don't see how it impacts other replicas, to be quite honest. A replace operation causes much more impact in that regard. Similarly, replaying hints could overwhelm the recently booted node; hints are also sent by the replicas, and have impacted latencies severely in the past.

The warm nodes don't have to stream their relevant cached data as aggressively as we do for hints, and they don't even need to stream 100% of their contents, given that the LRU items are likely to be evicted anyway. There's also no notion of an "extended outage": the cold node would be serving requests as a replica (but not as a coordinator) for the duration of the process, which would also help balance the load and thus improve the existing HWLB heuristics.

bhalevy commented 11 months ago

Maybe we can use the speculative reads mechanism to warm up the cache on restarted nodes. Today we only send them the read as a speculative retry if any of the selected replicas times out, but we could send those reads speculatively and reply once the CL is satisfied. (We should probably just wait for the speculative retries, or throttle them in some way, so as not to exhaust resources and/or cause a use-after-free on shutdown.)
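As a toy model of this idea (the `WarmupBudget` throttle, the `Replica` class, and `read_with_warmup` are all illustrative assumptions, not Scylla's actual read path):

```python
import threading

class WarmupBudget:
    """Throttle for in-flight warm-up reads, so they can't exhaust resources
    (the throttle itself is an assumed mechanism, not Scylla's)."""
    def __init__(self, limit):
        self._sem = threading.Semaphore(limit)
    def try_acquire(self):
        return self._sem.acquire(blocking=False)
    def release(self):
        self._sem.release()

class Replica:
    def __init__(self, name, data):
        self.name, self.data = name, data
        self.cache = set()              # keys this replica has in cache
    def read(self, key):
        self.cache.add(key)             # serving a read warms the cache
        return self.data[key]

def read_with_warmup(key, warm_replicas, cold_replica, cl, budget):
    """Reply as soon as `cl` warm replicas have answered; the extra read to
    the cold replica happens only if the budget allows, purely to warm it."""
    answers = [r.read(key) for r in warm_replicas[:cl]]  # satisfies the CL
    if budget.try_acquire():
        try:
            cold_replica.read(key)      # result is discarded; cache is warmed
        finally:
            budget.release()
    return answers[0]
```

The key property is that client latency is still governed by the warm replicas, while the cold replica receives a bounded trickle of extra reads.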

bhalevy commented 11 months ago

Cc @tgrabiec

avikivity commented 11 months ago

You can't copy cache contents between nodes. There's an invariant that on a given node a cache entry exactly matches the sstables on that node. If you copy the cache contents, but not the sstable data, you violate the invariant. The result can be data resurrection.

This could be recovered by just sending the primary keys and reading the rows, but it would take quite a long time.
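A minimal sketch of that recovery path, assuming a stand-in `read_local` for the node's own read path (all names here are hypothetical):

```python
def warm_from_keys(keys, read_local, cache):
    """Populate `cache` by re-reading each shipped primary key through the
    node's own read path, so every cache entry still matches the local
    sstables. Keys that no longer exist locally are skipped, which is what
    avoids the data-resurrection problem of copying cache values directly."""
    for pk in keys:
        row = read_local(pk)
        if row is not None:
            cache[pk] = row
    return cache
```

Only keys cross the network; the rows always come from the receiving node's own sstables, preserving the invariant.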

nyh commented 11 months ago

This could be recovered by just sending the primary keys and reading the rows, but it would take quite a long time.

It will take time, but maybe not a very long time (it's all relative, of course)? The key list can be sent in token order, and then the data can be read with a contiguous scan of the sstables. It takes time, sure, but not as long as waiting for random-access reads to retrieve this data, and even more so because we deliberately do fewer of those slow reads (because of HWLB), which normally makes the cache's recovery even slower than full-steam random-access reads.
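A sketch of why token order helps: with both the shipped key list and the sstable sorted by token, the fill becomes a single two-pointer merge pass rather than one random seek per key (toy Python, with tokens modeled as plain integers rather than real partitioner output):

```python
def scan_fill(wanted_tokens, sstable_rows):
    """Fill the cache in one forward pass.

    `wanted_tokens` is the shipped key list as a sorted list of tokens;
    `sstable_rows` yields (token, pk, row) tuples in token order, as an
    sstable scan would. Both cursors only ever move forward."""
    cache, i = {}, 0
    for t, pk, row in sstable_rows:
        while i < len(wanted_tokens) and wanted_tokens[i] < t:
            i += 1                       # wanted key is absent locally; skip it
        if i < len(wanted_tokens) and wanted_tokens[i] == t:
            cache[pk] = row
            i += 1
    return cache
```

The scan touches each sstable position once, so the cost is dominated by sequential I/O instead of N random seeks.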

bhalevy commented 11 months ago

For large partitions it doesn't make sense to prefetch all rows. That's why we have a row cache rather than a partition cache.

avikivity commented 11 months ago

For fetching N keys it will take as long as the N cache misses saved would have taken. Of course, we might not save N cache misses but fewer. We might fetch before the replica is up, but the goal is to move out of degraded mode quickly.

Note Cassandra saves its caches on shutdown, but ours are too large.

tzach commented 11 months ago

Note Cassandra saves its caches on shutdown, but ours are too large.

Maybe a tiered cache? Scylla could save only a smaller, hot cache. However, that adds much complexity.

mykaul commented 11 months ago

Note Cassandra saves its caches on shutdown, but ours are too large.

Can we zstd it?

bhalevy commented 11 months ago

Note Cassandra saves its caches on shutdown, but ours are too large.

Can we zstd it?

In any case, if the node has been down for longer than a quick restart, the cache contents from the previous incarnation may be irrelevant by the time the node comes back up. Aside from speculative/opportunistic prefetch, with tablets we can provide a way for a tablet to query the keys in other tablet replicas' caches to pre-fill its cache after restart. Using the LRU order can also help pre-fill first the keys that are most likely to be re-accessed.
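A toy sketch of that tablet-level pre-fill, assuming a hypothetical `tablet_prefill` that walks a warm replica's cached keys in MRU-first order and re-reads them through the restarted node's own read path (so the cache/sstable invariant holds):

```python
from collections import OrderedDict

def tablet_prefill(warm_cache, read_local, budget):
    """Pre-fill a restarted tablet replica's cache.

    `warm_cache` models a warm replica's LRU as an OrderedDict (last item =
    most recently used); `read_local(pk)` stands in for the restarted node's
    own read path; `budget` caps how many entries we pre-fill."""
    cache = OrderedDict()
    for pk in reversed(warm_cache):      # MRU first: most likely re-accessed
        if len(cache) >= budget:
            break
        row = read_local(pk)
        if row is not None:              # key may no longer exist locally
            cache[pk] = row
    return cache
```

Ordering by recency means that even a small budget captures the entries most likely to be hit right after the restart.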

nyh commented 11 months ago

For fetching N keys it will take as long as the N cache misses saved would have taken.

This is true in a big-O sense, but I don't think it's accurate if you measure wall-clock, because:

  1. We could read the N keys at 100% utilization, whereas the regular requests will probably drive us at less than 100% utilization.
  2. We could read the N keys by scanning the sstables. Whether this is better or worse than random access depends, I guess, on how many keys we need to read.
  3. The server code, CQL, etc., don't need to get involved in reading into the cache. (Of course, those requests will still happen later, but by that time the cache will be hot.)

I agree that I don't know how impressive the saving will be.