Is your feature request related to a problem? Please describe.
Static replication can consume too much memory.
Currently, NR/CNR set the replication factor at data-structure instantiation and do not allow it to change afterwards. This is sub-optimal: as the workload changes, a program might want to invest less memory in replication (at the cost of read performance) and use that memory for something else instead.
Describe the solution you'd like
Varying replicas
[Random thoughts, in no particular order, on how this should be implemented.]
Remove a replica:

- Get the combiner lock on the replica (no more readers).
- Make sure any future operations from threads registered with the removed replica go to another replica from now on.
  - This might need an indirection table in the NodeReplicated<> struct.
  - We also need to make sure that the Replica<> struct uses the right Context<>, since the thread IDs those contexts use are "replica-local".
    - Maybe the right way to do this is to allocate all Contexts in the NodeReplicated<> struct in one flat array, then hand an Iterator<Context> to the replicas which yields the right contexts based on the current thread assignment (see the sketch after this list).
  - Alternative: think of it as thread re-registration (adding/removing a thread to/from a replica).
- "Deregister" the replica from the log.
- Deallocate the data-structure.
  - Maybe: drop the Replica struct itself (seems harder, as reader threads can be accessing its members simultaneously); maybe it's actually easier to keep the Replica struct in case we ever want to add it back?
    - Keeping it would mean we need to know the maximum number of replicas upfront; easy to just set it to #nodes (it doesn't make sense to have more in most cases)…
    - For example, a thread might notice that the replica is behind while we're in the process of dropping it and call sync on it (sync will try to acquire the combiner lock; if that happens after the replica is dropped, it should probably just return).
- There might be outstanding operations in the threads' context buffers (I think this is fine: we just keep them there and apply them on the new replica we're redirected to).
- We'll have cases where a thread is no longer local to its replica's NUMA node but might become the combiner and allocate on behalf of the data-structure. Ideally these allocations should be NUMA-local.
  - We can use the existing AffinityChange interface to notify threads when they have to change affinity.
- As an optimization: if only one replica is left, we can avoid using the log until we have more than one replica again.
  - Even more of an optimization: if only one replica is left with a single thread registered, we can route execute and execute_mut to Dispatch directly; no need for flat combining or the log…
    - Nice, because then we're back to a single-threaded DS with no atomics on the path.
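To make the indirection-table and fast-path ideas above concrete, here is a minimal sketch. It is a toy model, not the actual NR types: NodeReplicatedModel, Ctx, and the routing map are made-up stand-ins. It only shows the flat context array owned by the top-level struct, a per-replica context iterator, re-routing threads on removal, and the single-replica/single-thread fast-path check.

```rust
use std::collections::HashMap;

type ThreadId = usize;
type ReplicaId = usize;

/// Stand-in for the real per-thread Context<>: outstanding operations stay
/// here when a thread is re-routed and get applied on the new replica.
#[derive(Default)]
struct Ctx {
    pending: Vec<u64>,
}

/// Toy top-level struct holding all contexts in one flat array plus an
/// indirection table from thread id to the replica it currently uses.
struct NodeReplicatedModel {
    active: Vec<ReplicaId>,                 // currently active replicas
    contexts: Vec<Ctx>,                     // flat array, indexed by thread id
    routing: HashMap<ThreadId, ReplicaId>,  // thread -> replica indirection
}

impl NodeReplicatedModel {
    /// Hand a replica exactly the contexts of the threads currently routed
    /// to it (a "replica-local" view over the flat array).
    fn contexts_for(&self, rid: ReplicaId) -> impl Iterator<Item = &Ctx> {
        self.contexts
            .iter()
            .enumerate()
            .filter(move |(tid, _)| self.routing.get(tid) == Some(&rid))
            .map(|(_, ctx)| ctx)
    }

    /// Re-route every thread registered with `from` to `to`; pending
    /// operations stay in the threads' contexts.
    fn reroute(&mut self, from: ReplicaId, to: ReplicaId) {
        for rid in self.routing.values_mut() {
            if *rid == from {
                *rid = to;
            }
        }
        self.active.retain(|r| *r != from);
    }

    /// One replica with one registered thread: execute/execute_mut could go
    /// straight to the data structure, skipping flat combining and the log.
    fn single_threaded_fast_path(&self) -> bool {
        self.active.len() == 1 && self.routing.len() == 1
    }
}
```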
Add a new replica:

- We need to get a quiescent copy of the DS from some existing replica in the system.
  - Should be fine if we hold the combiner lock.
  - We need to remember where in the log that replica was when we made the copy.
  - We need to make sure we copy into memory with the right NUMA affinity.
  - We already require Clone on the DS.
  - Copying the DS might be very expensive (which is fine as a first version).
    - One option: while copying, keep a single DS and share it between the two replicas.
      - Maybe: log operations again between the two shared replicas.
    - We could use copy-on-write for the data-structure (e.g., Cow<> in Rust).
      - It might have to be copy-on-read instead (lazy replication)?
        - Harder, if not impossible, because we don't have a mutable reference on reads.
- We need to register the replica with the log.
  - And set our ctail to the same index as the replica we copied from.
  - We should also ensure there is no wrap-around in the log in the meantime; e.g., once we set the ctail, the log needs to wait for the replica we're in the process of adding (and respect its ctail when deciding when it's safe to overwrite log entries).
- Dynamic membership (removing and adding replicas) is a bit tricky because the log is not designed for it:
  - It has a single atomic next field that tracks registration.
  - The next field is used to find the minimum completed tail (by iterating over 0..next).
  - Such functions now need an additional check for whether a replica is active or not.
  - Maybe keep an atomic bitmask in addition to next? (See the sketch after this list.)
  - We need to figure out when it's safe to change this mask, since almost all log functions are fully concurrent and the code needs to be vetted for the new assumption that a replica might disappear/appear at any time.
- We need to route certain threads to the new replica.
  - One question is whether we also allow registering threads dynamically, or keep it as is and expect all threads to register up-front.
    - I think we can assume no new threads; then it's just a matter of deciding which ones to re-route to the new replica.
  - Again, we need to make sure the replica has the right thread contexts.
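A rough sketch of the add path plus the "active replicas" bitmask idea for the log. These are toy types: Log, Replica, and every field name here are hypothetical, not the real node-replication internals. The point is just the ordering: quiesce the source under the combiner lock, remember its ctail, clone, and register with the log before releasing the lock so the log can be made to respect the new replica's position.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Mutex;

/// Toy log: the existing registration counter ("next") plus a new bitmask
/// marking which registered replicas are currently active. Functions that
/// scan 0..next for the minimum completed tail would skip inactive bits.
struct Log {
    next: AtomicUsize,
    active_mask: AtomicU64,
}

impl Log {
    fn register(&self) -> usize {
        let rid = self.next.fetch_add(1, Ordering::SeqCst);
        self.active_mask.fetch_or(1 << rid, Ordering::SeqCst);
        rid
    }
    fn deregister(&self, rid: usize) {
        self.active_mask.fetch_and(!(1u64 << rid), Ordering::SeqCst);
    }
    fn is_active(&self, rid: usize) -> bool {
        (self.active_mask.load(Ordering::SeqCst) & (1 << rid)) != 0
    }
}

/// Toy replica: the Mutex stands in for the combiner lock; ctail records how
/// far in the log this replica has applied.
struct Replica<D: Clone> {
    data: Mutex<D>,
    ctail: AtomicUsize,
}

/// Clone `src` into a new replica. Holding the "combiner lock" keeps the
/// source quiescent and its ctail stable; registering before the lock is
/// released is meant to keep the log from wrapping past the snapshot's ctail
/// (the actual wait/GC logic is elided). NUMA placement of the clone is also
/// elided; see the AffinityChange point above.
fn add_replica<D: Clone>(log: &Log, src: &Replica<D>) -> Replica<D> {
    let guard = src.data.lock().unwrap();
    let ctail = src.ctail.load(Ordering::SeqCst);
    let snapshot: D = (*guard).clone();
    let _rid = log.register();
    drop(guard);

    Replica {
        data: Mutex::new(snapshot),
        ctail: AtomicUsize::new(ctail),
    }
}
```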
API pseudo-code: adding and removing replicas

```
NodeReplicated {
    max_replicas: 4,              // #NUMA nodes
    active: Vec<usize>,           // currently active replicas
    routing: HashMap<tid, rid>,   // maps threads to replicas
}

fn add_replica(rid: usize) -> Result {
    assert(rid < max_replicas);
    assert(!active.contains(rid), "replica must not already be active");
    // Possible thread-routing policy: migrate tids that were originally
    // registered to `rid` back to executing against `rid`.
    ensures(active.contains(rid));
}

fn remove_replica(rid: usize, new_tid_affinity: usize) {
    assert(rid < max_replicas);
    assert(new_tid_affinity < max_replicas);
    assert(active.contains(rid));
    assert(active.contains(new_tid_affinity));
    // Route all threads registered with `rid` to `new_tid_affinity`.
    // Should it be possible to do a more complex assignment? Then maybe we
    // can provide a function that implements a routing policy and assigns
    // the threads, or take a HashMap with tid -> new_rid mappings plus a default.
    ensures(!active.contains(rid));
    ensures(active.contains(new_tid_affinity));
    ensures(!routing.values.contains(rid));
    ensures(routing.len == old(routing).len);
}
```
How do we show it works
NR micro-benchmarks: measure memory consumption and read throughput while varying the number of replicas over time (see the sketch below).
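One possible shape for such a micro-benchmark, assuming the hypothetical add_replica/remove_replica API from the pseudo-code above; the read closure is a placeholder, and the memory probe reads /proc/self/statm, so it is Linux-only:

```rust
use std::time::{Duration, Instant};

/// Resident set size in pages (Linux-only; second field of /proc/self/statm).
fn resident_pages() -> usize {
    std::fs::read_to_string("/proc/self/statm")
        .ok()
        .and_then(|s| s.split_whitespace().nth(1)?.parse().ok())
        .unwrap_or(0)
}

/// Run read-only operations for a fixed window and report throughput and
/// memory, so phases with different replica counts can be compared.
fn run_phase(label: &str, read_op: impl Fn() -> u64) {
    let start = Instant::now();
    let mut ops = 0u64;
    while start.elapsed() < Duration::from_secs(5) {
        ops += read_op();
    }
    println!(
        "{label}: {:.0} reads/s, {} resident pages",
        ops as f64 / start.elapsed().as_secs_f64(),
        resident_pages()
    );
}

fn main() {
    // Placeholder read closure; in the real benchmark this would call
    // execute() against the NR instance from the registered threads.
    run_phase("4 replicas", || 1);
    // nr.remove_replica(3, 0); nr.remove_replica(2, 1);   // hypothetical API
    run_phase("2 replicas", || 1);
    // nr.add_replica(2); nr.add_replica(3);               // hypothetical API
    run_phase("4 replicas again", || 1);
}
```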
Integration with network and IO
(This is mostly future work, to connect IO and the network once we have these mechanisms in place.)

- Socket for each replica (a rough sketch follows below):
  - Spin up a new receive path each time we add a new replica.
  - Route requests to the right replica.
- Similar framework to NR for the network:
  - Seastar: shared-nothing, one DPDK queue per core.
  - Similar, but optimized for request ingress.
- Shuffle from IO to NR:
  - Becomes a bottleneck if the NIC is not smart about routing.
  - FLOEM/FlexNIC from Antoine.
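For the socket-per-replica idea, here is a very rough sketch: one listener thread ("receive path") per active replica, each handling the requests destined for its own replica. All names and ports are made up, and the part that would actually decode and execute a request against the replica is stubbed out.

```rust
use std::io::Read;
use std::net::TcpListener;
use std::thread;

/// Spawn the receive path for replica `rid`: accept connections on its own
/// port and (in a real system) execute the decoded requests against `rid`.
fn spawn_receive_path(rid: usize, port: u16) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let listener = TcpListener::bind(("0.0.0.0", port)).expect("bind");
        for stream in listener.incoming() {
            let mut stream = match stream {
                Ok(s) => s,
                Err(_) => continue,
            };
            let mut buf = [0u8; 128];
            if let Ok(n) = stream.read(&mut buf) {
                // Stub: decode `buf[..n]` and execute it against replica `rid`.
                println!("replica {rid}: got {n}-byte request");
            }
        }
    })
}

fn main() {
    // Adding a replica would spawn a new receive path; removing one would
    // shut its listener down (shutdown handling omitted in this sketch).
    let handles: Vec<_> = (0..2usize)
        .map(|rid| spawn_receive_path(rid, 9000 + rid as u16))
        .collect();
    for h in handles {
        let _ = h.join();
    }
}
```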
Describe alternatives you've considered
Alternative approaches
We could stop the world, delete the whole NR instance, extract the DS from the replicas and reconstruct a new NR instance
Additional context
No response