Open - emaxerrno opened this issue 3 years ago
We've discussed in the past a mode in which fsync is a configurable periodic event (e.g. fsync every 10 seconds) that allows a cluster to be tuned by bounding how old lost data might become. If memory acks are implemented in terms of a periodic fsync, then pure memory acks could be configured by choosing a sufficiently large timeout period.
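As a rough illustration of the idea (this is not Redpanda code; the class, method names, and the 10-second default are invented for the example), a periodic flusher bounds the worst-case loss window to the flush interval, and a very large interval approximates pure in-memory acks:

```python
import os
import threading

class PeriodicFlusher:
    """Toy sketch: fsync a log file on a fixed interval instead of on every write.

    Writes are "acknowledged" as soon as they reach the OS page cache; anything
    written after the last fsync can be lost on power failure, so the worst-case
    loss window is bounded by flush_interval_s.
    """

    def __init__(self, log_file, flush_interval_s: float = 10.0):
        self._log = log_file
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._flush_loop, args=(flush_interval_s,), daemon=True
        )
        self._thread.start()

    def append(self, record: bytes) -> None:
        # Buffer the record; durability comes later from the periodic flush.
        self._log.write(record)

    def _flush_loop(self, interval: float) -> None:
        while not self._stop.wait(interval):
            self._log.flush()
            os.fsync(self._log.fileno())

    def close(self) -> None:
        self._stop.set()
        self._thread.join()
        self._log.flush()
        os.fsync(self._log.fileno())
```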
> super low latency use cases

I'd love to see some baseline numbers where we comment out fsync and then run with `acks=all`.
So when I think of acks, I think of `acks=all` and `acks=1`, which is separate from this feature request. Would it make sense to call this "Deferred Log Flushing"? I only say this as `acks` in the Kafka world already means how many brokers are involved when you produce a message, not if they are in memory or not.
> So when I think of acks, I think of `acks=all` and `acks=1`, which is separate from this feature request. Would it make sense to call this "Deferred Log Flushing"? I only say this as `acks` in the Kafka world already means how many brokers are involved when you produce a message, not if they are in memory or not.
Sure, I reworded it to mem writes.
> super low latency use cases
> I'd love to see some baseline numbers where we comment out fsync and then run with `acks=all`.
That's a really good idea. My hunch is the tails are probably at the debounce window of ~4ms? We may be able to skip debouncing contextually.
> Sure, I reworded it to mem writes.
I know this is a nit, but when I think of memory-writes I think of the whole topic being in memory. This is more of a deferred write, IMO.
> super low latency use cases
> I'd love to see some baseline numbers where we comment out fsync and then run with `acks=all`.
> That's a really good idea. My hunch is the tails are probably at the debounce window of ~4ms? We may be able to skip debouncing contextually.
Two cases: so I think the baselines are really key.
I bet this, combined with @travisdowns' write coalescing work, would make a big difference.
Really like the idea of having this topic-wise.
@mattschumpert - could you please summarize the goals for this initiative (around performance and comparison with Kafka)? The eng team needs this to come up with estimates.
Executive Summary
We would like to acknowledge the write before syncing to disk, even in the presence of `acks=all`. Other Kafka-API implementations are subject to data loss by default unless explicitly configured for safety. Redpanda started with a diametrically opposed philosophy - "hardware is really fast, let's make it safe by default" - so during an `acks=all` produce we only acknowledge the write after calling `fsync()`.

There has been a rise of super low latency use cases in fintech that keep pushing Redpanda to new limits. We want to support in-memory writes for all upstream Kafka wire protocol acknowledgement levels, `acks={none, leader, quorum}` - that is, specifically, to not wait for `fsync()`.
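For reference, this is the client-side `acks` knob the summary refers to; a minimal kafka-python example (broker address and topic name are placeholders) showing that `acks` only controls how many brokers must acknowledge a produce, not whether the data has been fsynced:

```python
from kafka import KafkaProducer

# acks=0     -> "none":   fire and forget, no broker acknowledgement
# acks=1     -> "leader": the leader appends to its log and responds
# acks="all" -> "quorum": the leader also waits for the in-sync replicas
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
producer.send("example-topic", b"example-record").get(timeout=5)
producer.close()
```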
What is being proposed
Skip `fsync()` on writes for the combination of `acks={none, leader, quorum}`. For some cloud drives, IOPS is the actual bottleneck; this proposal gives developers more tools to tune for safety (our default) or super low latency.
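A very rough sketch of the proposed control flow (this is not Redpanda's actual write path; `log`, `topic_cfg`, and the property name are stand-ins based on the concepts in this RFC):

```python
async def replicate_and_ack(batch, log, topic_cfg) -> int:
    """Hypothetical append path: ack before fsync when the topic opts in."""
    offset = await log.append(batch)  # data is replicated and sits in the page cache
    if topic_cfg.get("redpanda.memory.write") == "true":
        # Proposed behaviour: acknowledge immediately and let a later
        # (periodic/background) flush persist the batch. A correlated crash
        # of all replicas can lose acknowledged-but-unflushed records.
        return offset
    await log.flush()  # current acks=all behaviour: fsync before acknowledging
    return offset
```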
Why (short reason)
Open up low latency use cases and cloud multi-tenancy use cases.
Impact
Enable new super low latency use cases (sustained ~1ms on non-saturated devices). I expect to have a tail of optimizations where we remove all sorts of 'debouncing' efforts that we do to pipeline batches to disk.
We note explicitly that these long-tail optimizations are out of the scope of this proposal.
Motivation
Why are we doing this?
We would like to enable new users onboarded to Redpanda who are comfortable with the potential for data loss during a coordinated failure, because they have specialized hardware setups or otherwise understand the tradeoffs. As a specific example, GCP disks are backed by battery-powered generators (backed up with gasoline) that ensure fsyncs happen in 32MB chunks to the underlying storage. This drives huge efficiencies for IOPS on cloud disks, which is a bottleneck for a system optimizing for data safety when volumes are large.
What use cases does it support?
What is the expected outcome?
Redpanda at memory speeds. More specifically, for a quorum-in-memory write the expectation is a little bit higher than `acks=leader` and a lot less than `acks=all`.
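One way to collect the baseline numbers asked for in the discussion above; a rough kafka-python harness (broker address, topic, and message size are placeholders) comparing synchronous produce latency across `acks` settings:

```python
import time
from kafka import KafkaProducer

def p99_produce_latency_s(acks, n: int = 1000, topic: str = "latency-test") -> float:
    """Send n records one at a time and return the p99 end-to-end ack latency."""
    producer = KafkaProducer(bootstrap_servers="localhost:9092", acks=acks)
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        producer.send(topic, b"x" * 128).get(timeout=5)  # block until acked
        samples.append(time.perf_counter() - start)
    producer.close()
    samples.sort()
    return samples[int(0.99 * len(samples))]

for acks in (0, 1, "all"):
    print(f"acks={acks}: p99={p99_produce_latency_s(acks) * 1000:.2f} ms")
```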
Guide-level explanation
How do we teach this?
We need to explain to programmers that when acknowledging an in-memory write, there is a potential for data loss due to correlated failures even in a replicated setting. For example, with a 10-second flush interval, a simultaneous power failure across all replicas could lose up to the last 10 seconds of acknowledged writes. We must also say that cloud vendors have optimized for this use case in particular. This content should be driven by engineering with precise technical content so that a developer or end user can actually understand it.
Introducing new named concepts (see the sketch below):
- `redpanda.memory.write: true`
- `--developer-mode`
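If this ships as a per-topic property, topic creation might look roughly like the following (sketch only; `redpanda.memory.write` is the name proposed above and could change, and the broker address, topic name, and counts are placeholders):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(
        name="ticks",
        num_partitions=3,
        replication_factor=3,
        # Proposed per-topic property from this RFC (does not exist yet):
        topic_configs={"redpanda.memory.write": "true"},
    )
])
```

Presumably the same property would also be settable via `rpk` at topic creation time and listed by `rpk topic describe`, per the interaction notes below.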
Reference-level explanation
@rystsov : TODO
Interaction with other features
- `rpk` needs to be able to specify this at topic creation time
- `kubernetes` does not need to worry about this, as it is entirely in the Kafka API
- `global-configuration` - we are putting this out of scope, but it would be great to make this a cluster-wide default when @jcsp's new configuration object lands in prod
- `cloud` - for the shared multi-tenant cloud this should be the default, especially for multi-AZ deployments
- `rpk` must be able to list this property when doing `rpk topic describe`
Telemetry & Observability
@rystsov: TODO
Corner cases dissected by example.
@rystsov : TODO
Detailed design - What needs to change to get there
@rystsov: TODO
The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work.
Detailed design - How it works
@rystsov: TODO
Describe the overview of the design, and then explain each part of the implementation in enough detail that reviewers will be able to identify any missing pieces. Make sure to call out interactions with other active RFCs.
Drawbacks
This adds complexity when debugging a production issue. Tooling must be enabled (if not already there) in `rpk` to be able to actually understand the customer impact.

Rationale and Alternatives
@rystsov: TODO
This section is extremely important. See the README file for details.
Unresolved questions
@rystsov: to finish
- `global config` - not yet scoped for this work (@jcsp)
- `telemetry` - has not yet been resolved

What parts of the design do you expect to resolve through the RFC process before this gets merged?
What parts of the design do you expect to resolve through the implementation of this feature before stabilization?
What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC?
References
Closes #1836
JIRA Link: CORE-748