Open - emaxerrno opened this issue 3 years ago
We've discussed in the past a mode in which fsync is a configurable periodic event (e.g. fsync every 10 seconds) that allows a cluster to be tuned by bounding how old lost data might become. If memory acks are implemented in terms of a periodic fsync, then pure memory acks could be configured by choosing a sufficiently large timeout period.
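As a rough illustration of the idea (this is not Redpanda code; the class, method names, and the 10-second default are invented for the example), a periodic flusher bounds the worst-case loss window to the flush interval, and a very large interval approximates pure in-memory acks:

```python
import os
import threading

class PeriodicFlusher:
    """Toy sketch: fsync a log file on a fixed interval instead of on every write.

    Writes are "acknowledged" as soon as they reach the OS page cache; anything
    written after the last fsync can be lost on power failure, so the worst-case
    loss window is bounded by flush_interval_s.
    """

    def __init__(self, log_file, flush_interval_s: float = 10.0):
        self._log = log_file
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._flush_loop, args=(flush_interval_s,), daemon=True
        )
        self._thread.start()

    def append(self, record: bytes) -> None:
        # Buffer the record; durability comes later from the periodic flush.
        self._log.write(record)

    def _flush_loop(self, interval: float) -> None:
        while not self._stop.wait(interval):
            self._log.flush()
            os.fsync(self._log.fileno())

    def close(self) -> None:
        self._stop.set()
        self._thread.join()
        self._log.flush()
        os.fsync(self._log.fileno())
```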
> super low latency use cases

I'd love to see some baseline numbers where we comment out fsync and then run with `acks=all`.
So when I think of acks, I think of `acks=all` and `acks=1`, which is separate from this feature request. Would it make sense to call this "Deferred Log Flushing"? I only say this as `acks` in the Kafka world already means how many brokers are involved when you produce a message, not if they are in memory or not.
> So when I think of acks, I think of `acks=all` and `acks=1`, which is separate from this feature request. Would it make sense to call this "Deferred Log Flushing"? I only say this as `acks` in the Kafka world already means how many brokers are involved when you produce a message, not if they are in memory or not.
Sure, I reworded it to mem writes.
> super low latency use cases
> I'd love to see some baseline numbers where we comment out fsync and then run with `acks=all`.
That's a really good idea. My hunch is the tails are probably at the debounce window of ~4ms? We may be able to skip debouncing contextually.
> Sure, I reworded it to mem writes.
I know this is a nit, but when I think of memory-writes I think of the whole topic being in memory. This is more of a deferred write, IMO.
> super low latency use cases
> I'd love to see some baseline numbers where we comment out fsync and then run with `acks=all`.
> That's a really good idea. My hunch is the tails are probably at the debounce window of ~4ms? We may be able to skip debouncing contextually.
Two cases: so I think the baselines are really key.
I bet this, combined with @travisdowns' write coalescing work, would make a big difference.
Really like the idea of having this topic-wise.
@mattschumpert - could you please summarize the goals for this initiative (around performance and comparison with Kafka)? The eng team needs this to come up with estimates.
Executive Summary
We would like to acknowledge the write before syncing to disk, even in the presence of `acks=all`. Other Kafka-API implementations are subject to data loss by default unless explicitly configured for safety. Redpanda started with a diametrically opposed philosophy - "hardware is really fast, let's make it safe by default" - so during an `acks=all` produce we only acknowledge the write after calling `fsync()`.

There has been a rise of super low latency use cases in fintech that keep pushing Redpanda to new limits. We want to support in-memory writes for all upstream Kafka wire protocol acknowledgement levels, `acks={none, leader, quorum}` - that is, specifically, to not wait for `fsync()`.
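For reference, this is the client-side `acks` knob the summary refers to; a minimal kafka-python example (broker address and topic name are placeholders) showing that `acks` only controls how many brokers must acknowledge a produce, not whether the data has been fsynced:

```python
from kafka import KafkaProducer

# acks=0     -> "none":   fire and forget, no broker acknowledgement
# acks=1     -> "leader": the leader appends to its log and responds
# acks="all" -> "quorum": the leader also waits for the in-sync replicas
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
producer.send("example-topic", b"example-record").get(timeout=5)
producer.close()
```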
What is being proposed
Skip `fsync()` on writes for the combination of `acks={none, leader, quorum}`. For some cloud drives, IOPS is the actual bottleneck; this proposal gives developers more tools to tune for safety (our default) or super low latency.
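A very rough sketch of the proposed control flow (this is not Redpanda's actual write path; `log`, `topic_cfg`, and the property name are stand-ins based on the concepts in this RFC):

```python
async def replicate_and_ack(batch, log, topic_cfg) -> int:
    """Hypothetical append path: ack before fsync when the topic opts in."""
    offset = await log.append(batch)  # data is replicated and sits in the page cache
    if topic_cfg.get("redpanda.memory.write") == "true":
        # Proposed behaviour: acknowledge immediately and let a later
        # (periodic/background) flush persist the batch. A correlated crash
        # of all replicas can lose acknowledged-but-unflushed records.
        return offset
    await log.flush()  # current acks=all behaviour: fsync before acknowledging
    return offset
```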
Why (short reason)
Open up low latency use cases and cloud multi-tenancy use cases.
Impact
Enable new super low latency use cases (sustained ~1ms on non-saturated devices). I expect to have a tail of optimizations where we remove all sorts of 'debouncing' efforts that we do to pipeline batches to disk.
We note explicitly that these long-tail optimizations are out of the scope of this proposal.
Motivation
Why are we doing this?
We would like to enable new users onboarded to Redpanda who are comfortable with the potential for data loss during a coordinated failure, because they have specialized hardware setups or otherwise understand the tradeoffs. As a specific example, GCP disks are backed by battery-powered generators (backed up with gasoline) that ensure fsyncs happen in 32MB chunks to the underlying storage. This drives huge efficiencies for IOPS on cloud disks, which is a bottleneck for a system optimizing for data safety when volumes are large.
What use cases does it support?
What is the expected outcome?
Redpanda at memory speeds. More specifically, for a quorum-in-memory write the expectation is a little bit higher than `acks=leader` and a lot less than `acks=all`.
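One way to collect the baseline numbers asked for in the discussion above; a rough kafka-python harness (broker address, topic, and message size are placeholders) comparing synchronous produce latency across `acks` settings:

```python
import time
from kafka import KafkaProducer

def p99_produce_latency_s(acks, n: int = 1000, topic: str = "latency-test") -> float:
    """Send n records one at a time and return the p99 end-to-end ack latency."""
    producer = KafkaProducer(bootstrap_servers="localhost:9092", acks=acks)
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        producer.send(topic, b"x" * 128).get(timeout=5)  # block until acked
        samples.append(time.perf_counter() - start)
    producer.close()
    samples.sort()
    return samples[int(0.99 * len(samples))]

for acks in (0, 1, "all"):
    print(f"acks={acks}: p99={p99_produce_latency_s(acks) * 1000:.2f} ms")
```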
Guide-level explanation
How do we teach this?
We need to explain to programmers that when acknowledging an in-memory write, there is a potential for data loss due to correlated failures even in a replicated setting. For example, with a 10-second flush interval, a simultaneous power failure across all replicas could lose up to the last 10 seconds of acknowledged writes. We must also say that cloud vendors have optimized for this use case in particular. This content should be driven by engineering with precise technical content so that a developer or end user can actually understand it.
Introducing new named concepts (see the sketch below):
- `redpanda.memory.write: true`
- `--developer-mode`
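If this ships as a per-topic property, topic creation might look roughly like the following (sketch only; `redpanda.memory.write` is the name proposed above and could change, and the broker address, topic name, and counts are placeholders):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(
        name="ticks",
        num_partitions=3,
        replication_factor=3,
        # Proposed per-topic property from this RFC (does not exist yet):
        topic_configs={"redpanda.memory.write": "true"},
    )
])
```

Presumably the same property would also be settable via `rpk` at topic creation time and listed by `rpk topic describe`, per the interaction notes below.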
Reference-level explanation
@rystsov : TODO
Interaction with other features
- `rpk` needs to be able to specify this at topic creation time
- `kubernetes` does not need to worry about this, as it is entirely in the Kafka API
- `global-configuration` - we are putting this out of scope, but it would be great to make this a cluster-wide default when @jcsp's new configuration object lands in prod
- `cloud` - for the shared multi-tenant cloud this should be the default, especially for multi-AZ deployments
- `rpk` must be able to list this property when doing `rpk topic describe`
Telemetry & Observability
@rystsov: TODO
Corner cases dissected by example.
@rystsov : TODO
Detailed design - What needs to change to get there
@rystsov: TODO
The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work.
Detailed design - How it works
@rystsov: TODO
Describe the overview of the design, and then explain each part of the implementation in enough detail that reviewers will be able to identify any missing pieces. Make sure to call out interactions with other active RFCs.
Drawbacks
This adds complexity when debugging a production issue. Tooling must be enabled (if not already there) in `rpk` to be able to actually understand the customer impact.

Rationale and Alternatives
@rystsov: TODO
This section is extremely important. See the README file for details.
Unresolved questions
@rystsov: to finish
- `global config` - not yet scoped for this work (@jcsp)
- `telemetry` - has not yet been resolved

What parts of the design do you expect to resolve through the RFC process before this gets merged?
What parts of the design do you expect to resolve through the implementation of this feature before stabilization?
What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC?
References
Closes #1836
JIRA Link: CORE-748