redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

In-memory-writes #2547

Open emaxerrno opened 3 years ago

emaxerrno commented 3 years ago

Executive Summary

We would like to acknowledge the write before syncing to disk, even in the presence of acks=all. Other Kafka-API implementations are subject to data loss by default unless explicitly configured for safety. Redpanda started with a diametrically opposed philosophy - "hardware is really fast, let's make it safe by default". So with acks=all we only acknowledge the write after calling fsync().

There has been a rise of super low latency use cases in fintech that keep pushing Redpanda to new limits. We want to support in-memory writes for all upstream Kafka wire protocol acks={none, leader, quorum}.

That is, specifically, to not wait for fsync().

What is being proposed

Skip fsync() on writes for the combination of acks={none, leader, quorum}

For some cloud drives, IOPS is the actual bottleneck. This proposal gives developers more tools to tune for safety (our default) or super low latency.
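To make the tradeoff concrete, here is a minimal Python sketch (not Redpanda's actual C++/Seastar storage engine, just an illustration) of the two acknowledgement paths: acking only after fsync() versus acking once the write has reached the OS page cache.

```python
import os
import tempfile

def append(fd: int, record: bytes, fsync_before_ack: bool) -> None:
    # The write lands in the OS page cache either way; only the
    # fsync guarantees it reached stable storage before we ack.
    os.write(fd, record)
    if fsync_before_ack:
        os.fsync(fd)
    # returning here == sending the acknowledgement to the producer

fd, path = tempfile.mkstemp()
append(fd, b"safe\n", fsync_before_ack=True)   # today's acks=all path
append(fd, b"fast\n", fsync_before_ack=False)  # proposed in-memory ack
os.close(fd)
```

The second call returns without waiting on the device, which is where the latency win comes from; the cost is that a power loss before the next fsync can drop the acknowledged record.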

Why (short reason)

Open up low-latency use cases and cloud multi-tenancy use cases.

Impact

Enable new super low latency use cases (sustained 1ms or so on non-saturated devices). I expect to have a tail of optimizations where we remove all sorts of 'debouncing' efforts that we do to pipeline batches to disk.

We note explicitly that the long tail optimizations are out of the scope of this proposal.

Motivation

Why are we doing this?

We would like to enable new users onboarding onto Redpanda who are comfortable with the potential for data loss during a coordinated failure, because they have specialized hardware setups or otherwise understand the tradeoffs. For example, GCP disks are backed by battery-powered generators (with gasoline backup) that ensure fsyncs happen to the underlying storage in 32MB chunks. This drives huge efficiencies for IOPS in cloud disks, which become the bottleneck of a system optimizing for data safety when volumes are large.

What use cases does it support?

What is the expected outcome?

Redpanda at memory speeds. More specifically, for a quorum-in-memory write the expectation is a little bit higher than acks=leader and a lot less than acks=all.

Guide-level explanation

How do we teach this?

We need to explain to programmers that when acknowledging an in-memory write, there is a potential for data loss due to correlated failures even in a replicated setting. We must also say that cloud vendors have optimized for this use case in particular. This content should be driven by engineering with precise technical content, so that a developer or end user can actually understand it.

Introducing new named concepts.

Reference-level explanation

@rystsov : TODO

Interaction with other features

Telemetry & Observability

@rystsov: TODO

Corner cases dissected by example.

@rystsov : TODO

Detailed design - What needs to change to get there

@rystsov: TODO

The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work.

Detailed design - How it works

@rystsov: TODO

Describe the overview of the design, and then explain each part of the implementation in enough detail that reviewers will be able to identify any missing pieces. Make sure to call out interactions with other active RFCs.

Drawbacks

This adds complexity when debugging a production issue. Tooling must be added to rpk (if not already there) to be able to actually understand customer impact.

Rationale and Alternatives

@rystsov: TODO

This section is extremely important. See the README file for details.

Unresolved questions

@rystsov: to finish

References

Closes #1836

JIRA Link: CORE-748

dotnwat commented 3 years ago

We've discussed in the past a mode in which fsync is a configurable periodic event (e.g. fsync every 10 seconds) that allows a cluster to be tuned by bounding how old lost data might become. If memory acks are implemented in terms of a periodic fsync, then pure memory acks could be configured by choosing a sufficiently large timeout period.
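The periodic-fsync idea above can be sketched in a few lines of Python (hedged: this is a toy model with a single file and a timer thread, not how a real storage engine would schedule flushes). The interval bounds how much acknowledged-but-unsynced data a crash could lose; a very large interval approximates pure in-memory acks.

```python
import os
import tempfile
import threading

class PeriodicFlusher:
    """Fsync a log fd every `interval` seconds in the background."""

    def __init__(self, fd: int, interval: float):
        self.fd = fd
        self.interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Event.wait returns False on timeout, True once stop is set.
        while not self._stop.wait(self.interval):
            os.fsync(self.fd)

    def close(self):
        self._stop.set()
        self._thread.join()
        os.fsync(self.fd)  # final flush so shutdown is clean

fd, path = tempfile.mkstemp()
flusher = PeriodicFlusher(fd, interval=0.05)
os.write(fd, b"acknowledged before fsync\n")  # ack happens here
flusher.close()                               # durable at the latest now
os.close(fd)
```

Writes are acknowledged immediately after `os.write`, and durability lags by at most one interval plus the fsync itself.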

dotnwat commented 3 years ago

super Low latency use cases

I'd love to see some baseline numbers where we comment out fsync and then run with acks=all
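A rough way to get such baseline numbers without touching Redpanda itself is a microbenchmark of the raw syscall cost (hedged: this measures local file I/O only, not replication or the Kafka path, and absolute numbers depend heavily on the device and filesystem).

```python
import os
import tempfile
import time

def time_appends(n: int, payload: bytes, do_fsync: bool):
    """Return (p50, p99) latency in seconds for n appends."""
    fd, path = tempfile.mkstemp()
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        os.write(fd, payload)
        if do_fsync:
            os.fsync(fd)
        samples.append(time.perf_counter() - t0)
    os.close(fd)
    os.remove(path)
    samples.sort()
    return samples[len(samples) // 2], samples[int(len(samples) * 0.99)]

payload = b"x" * 1024
p50_sync, p99_sync = time_appends(200, payload, do_fsync=True)
p50_mem, p99_mem = time_appends(200, payload, do_fsync=False)
print(f"fsync   p50={p50_sync * 1e6:.0f}us p99={p99_sync * 1e6:.0f}us")
print(f"no-sync p50={p50_mem * 1e6:.0f}us p99={p99_mem * 1e6:.0f}us")
```

The gap between the two p99 columns is a ceiling on what skipping fsync can buy per write on that device.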

rkruze commented 3 years ago

So when I think of acks, I think of acks=all and acks=1 which is separate from this feature request. Would it make sense to call this "Deferred Log Flushing"? I only say this as acks in the Kafka world already means how many brokers are involved when you produce a message, not if they are in memory or not.

emaxerrno commented 3 years ago

> So when I think of acks, I think of acks=all and acks=1 which is separate from this feature request. Would it make sense to call this "Deferred Log Flushing"? I only say this as acks in the Kafka world already means how many brokers are involved when you produce a message, not if they are in memory or not.

Sure, I reworded it to mem writes.

emaxerrno commented 3 years ago

> super Low latency use cases

I'd love to see some baseline numbers where we comment out fsync and then run with acks=all

That's a really good idea. My hunch is probably tails at the debounce window of 4ms? We may be able to skip debouncing contextually.

rkruze commented 3 years ago

> sure, i reworded to mem writes

I know this is a nit but when I think of memory-writes I think of the whole topic being in memory. This is more of a deferred write IMO.

dotnwat commented 3 years ago

> super Low latency use cases
>
> I'd love to see some baseline numbers where we comment out fsync and then run with acks=all
>
> that's a really good idea. My hunch is probably tails at the debounce window of 4ms? We may be able to skip debouncing contextually.

two cases:

So I think the baselines are really key.

emaxerrno commented 1 year ago

I bet this combined with @travisdowns' write-coalescing work would make a big difference.

mattschumpert commented 1 year ago

Really like the idea of having this topic-wise.

StasiaZam commented 1 year ago

@mattschumpert - could you please summarize the goals for this initiative (around performance and comparison with Kafka)? The eng team needs this to come up with estimates.