redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

ability to produce stalls #1711

Closed shribe closed 1 year ago

shribe commented 3 years ago

Just running benchmarks against a cluster of 3 replicas, in Kubernetes, with all other settings the same as your single-replica example. After a while of running at 600K-1.4M rps (depending on which k8s cluster), we usually get a run of 0 rps reported, then it picks back up. After a small number of iterations of this, it stalls permanently. (Quitting and restarting the client does not help.)

We have noticed that the redpanda containers are restarting, and the stalls to 0 rps happen around the time of the restarts. The logs around the container exits seem to show memory allocation activity. If I increase the memory allocation, it takes substantially longer for this to happen.

It seems that if a cluster is run with only 1 replica, this does not happen (of course it's not definitive, but my single-node test has been running for almost 3 hours, much longer than a 3-replica test has ever lasted).

The containers are not running out of disk.

When it happens, the test application from franz-go/examples/bench run at debug log level produces the following output:

root@sangrenel:/franz-go/examples/bench# ./bench -topic test-2 -brokers redpanda.pg.svc.cluster.local:9092 -pool -log-level debug
[INFO] waiting for metadata to produce to unknown topic; topic: test-2
[DEBUG] opening connection to broker; addr: redpanda.pg.svc.cluster.local:9092, broker: -2147483648
[DEBUG] connection opened to broker; addr: redpanda.pg.svc.cluster.local:9092, broker: -2147483648
[DEBUG] issuing api versions request; broker: -2147483648, version: 3
[DEBUG] wrote ApiVersions v3; broker: -2147483648, bytes_written: 29, write_wait: 106.09µs, time_to_write: 40.355µs, err: <nil>
[DEBUG] read ApiVersions v3; broker: -2147483648, bytes_read: 204, read_wait: 33.957µs, time_to_read: 105.037µs, err: <nil>
[DEBUG] connection initialized successfully; addr: redpanda.pg.svc.cluster.local:9092, broker: -2147483648
[DEBUG] wrote Metadata v7; broker: -2147483648, bytes_written: 30, write_wait: 3.716917ms, time_to_write: 40.777µs, err: <nil>
[DEBUG] read Metadata v7; broker: -2147483648, bytes_read: 250, read_wait: 30.869µs, time_to_read: 71.024µs, err: <nil>
[INFO] done waiting for unknown topic; topic: test-2
[INFO] initializing producer id
[DEBUG] opening connection to broker; addr: redpanda-1.redpanda.pg.svc.cluster.local.:9092, broker: 1
[DEBUG] connection opened to broker; addr: redpanda-1.redpanda.pg.svc.cluster.local.:9092, broker: 1
[DEBUG] issuing api versions request; broker: 1, version: 3
[DEBUG] wrote ApiVersions v3; broker: 1, bytes_written: 29, write_wait: 15.189µs, time_to_write: 29.839µs, err: <nil>
[DEBUG] read ApiVersions v3; broker: 1, bytes_read: 204, read_wait: 19.542µs, time_to_read: 65.704µs, err: <nil>
[DEBUG] connection initialized successfully; addr: redpanda-1.redpanda.pg.svc.cluster.local.:9092, broker: 1
[DEBUG] wrote InitProducerID v1; broker: 1, bytes_written: 23, write_wait: 911.826µs, time_to_write: 31.667µs, err: <nil>
[DEBUG] read InitProducerID v1; broker: 1, bytes_read: 24, read_wait: 16.863µs, time_to_read: 57.736µs, err: <nil>
[INFO] producer id initialization success; id: 1010, epoch: 0
[DEBUG] opening connection to broker; addr: redpanda-2.redpanda.pg.svc.cluster.local.:9092, broker: 2
[DEBUG] connection opened to broker; addr: redpanda-2.redpanda.pg.svc.cluster.local.:9092, broker: 2
[DEBUG] issuing api versions request; broker: 2, version: 3
[DEBUG] wrote ApiVersions v3; broker: 2, bytes_written: 29, write_wait: 19.675µs, time_to_write: 33.346µs, err: <nil>
[DEBUG] read ApiVersions v3; broker: 2, bytes_read: 204, read_wait: 21.327µs, time_to_read: 71.379µs, err: <nil>
[DEBUG] connection initialized successfully; addr: redpanda-2.redpanda.pg.svc.cluster.local.:9092, broker: 2
[DEBUG] wrote Produce v7; broker: 2, bytes_written: 106417, write_wait: 5.882046ms, time_to_write: 64.881µs, err: <nil>
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
[DEBUG] read Produce v7; broker: 2, bytes_read: 58, read_wait: 49.73µs, time_to_read: 2.002433646s, err: <nil>
[DEBUG] produced; broker: 2, to: test-2[0{retrying@-1,7246(NOT_LEADER_FOR_PARTITION: This server is not the leader for that topic-partition.)}]
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
[DEBUG] opening connection to broker; addr: redpanda-2.redpanda.pg.svc.cluster.local.:9092, broker: 2
[DEBUG] connection opened to broker; addr: redpanda-2.redpanda.pg.svc.cluster.local.:9092, broker: 2
[DEBUG] issuing api versions request; broker: 2, version: 3
[DEBUG] wrote ApiVersions v3; broker: 2, bytes_written: 29, write_wait: 27.284µs, time_to_write: 26.614µs, err: <nil>
[DEBUG] read ApiVersions v3; broker: 2, bytes_read: 204, read_wait: 23.959µs, time_to_read: 92.338µs, err: <nil>
[DEBUG] connection initialized successfully; addr: redpanda-2.redpanda.pg.svc.cluster.local.:9092, broker: 2
[DEBUG] wrote Metadata v7; broker: 2, bytes_written: 30, write_wait: 1.168366ms, time_to_write: 37.4µs, err: <nil>
[DEBUG] read Metadata v7; broker: 2, bytes_read: 250, read_wait: 30.252µs, time_to_read: 83.475µs, err: <nil>
[DEBUG] wrote Produce v7; broker: 2, bytes_written: 106417, write_wait: 3.544975ms, time_to_write: 71.164µs, err: <nil>
[DEBUG] wrote Produce v7; broker: 2, bytes_written: 104893, write_wait: 4.452ms, time_to_write: 50.122µs, err: <nil>
[DEBUG] wrote Produce v7; broker: 2, bytes_written: 104981, write_wait: 5.295396ms, time_to_write: 36.669µs, err: <nil>
[DEBUG] wrote Produce v7; broker: 2, bytes_written: 105036, write_wait: 6.10261ms, time_to_write: 35.34µs, err: <nil>
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
[DEBUG] read Produce v7; broker: 2, bytes_read: 58, read_wait: 105.211µs, time_to_read: 1.997693122s, err: <nil>
[DEBUG] produced; broker: 2, to: test-2[0{retrying@-1,7246(NOT_LEADER_FOR_PARTITION: This server is not the leader for that topic-partition.)}]
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s

A check with rpk shows this:

redpanda@redpanda-0:/$ rpk cluster info redpanda --brokers redpanda.pg.svc.cluster.local:9092
  Redpanda Cluster Info

  0 (redpanda-0.redpanda.pg.svc.cluster.local.:9092)  Leader:

                                                      Replica: test-2: [0]

  1 (redpanda-1.redpanda.pg.svc.cluster.local.:9092)  Leader:

                                                      Replica: test-2: [0]

  2 (redpanda-2.redpanda.pg.svc.cluster.local.:9092)  Leader: test-2: [0]

                                                      Replica:

root@sangrenel:/franz-go/examples/bench# rpk topic produce test-2 --brokers redpanda.pg.svc.cluster.local:9092
Reading message... Press CTRL + D to send, CTRL + C to cancel.
foobar
Sent record to partition 0 at offset 5306104228 with timestamp 2021-06-25 20:25:30.576639673 +0000 UTC m=+4.591610908.

It looks as though the container restarts shuffled the leader around, but something didn't get updated with the new leader info. (At this point, all the containers have multiple restarts: 2, 3, 3.) In fact, the still-running single-node instance has 4 restarts, but the bench utility has never reported less than 500K rps--if the problem were propagating the new leader after a crash, that wouldn't affect a single-node cluster, right?

I've attached redpanda logs; the numbers 0-2 are the node numbers; "a" is the prior log of the container, in other words the "crashed" one; "b" is the log of the currently running one when the above was happening.

0a.gz 0b.gz 1a.gz 1b.gz 2a.gz 2b.gz

shribe commented 3 years ago

Update: at about 3.5 hours, the producer to the single-node cluster quit as follows:

66.44 MiB/s; 696.64k records/s
67.77 MiB/s; 710.57k records/s
53.30 MiB/s; 558.87k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
90.33 MiB/s; 947.22k records/s
71.89 MiB/s; 753.79k records/s
53.70 MiB/s; 563.08k records/s
26.85 MiB/s; 281.50k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
304.17 MiB/s; 3189.45k records/s
142.15 MiB/s; 1490.52k records/s
101.49 MiB/s; 1064.16k records/s
produce error: EOF
emaxerrno commented 3 years ago

@shribe what redpanda version?

To me it looks like something is probably killing the pods. Are you running them in non-dev mode? And with what mem/cpu? Can you add more context so we can debug?

  1. version
  2. cpu to mem ratio (2GB+/core is needed for bench, plus ~10% for system stuff) - make sure the pod doesn't get killed
  3. did you tune the kernel / etc. to not pay a block-io copy? (rpk mode production & rpk tune all) - otherwise you leave large perf gains
  4. what does fio give you w/ a 10G file seq-write-read
twmb commented 3 years ago

w/ the benchmark, when kgo receives NOT_LEADER_FOR_PARTITION, it sets the partition to "failed" and triggers metadata to be updated. While the partition is failed, the client does not try producing that partition. A metadata update always clears the failure state (that's why we see the producing start back up after the "read metadata" log). However, the client also ensures that metadata updates are not spammed, so for this particular failure, the client ensures that at least MetadataMinAge has passed before re-requesting metadata -- that's likely why we see things wait for a bit before the metadata request is issued.

We can see that the metadata update did not change the leader broker: the logs before the first NOT_LEADER_FOR_PARTITION show things producing to broker 2, the logs after the partition show the same. rpk agrees with this, the output shows the leader is broker 2.

The EOF at the end is due to Redpanda closing the connection on us unexpectedly. Currently that is not a retriable error. I'm not sure it is worth it to make that particular error retriable -- there's not much of a reason for a broker to close a connection randomly when we are reading. This likely can be a retriable error but I'll have to think through the ramifications.
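
For illustration, here is a minimal franz-go sketch of where the MetadataMinAge knob mentioned above lives (this is not how bench is configured; kgo.SeedBrokers and kgo.MetadataMinAge are existing client options, and the 2s value is only an example of trading more metadata traffic for faster recovery after a leadership error):

package main

import (
    "fmt"
    "time"

    "github.com/twmb/franz-go/pkg/kgo"
)

func main() {
    // Allow metadata refreshes as often as every 2s, so a partition marked
    // "failed" after NOT_LEADER_FOR_PARTITION is retried sooner, at the cost
    // of more metadata requests while the failure persists.
    client, err := kgo.NewClient(
        kgo.SeedBrokers("redpanda.pg.svc.cluster.local:9092"),
        kgo.MetadataMinAge(2*time.Second),
    )
    if err != nil {
        fmt.Println("client error:", err)
        return
    }
    defer client.Close()
}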

shribe commented 3 years ago

To me it looks like something is probably killing the pods. Are you running them in non-dev mode? And with what mem/cpu? Can you add more context so we can debug?

Sorry, I switched dev mode off, then at some point accidentally reverted to a yaml with dev mode on, so I don't know which tests were run with it on vs. off.

  1. version

21.6.4

  2. cpu to mem ratio (2GB+/core is needed for bench, plus ~10% for system stuff) - make sure the pod doesn't get killed

I started with your example, which had 1.2GB per CPU; that hit this problem pretty quickly. I bumped it up to 2GB/CPU, which made it take much longer to fail. Based on what you said above I went up to 3GB/CPU, and it has now been running 10 hours straight without stalling. But I do see multiple restarts on each pod; the logs of the dead containers end like this:

DEBUG 2021-06-29 00:56:58,115 [shard 0] compaction_ctrl - backlog_controller.cc:128 - updating shares 10
6/28/2021 6:57:11 PM DEBUG 2021-06-29 00:57:11,126 [shard 0] storage - readers_cache.cc:199 - {kafka/test-1/0} - evicting reader from cache, range [19184498999,19217865500]
6/28/2021 6:57:11 PM INFO  2021-06-29 00:57:11,126 [shard 0] storage - segment.cc:626 - Creating new segment /var/lib/redpanda/data/kafka/test-1/0_8/19217865501-5-v1.log
6/28/2021 6:57:18 PM DEBUG 2021-06-29 00:57:18,458 [shard 0] rpc - server.cc:91 - kafka rpc protocol Closing: 127.0.0.1:42356
6/28/2021 6:57:18 PM DEBUG 2021-06-29 00:57:18,459 [shard 0] rpc - server.cc:91 - kafka rpc protocol Closing: 10.250.5.172:34366
6/28/2021 6:57:18 PM ERROR 2021-06-29 00:57:18,514 [shard 0] assert - Assert failure: (../../../src/v/storage/segment_appender.cc:494) 'false' Could not dma_write: std::__1::system_error (error system:12, Cannot allocate memory) - {no_of_chunks:64, closed:0, fallocation_offset:67108864, committed_offset:55831505, bytes_flush_pending:0}
6/28/2021 6:57:18 PM ERROR 2021-06-29 00:57:18,514 [shard 0] assert - Backtrace below:
6/28/2021 6:57:18 PM 0x2af2934 0x2310676 0x231062f 0x2310605 0x23108b9 0x2870164 0x2872dd7 0x279c0e9 0x279a140 0xd652a5 0x2d910ec /opt/redpanda/lib/libc.so.6+0x281a1 0xd6330d
6/28/2021 6:57:18 PM    --------
6/28/2021 6:57:18 PM    seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void> seastar::future<void>::handle_exception<storage::segment_appender::dispatch_background_head_write()::$_21>(storage::segment_appender::dispatch_background_head_write()::$_21&&)::'lambda'(storage::segment_appender::dispatch_background_head_write()::$_21&&), seastar::futurize<storage::segment_appender::dispatch_background_head_write()::$_21>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void> seastar::future<void>::handle_exception<storage::segment_appender::dispatch_background_head_write()::$_21>(storage::segment_appender::dispatch_background_head_write()::$_21&&)::'lambda'(storage::segment_appender::dispatch_background_head_write()::$_21&&)>(seastar::future<void> seastar::future<void>::handle_exception<storage::segment_appender::dispatch_background_head_write()::$_21>(storage::segment_appender::dispatch_background_head_write()::$_21&&)::'lambda'(storage::segment_appender::dispatch_background_head_write()::$_21&&)&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::future<void> seastar::future<void>::handle_exception<storage::segment_appender::dispatch_background_head_write()::$_21>(storage::segment_appender::dispatch_background_head_write()::$_21&&)::'lambda'(storage::segment_appender::dispatch_background_head_write()::$_21&&)&, seastar::future_state<seastar::internal::monostate>&&), void>
  3. did you tune the kernel / etc. to not pay a block-io copy? (rpk mode production & rpk tune all) - otherwise you leave large perf gains

No--I don't have access to do that in this cluster. (I could in another cluster, and might out of curiosity, but the performance we are seeing untuned is already well above what we need. What we need to see is stability.)

  4. what does fio give you w/ a 10G file seq-write-read
==================
= Dbench Summary =
==================
Random Read/Write IOPS: 398k/66.4k. BW: 5481MiB/s / 2656MiB/s
Average Latency (usec) Read/Write: 99.49/70.91
Sequential Read/Write: 6424MiB/s / 3080MiB/s
Mixed Random Read/Write IOPS: 135k/44.8k
shribe commented 3 years ago

w/ the benchmark, when kgo receives NOT_LEADER_FOR_PARTITION, it sets the partition to "failed" and triggers metadata to be updated. While the partition is failed, the client does not try producing that partition. A metadata update always clears the failure state (that's why we see the producing start back up after the "read metadata" log). However, the client also ensures that metadata updates are not spammed, so for this particular failure, the client ensures that at least MetadataMinAge has passed before re-requesting metadata -- that's likely why we see things wait for a bit before the metadata request is issued.

That sounds like exactly what I observe when it stalls briefly then resumes producing. However, after a few such stalls, the failure appears to be permanent. I once left it for a full hour, and it logged 0 records/s continuously until I killed it.

We can see that the metadata update did not change the leader broker: the logs before the first NOT_LEADER_FOR_PARTITION show things producing to broker 2, the logs after the partition show the same. rpk agrees with this, the output shows the leader is broker 2.

OK, I had assumed the leader was changing, but it must be some other reason that it wedges.

The EOF at the end is due to Redpanda closing the connection on us unexpectedly. Currently that is not a retriable error. I'm not sure it is worth it to make that particular error retriable -- there's not much of a reason for a broker to close a connection randomly when we are reading. This likely can be a retriable error but I'll have to think through the ramifications.

I'm more concerned about redpanda closing the connection unexpectedly than about how the client side handles it.

twmb commented 3 years ago

However, after a few such stalls, the failure appears to be permanent

I wonder if this is a client error or if redpanda has entered a fatal state. I can see about potentially reproducing this locally, but @dotnwat or @mmaslankaprv it's likely worth checking out those redpanda logs above^

twmb commented 3 years ago

@shribe is it possible to benchmark with another utility -- say, librdkafka's rdkafka_performance.c? (I think @dotnwat may have an incantation you can use here.) I'm curious (1) whether it gets similar throughput, and (2) whether it hits an unrecoverable point.

shribe commented 3 years ago

I wonder if this is a client error or if redpanda has entered a fatal state.

Quitting & restarting the client doesn't help, so redpanda is in a fatal state.

mmaslankaprv commented 3 years ago

@shribe, thank you for the detailed description. I will try to reproduce this issue on our side and will reach out to you as soon as I have more info.

shribe commented 3 years ago

So, running with 3GB/CPU, it managed 22 hours straight without stalling once. (The test was terminated by my VPN's max session timeout, NOT by a redpanda failure.) Over those 22 hours I saw a modest decline in throughput--perfectly reasonable given index depth growth when pushing a few tens of billions of records into a single partition.

Since the point ~10 hours in when I noted above that there had been a few restarts (2, 3, 4), there were no additional restarts during the next ~12 hours.

The restarts by themselves are not a critical issue, but right now I worry that there might be a fatal condition triggered by multiple restarts, perhaps multiple nearly simultaneous restarts--such that increasing memory isn't actually a 100% fix, but just makes it far less likely (since even with higher memory I still see a few restarts).

I will probably be able to give rdkafka_performance a try later today, and will run with develop mode off. Heck I might give it a go with ZFS volumes as well.

twmb commented 3 years ago

Awesome, thank you! And from the client side, please let me know if rdkafka_performance has higher throughput or lower :)

dotnwat commented 3 years ago

6/28/2021 6:57:18 PM DEBUG 2021-06-29 00:57:18,459 [shard 0] rpc - server.cc:91 - kafka rpc protocol Closing: 10.250.5.172:34366
6/28/2021 6:57:18 PM ERROR 2021-06-29 00:57:18,514 [shard 0] assert - Assert failure: (../../../src/v/storage/segment_appender.cc:494) 'false' Could not dma_write: std::__1::system_error (error system:12, Cannot allocate memory) - {no_of_chunks:64, closed:0, fallocation_offset:67108864, committed_offset:55831505, bytes_flush_pending:0}

@mmaslankaprv this might be because we aren't applying any back pressure in the segment appender

shribe commented 3 years ago

Awesome, thank you! And from the client side, please let me know if rdkafka_performance has higher throughput or lower :)

Very roughly, rdkafka seems to produce ~30% faster. (All tests were run with 100-byte payloads.)

shribe commented 3 years ago

(And I won't be testing on ZFS, because of RBAC/privs issues in that cluster and it's not important enough to me to dig and figure out the fix. Probably not hard really--just super-low importance.)

dotnwat commented 3 years ago

@shribe, quick question as we try to reproduce this locally: you've got three replicas. How many partitions are in the topic you are producing to, and what is the acks setting (or maybe it's the default for that bench program)?

shribe commented 3 years ago

1 partition, ack settings are left at defaults

shribe commented 3 years ago

Something really curious: when I run the same tests on a test cluster with a Weka CSI for storage, the rdkafka performance is basically the same, but bench performance roughly doubles (and is thus ~40% faster than rdkafka).

different hardware & different network in addition to the different storage, so many variables...

twmb commented 3 years ago

What are the flags you are invoking rdkafka_performance with, as well as the flags you're invoking franz-go's bench with?

shribe commented 3 years ago
./bench -topic test-1 -brokers redpanda-4.redpanda-dev.svc.cluster.local:9092 -pool

./rdkafka_performance -P -t test-2 -b redpanda-4.redpanda-dev.svc.cluster.local:9092 -s 100 -i 1000
twmb commented 3 years ago

Awesome, thanks. There are a few out-of-the-box differences between the two benchmarking utilities: librdkafka does not use compression by default, it uses a static (non-variable) message, and it lingers for 1s.

I just pushed one commit to the benchmarking utility, so if you want a like-for-like comparison against librdkafka's rdkafka_performance, try the following flags: -linger 1s -compression none -static-record. Alternatively, drop the compression flag and add -z snappy to rdkafka_performance.
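
Spelled out against the commands used earlier in this thread, the two like-for-like pairings would look roughly like this (flags as given above; topic names kept as in the earlier runs):

uncompressed on both sides:
./bench -topic test-1 -brokers redpanda-4.redpanda-dev.svc.cluster.local:9092 -pool -linger 1s -compression none -static-record
./rdkafka_performance -P -t test-2 -b redpanda-4.redpanda-dev.svc.cluster.local:9092 -s 100 -i 1000

or snappy on both sides:
./bench -topic test-1 -brokers redpanda-4.redpanda-dev.svc.cluster.local:9092 -pool -linger 1s -static-record
./rdkafka_performance -P -t test-2 -b redpanda-4.redpanda-dev.svc.cluster.local:9092 -s 100 -i 1000 -z snappy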

shribe commented 3 years ago

Really quick checks: the linger speeds it up a bit, the static record a tiny bit, and turning compression off slows it down (which points to the network, I think).

isaacsanders commented 3 years ago

Is there more to come on this issue?

emaxerrno commented 3 years ago

Hi @isaacsanders - did you hit the same issue, or what are you looking to learn? This is running on a non-XFS filesystem, and we haven't done as much testing on ZFS or other non-XFS filesystems, so we haven't prioritized it yet, but we should get to this issue soon, hopefully.

isaacsanders commented 3 years ago

@senior7515 I work with @shribe and I couldn’t determine the state of the issue. There’s a lot of information shared here, but I can’t see if there’s an action item left. Are you all still working to reproduce the issue, or did we get a different answer with the new system?

emaxerrno commented 3 years ago

@isaacsanders - makes sense. We tried on XFS (a regular filesystem) and couldn't repro, but we want to get to the bottom of it; we just haven't had a chance yet.

@mmaslankaprv & @dotnwat - how can @shribe and @isaacsanders help us debug the issue?

shribe commented 3 years ago

This is running on a non-XFS filesystem

Clarification: tests were primarily run on XFS, and the problem was encountered there. (Tests on Weka were run because we are currently evaluating Weka for many uses. Tests on ZFS were not run.)

emaxerrno commented 3 years ago

ah got it. ok prioritizing. thanks.

dotnwat commented 3 years ago

I will be taking a deeper look at this. @shribe, were any of your topics using compaction or a retention policy that may have kicked in (e.g. based on size or time, depending on the settings and the length of your experiments)?
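
(If it helps to check, something like rpk topic describe test-2 --brokers redpanda.pg.svc.cluster.local:9092 should print the topic's configuration, including any cleanup.policy / retention settings -- assuming the rpk build in 21.6.x supports that subcommand the same way current releases do.)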

shribe commented 3 years ago

When I used rpk to create the topic, the only option I provided was for 3 replicas. I know this discussion has gotten long, so I think it's worth pointing out that I started with your single-node demo, meaning 1.2GB of RAM. On my own, I upped it to 2GB, and then container restarts were less frequent and it took longer to trigger this problem. Based on the advice here, I upped it to 3GB, and then container restarts became very infrequent and I didn't trigger the problem again in a day(ish) of running.

dotnwat commented 3 years ago

That's excellent context, thanks. 2GB can definitely be pretty tight w.r.t. memory pressure, so it may be that an OOM was encountered. Last question: were you by chance sending compressed data with zstd? We recently fixed a memory-pressure issue related to zstd.

shribe commented 3 years ago

I was using franz-go/examples/bench with defaults, which means compressed data--but not sure if it's zstd. @twmb?

twmb commented 3 years ago

The default compressor is snappy (since it's fast and generally good enough).
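
If pinning the codec ever helps narrow this down (e.g. to rule zstd in or out), here is a hedged sketch using franz-go's existing kgo.ProducerBatchCompression option; the snappy-then-none preference order just mirrors the default described above and is only an example, not how bench itself is written:

package main

import (
    "fmt"

    "github.com/twmb/franz-go/pkg/kgo"
)

func main() {
    // Explicitly prefer snappy batches, falling back to uncompressed if the
    // broker does not support snappy, instead of relying on client defaults.
    client, err := kgo.NewClient(
        kgo.SeedBrokers("redpanda.pg.svc.cluster.local:9092"),
        kgo.ProducerBatchCompression(kgo.SnappyCompression(), kgo.NoCompression()),
    )
    if err != nil {
        fmt.Println("client error:", err)
        return
    }
    defer client.Close()
}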

jcsp commented 3 years ago

There were a number of fixes for low-memory issues in the 21.8.x release cycle, so this issue may be fixed -- it would be worth re-testing with the latest release.

jcsp commented 1 year ago

This ticket is ~1.5yrs old, closing.