Closed: @shribe closed this issue 1 year ago.
Update: at about 3.5 hours, the producer to the single-node cluster quit as follows:
66.44 MiB/s; 696.64k records/s
67.77 MiB/s; 710.57k records/s
53.30 MiB/s; 558.87k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
90.33 MiB/s; 947.22k records/s
71.89 MiB/s; 753.79k records/s
53.70 MiB/s; 563.08k records/s
26.85 MiB/s; 281.50k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
0.00 MiB/s; 0.00k records/s
304.17 MiB/s; 3189.45k records/s
142.15 MiB/s; 1490.52k records/s
101.49 MiB/s; 1064.16k records/s
produce error: EOF
@shribe what redpanda version?
To me it looks like something is probably killing the pods. Are you running them in non-dev mode? And what mem/cpu? Can you add more context so we can debug?
- version
- cpu to mem ratio (2GB+/core for bench is needed) + 10% for system stuff make sure the pod doesn't get killed
- did you tune the kernel / etc to not pay a block-io copy ? (rpk mode production & rpk tune all) - otherwise you leave large perf gains
- what does fio give you w/ a 10G file seq-write-read

w/ the benchmark, when kgo receives NOT_LEADER_FOR_PARTITION, it sets the partition to "failed" and triggers metadata to be updated. While the partition is failed, the client does not try producing to that partition. A metadata update always clears the failure state (that's why we see producing start back up after the "read metadata" log). However, the client also ensures that metadata updates are not spammed, so for this particular failure, the client ensures that at least MetadataMinAge has passed before re-requesting metadata -- that's likely why we see things wait for a bit before the metadata request is issued.
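For reference, a minimal sketch of how that knob can be set on a kgo client -- the broker address and the 2s value below are purely illustrative, not what the bench uses:

package main

import (
    "time"

    "github.com/twmb/franz-go/pkg/kgo"
)

func main() {
    cl, err := kgo.NewClient(
        kgo.SeedBrokers("localhost:9092"), // illustrative address
        // A lower MetadataMinAge shortens the pause between a
        // NOT_LEADER_FOR_PARTITION failure and the metadata refresh that
        // clears the partition's failed state, at the cost of allowing
        // more frequent metadata requests.
        kgo.MetadataMinAge(2*time.Second),
    )
    if err != nil {
        panic(err)
    }
    defer cl.Close()
}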
We can see that the metadata update did not change the leader broker: the logs before the first NOT_LEADER_FOR_PARTITION show things producing to broker 2, and the logs after the failure show the same. rpk agrees with this: its output shows the leader is broker 2.
The EOF at the end is due to Redpanda closing the connection on us unexpectedly. Currently that is not a retriable error. I'm not sure it is worth it to make that particular error retriable -- there's not much of a reason for a broker to close a connection randomly when we are reading. This likely can be a retriable error but I'll have to think through the ramifications.
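If the client keeps treating that EOF as fatal, one possible workaround is retrying at the application layer. A rough sketch under that assumption -- the helper name and the single-retry policy are made up for illustration, not part of the bench:

package main

import (
    "context"
    "errors"
    "io"
    "log"

    "github.com/twmb/franz-go/pkg/kgo"
)

// produceWithRetry re-issues a record once if the produce promise reports
// io.EOF (i.e. the broker dropped the connection mid-request).
func produceWithRetry(ctx context.Context, cl *kgo.Client, r *kgo.Record) {
    cl.Produce(ctx, r, func(rec *kgo.Record, err error) {
        if errors.Is(err, io.EOF) {
            log.Println("broker closed the connection, retrying once")
            cl.Produce(ctx, rec, func(_ *kgo.Record, err error) {
                if err != nil {
                    log.Println("retry failed:", err)
                }
            })
            return
        }
        if err != nil {
            log.Println("produce failed:", err)
        }
    })
}

func main() {
    cl, err := kgo.NewClient(kgo.SeedBrokers("localhost:9092")) // illustrative address
    if err != nil {
        log.Fatal(err)
    }
    defer cl.Close()

    produceWithRetry(context.Background(), cl, &kgo.Record{Topic: "test-1", Value: []byte("hello")})
    cl.Flush(context.Background())
}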
To me it looks like something is probably killing the pods. Are you running them in non-dev mode? And what mem/cpu? Can you add more context so we can debug?
Sorry, I switched dev mode off, then at some point accidentally reverted to yaml with dev mode on, so I don't know which tests were on vs off.
- version
21.6.4
- cpu to mem ratio (2GB+/core for bench is needed) + 10% for system stuff make sure the pod doesn't get killed
I started with your example, which had 1.2GB per CPU; that hit this problem pretty quickly. I bumped it up to 2GB/CPU, which made it take much longer to fail. Based on what you said above I went up to 3GB/CPU, and it's now been running 10 hours straight without stalling. But I do see multiple restarts on each pod; the logs of the dead containers end like this:
DEBUG 2021-06-29 00:56:58,115 [shard 0] compaction_ctrl - backlog_controller.cc:128 - updating shares 10
6/28/2021 6:57:11 PM DEBUG 2021-06-29 00:57:11,126 [shard 0] storage - readers_cache.cc:199 - {kafka/test-1/0} - evicting reader from cache, range [19184498999,19217865500]
6/28/2021 6:57:11 PM INFO 2021-06-29 00:57:11,126 [shard 0] storage - segment.cc:626 - Creating new segment /var/lib/redpanda/data/kafka/test-1/0_8/19217865501-5-v1.log
6/28/2021 6:57:18 PM DEBUG 2021-06-29 00:57:18,458 [shard 0] rpc - server.cc:91 - kafka rpc protocol Closing: 127.0.0.1:42356
6/28/2021 6:57:18 PM DEBUG 2021-06-29 00:57:18,459 [shard 0] rpc - server.cc:91 - kafka rpc protocol Closing: 10.250.5.172:34366
6/28/2021 6:57:18 PM ERROR 2021-06-29 00:57:18,514 [shard 0] assert - Assert failure: (../../../src/v/storage/segment_appender.cc:494) 'false' Could not dma_write: std::__1::system_error (error system:12, Cannot allocate memory) - {no_of_chunks:64, closed:0, fallocation_offset:67108864, committed_offset:55831505, bytes_flush_pending:0}
6/28/2021 6:57:18 PM ERROR 2021-06-29 00:57:18,514 [shard 0] assert - Backtrace below:
6/28/2021 6:57:18 PM 0x2af2934 0x2310676 0x231062f 0x2310605 0x23108b9 0x2870164 0x2872dd7 0x279c0e9 0x279a140 0xd652a5 0x2d910ec /opt/redpanda/lib/libc.so.6+0x281a1 0xd6330d
6/28/2021 6:57:18 PM --------
6/28/2021 6:57:18 PM seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void> seastar::future<void>::handle_exception<storage::segment_appender::dispatch_background_head_write()::$_21>(storage::segment_appender::dispatch_background_head_write()::$_21&&)::'lambda'(storage::segment_appender::dispatch_background_head_write()::$_21&&), seastar::futurize<storage::segment_appender::dispatch_background_head_write()::$_21>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void> seastar::future<void>::handle_exception<storage::segment_appender::dispatch_background_head_write()::$_21>(storage::segment_appender::dispatch_background_head_write()::$_21&&)::'lambda'(storage::segment_appender::dispatch_background_head_write()::$_21&&)>(seastar::future<void> seastar::future<void>::handle_exception<storage::segment_appender::dispatch_background_head_write()::$_21>(storage::segment_appender::dispatch_background_head_write()::$_21&&)::'lambda'(storage::segment_appender::dispatch_background_head_write()::$_21&&)&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::future<void> seastar::future<void>::handle_exception<storage::segment_appender::dispatch_background_head_write()::$_21>(storage::segment_appender::dispatch_background_head_write()::$_21&&)::'lambda'(storage::segment_appender::dispatch_background_head_write()::$_21&&)&, seastar::future_state<seastar::internal::monostate>&&), void>
- did you tune the kernel / etc to not pay a block-io copy ? (rpk mode production & rpk tune all) - otherwise you leave large perf gains
No--I don't have access to do that in this cluster. (I could in another cluster, and might out of curiosity, but the performance we are seeing untuned is already well above what we need. What we need to see is stability.)
- what does fio give you w/ a 10G file seq-write-read
==================
= Dbench Summary =
==================
Random Read/Write IOPS: 398k/66.4k. BW: 5481MiB/s / 2656MiB/s
Average Latency (usec) Read/Write: 99.49/70.91
Sequential Read/Write: 6424MiB/s / 3080MiB/s
Mixed Random Read/Write IOPS: 135k/44.8k
w/ the benchmark, when kgo receives NOT_LEADER_FOR_PARTITION, it sets the partition to "failed" and triggers metadata to be updated. While the partition is failed, the client does not try producing to that partition. A metadata update always clears the failure state (that's why we see producing start back up after the "read metadata" log). However, the client also ensures that metadata updates are not spammed, so for this particular failure, the client ensures that at least MetadataMinAge has passed before re-requesting metadata -- that's likely why we see things wait for a bit before the metadata request is issued.
That sounds like exactly what I observe when it stalls briefly then resumes producing. However, after a few such stalls, the failure appears to be permanent. I once left it for a full hour, and it logged 0 records/s continuously until I killed it.
We can see that the metadata update did not change the leader broker: the logs before the first NOT_LEADER_FOR_PARTITION show things producing to broker 2, and the logs after the failure show the same. rpk agrees with this: its output shows the leader is broker 2.
OK, I had assumed the leader was changing, but it must be some other reason that it wedges.
The EOF at the end is due to Redpanda closing the connection on us unexpectedly. Currently that is not a retriable error. I'm not sure it is worth it to make that particular error retriable -- there's not much of a reason for a broker to close a connection randomly when we are reading. This likely can be a retriable error but I'll have to think through the ramifications.
I'm more concerned about redpanda closing the connection unexpectedly than about how the client side handles it.
However, after a few such stalls, the failure appears to be permanent
I wonder if this is a client error or if redpanda has entered a fatal state. I can see about potentially reproducing this locally, but @dotnwat or @mmaslankaprv it's likely worth checking out those redpanda logs above^
@shribe is it possible to benchmark with another utility -- say, librdkafka's rdkafka_performance.c? (I think @dotnwat may have an incantation you can use here.) I'm curious whether it (1) gets similar throughput, and (2) hits an unrecoverable point.
I wonder if this is a client error or if redpanda has entered a fatal state.
Quitting & restarting the client doesn't help, so redpanda is in a fatal state.
@shribe, thank you for the detailed description. I will try to reproduce this issue on our side and will reach out to you as soon as I have some more info.
So, running with 3GB/CPU, it managed 22 hours straight without stalling once. (The test was terminated by my VPN max session timeout, NOT by redpanda failure.) Over those 22 hours I saw a modest decline in throughput--perfectly reasonable given the increase in index depth when pushing a few tens of billions of rows into a single partition.
From the ~10-hour point where I noted above that there had been a few restarts (2, 3, 4), there were no additional restarts during the following ~12 hours.
The restarts by themselves are not a critical issue, but right now I worry that there might be a fatal condition triggered by multiple restarts, perhaps multiple restarts nearly simultaneous--such that increasing memory isn't actually a 100% fix, but just makes it far less likely (since with higher memory I still see a few restarts).
I will probably be able to give rdkafka_performance a try later today, and will run with dev mode off. Heck, I might give it a go with ZFS volumes as well.
Awesome, thank you! And from the client side, please let me know if rdkafka_performance has higher throughput or lower :)
6/28/2021 6:57:18 PM DEBUG 2021-06-29 00:57:18,459 [shard 0] rpc - server.cc:91 - kafka rpc protocol Closing: 10.250.5.172:34366
6/28/2021 6:57:18 PM ERROR 2021-06-29 00:57:18,514 [shard 0] assert - Assert failure: (../../../src/v/storage/segment_appender.cc:494) 'false' Could not dma_write: std::__1::system_error (error system:12, Cannot allocate memory) - {no_of_chunks:64, closed:0, fallocation_offset:67108864, committed_offset:55831505, bytes_flush_pending:0}
@mmaslankaprv this might be because we aren't applying any back pressure in the segment appender
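For readers following along, "back pressure" here would mean bounding how many background head writes can be outstanding at once so they cannot exhaust memory. A generic illustration of the idea, sketched in Go with a channel semaphore (not Redpanda's actual code, which is C++/Seastar):

package main

import (
    "fmt"
    "sync"
)

func main() {
    const maxInflight = 4
    sem := make(chan struct{}, maxInflight) // caps outstanding "writes"
    var wg sync.WaitGroup

    for i := 0; i < 16; i++ {
        sem <- struct{}{} // blocks once maxInflight writes are in flight
        wg.Add(1)
        go func(n int) {
            defer wg.Done()
            defer func() { <-sem }() // release the slot when the write completes
            fmt.Println("background write", n, "done") // stand-in for an async disk write
        }(i)
    }
    wg.Wait()
}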
Awesome, thank you! And from the client side, please let me know if rdkafka_performance has higher throughput or lower :)
Very roughly, rdkafka seems to produce ~30% faster. (All tests run with 100-byte payloads.)
(And I won't be testing on ZFS, because of RBAC/privs issues in that cluster and it's not important enough to me to dig and figure out the fix. Probably not hard really--just super-low importance.)
@shribe quick question as we'll try to reproduce this locally: you've got three replicas. How many partitions are in the topic you are producing to, and what is the acks setting (or maybe it's the default for that bench program)?
1 partition, ack settings are left at defaults
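For context, acks could also be pinned explicitly on a franz-go client rather than left at the defaults; a minimal sketch using kgo option names (the broker address is illustrative):

package main

import (
    "log"

    "github.com/twmb/franz-go/pkg/kgo"
)

func main() {
    cl, err := kgo.NewClient(
        kgo.SeedBrokers("localhost:9092"), // illustrative address
        // Wait for the full ISR to ack each produce request.
        kgo.RequiredAcks(kgo.AllISRAcks()),
        // To test leader-only acks instead, swap in the two lines below;
        // anything other than all-ISR acks requires disabling idempotence.
        // kgo.RequiredAcks(kgo.LeaderAck()),
        // kgo.DisableIdempotentWrite(),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer cl.Close()
}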
something really curious: when I run the same tests on a test cluster with a Weka CSI for storage, the rdkafka performance is basically the same, but bench performance about doubles (and is thus 40% faster than rdkafka)
different hardware & different network in addition to the different storage, so many variables...
What are the flags you are invoking rdkafka_performance with, as well as the flags you're invoking franz-go's bench with?
./bench -topic test-1 -brokers redpanda-4.redpanda-dev.svc.cluster.local:9092 -pool
./rdkafka_performance -P -t test-2 -b redpanda-4.redpanda-dev.svc.cluster.local:9092 -s 100 -i 1000
Awesome, thanks. So there are a few differences out of the box between the two benchmarking utilities: librdkafka does not use compression by default, it uses a static (non-variable) message, and it uses 1s of lingering.
I just pushed one commit to the benchmarking utility, so if you want a like-for-like comparison against librdkafka's rdkafka_performance, try the following flags: -linger 1s -compression none -static-record. Alternatively, drop the compression flag and add -z snappy to rdkafka_performance.
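Putting those flags together with the invocations above, the first alternative would presumably look roughly like:
./bench -topic test-1 -brokers redpanda-4.redpanda-dev.svc.cluster.local:9092 -pool -linger 1s -compression none -static-record
and the second alternative (snappy on both sides) roughly like:
./rdkafka_performance -P -t test-2 -b redpanda-4.redpanda-dev.svc.cluster.local:9092 -s 100 -i 1000 -z snappy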
Really quick checks: the linger speeds it up a bit, the static record a tiny bit, and turning compression off slows it down (which points to the network, I think).
Is there more to come on this issue?
hi @isaacsanders - did you hit the same issue, or what are you looking to learn? This is running on a non xfs filesystem, so we haven't done as much testing on zfs or other filesystems that are not xfs. We haven't prioritized it yet, but we should get to this issue soon, hopefully.
@senior7515 I work with @shribe and I couldn’t determine the state of the issue. There’s a lot of information shared here, but I can’t see if there’s an action item left. Are you all still working to reproduce the issue, or did we get a different answer with the new system?
@isaacsanders - makes sense. We tried on xfs (a regular filesystem) and couldn't repro, but we want to get to the bottom of it; we just haven't had a chance yet.
@mmaslankaprv & @dotnwat - how can @shribe and @isaacsanders help us debug the issue?
this is running on a non xfs filesystem
Clarification: tests were primarily run on XFS, and the problem was encountered there. (Tests on Weka were run because we are currently evaluating Weka for many uses. Tests on ZFS were not run.)
ah got it. ok prioritizing. thanks.
will be taking a deeper look at this. @shribe were any of your topics using compaction or a retention policy that may have kicked in (e.g. based on size or time, depending on the settings and the length of your experiments)?
When I used rpk to create the topic, the only option I provided was for 3 replicas. I know this discussion has gotten long, so I think it's worth pointing out that I started with your single-node demo, meaning 1.2GB RAM. On my own, I upped it to 2GB, after which container restarts were less frequent and it took longer to trigger this problem. Based on the advice here, I upped it to 3GB, after which container restarts became very infrequent, and I didn't trigger the problem again in a day(ish) of running.
That's excellent context, thanks. 2GB can definitely be pretty tight w.r.t. memory pressure, so it may be that an OOM was encountered. Last question: were you by chance sending compressed data with zstd? We recently fixed a memory pressure issue related to zstd.
I was using franz-go/examples/bench with defaults, which means compressed data--but not sure if it's zstd. @twmb?
The default compressor is snappy (since it's fast and generally good enough).
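If it helps with the zstd question above, the producer's compression preference can also be set explicitly; a minimal sketch using kgo option names (the broker address is illustrative):

package main

import (
    "log"

    "github.com/twmb/franz-go/pkg/kgo"
)

func main() {
    cl, err := kgo.NewClient(
        kgo.SeedBrokers("localhost:9092"), // illustrative address
        // Prefer snappy, falling back to no compression if the broker
        // doesn't support it; swapping in kgo.ZstdCompression() here
        // would exercise the zstd path discussed above.
        kgo.ProducerBatchCompression(kgo.SnappyCompression(), kgo.NoCompression()),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer cl.Close()
}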
There were a number of fixes for low-memory issues in the 21.8.x release cycle, so this issue may be fixed -- it would be worth re-testing with the latest release.
This ticket is ~1.5yrs old, closing.
Just running benchmarks against a cluster of 3 replicas in kubernetes, with all other settings the same as your single-replica example. After a while of running at 600K-1.4M rps (depending on which k8s cluster), we usually get a run of 0rps reported, then it picks back up. After some small number of iterations of this, it stalls permanently. (Quitting and restarting the client does not help.)
We have noticed that the redpanda containers are restarting, and the stalls to 0rps happen around the time of the restarts. The logs around the container quits seem to reflect memory allocation activity. It seems that if I increase memory allocation, it takes substantially longer for this to happen.
It seems that if a cluster is run with only 1 replica, this does not happen (of course it's not definitive, but my single-node test has been running for almost 3 hours, much longer than a 3-replica test has ever lasted).
The containers are not running out of disk.
When it happens, the test application from franz-go/examples/bench run at debug log level produces the following output:
A check with rpk shows this:

It looks kind of like maybe the container restarts shuffled the leader around, but something didn't get updated with the new leader info. (At this point, all the containers have multiple restarts: 2, 3, 3.) In fact, taking a look at the still-running single-node one, it has 4 restarts, but the bench utility has never reported less than 500Krps--if the problem is propagating the new leader on crash, that wouldn't affect a single-node cluster, right?
I've attached redpanda logs; the numbers 0-2 are the node numbers; "a" is the prior log of the container, in other words the "crashed" one; "b" is the log of the currently running one when the above was happening.
0a.gz 0b.gz 1a.gz 1b.gz 2a.gz 2b.gz