redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

`bad_alloc` running k6 benchmark against Redpanda 22.3.6 and 22.3.11 #8459

Closed mostafa closed 7 months ago

mostafa commented 1 year ago

Version & Environment

* Redpanda version (`rpk version`): v22.3.6 (rev a266e4f)
* Docker image ID: 9ff18ef80836
* OS: Debian GNU/Linux 11 (bullseye)
* Client: kafka-go v0.4.38
* Tool: xk6-kafka v0.16.1

docker info

```bash
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.9.1)
  compose: Docker Compose (Docker Inc., v2.13.0)
  dev: Docker Dev Environments (Docker Inc., v0.0.5)
  extension: Manages Docker extensions (Docker Inc., v0.2.16)
  sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc., 0.6.0)
  scan: Docker Scan (Docker Inc., v0.22.0)

Server:
 Containers: 3
  Running: 1
  Paused: 0
  Stopped: 2
 Images: 14
 Server Version: 22.06.0-beta.0-902-g2708be0db4.m
 Storage Driver: stargz
  driver-type: io.containerd.snapshotter.v1
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 770bd0108c32f3fb5c73ae1264f7e503fe7b2661
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
 Kernel Version: 5.15.79.1-microsoft-standard-WSL2
 Operating System: Docker Desktop
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 23.47GiB
 Name: docker-desktop
 ID: ...
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  hubproxy.docker.internal:5000
  127.0.0.0/8
 Live Restore Enabled: false
```

What went wrong?

Redpanda crashed a few times, and I didn't spend much time debugging it, apart from increasing the resources of the Docker container, which only prolonged the time before the crash. The same test runs just fine against Apache Kafka on my machine at 4M messages per minute.

What should have happened instead?

Redpanda should process (up to) 4M messages successfully in a minute, as Apache Kafka does in the same environment, as shown below:

xk6-output

```bash
  execution: local
     script: scripts/test_json.js
     output: -

  scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
           * default: 50 looping VUs for 1m0s (gracefulStop: 30s)

running (1m04.4s), 00/50 VUs, 20170 complete and 0 interrupted iterations
default ✓ [======================================] 50 VUs  1m0s

✓ 10 messages are received
✓ Topic equals to xk6_kafka_json_topic
✓ Key contains key/value and is JSON
✓ Value contains key/value and is JSON
✓ Header equals {'mykey': 'myvalue'}
✓ Time is past
✓ Partition is zero
✓ Offset is gte zero
✓ High watermark is gte zero

█ teardown

  checks.........................: 100.00% ✓ 181530 ✗ 0
  data_received..................: 0 B 0 B/s
  data_sent......................: 0 B 0 B/s
  iteration_duration.............: avg=153.45ms min=6.01ms med=26.8ms max=8.14s p(90)=156.3ms p(95)=206.4ms
  iterations.....................: 20170 313.068545/s
  kafka_reader_dial_count........: 50 0.776075/s
  kafka_reader_dial_seconds......: avg=171.22µs min=0s med=0s max=1.09s p(90)=0s p(95)=0s
✓ kafka_reader_error_count.......: 0 0/s
  kafka_reader_fetch_bytes_max...: 1000000 min=1000000 max=1000000
  kafka_reader_fetch_bytes_min...: 1 min=1 max=1
  kafka_reader_fetch_wait_max....: 200ms min=200ms max=200ms
  kafka_reader_fetch_bytes.......: 58 MB 897 kB/s
  kafka_reader_fetch_size........: 147167 2284.25179/s
  kafka_reader_fetches_count.....: 107 1.6608/s
  kafka_reader_lag...............: 1519055 min=0 max=2436190
  kafka_reader_message_bytes.....: 40 MB 615 kB/s
  kafka_reader_message_count.....: 201749 3131.446006/s
  kafka_reader_offset............: 4130 min=11 max=5130
  kafka_reader_queue_capacity....: 1 min=1 max=1
  kafka_reader_queue_length......: 1 min=0 max=1
  kafka_reader_read_seconds......: avg=96.5ms min=0s med=0s max=59.37s p(90)=0s p(95)=0s
  kafka_reader_rebalance_count...: 0 0/s
  kafka_reader_timeouts_count....: 57 0.884725/s
  kafka_reader_wait_seconds......: avg=102.71µs min=0s med=0s max=85.71ms p(90)=0s p(95)=0s
  kafka_writer_acks_required.....: 0 min=0 max=0
  kafka_writer_async.............: 0.00% ✓ 0 ✗ 2017000
  kafka_writer_attempts_max......: 0 min=0 max=0
  kafka_writer_batch_bytes.......: 441 MB 6.8 MB/s
  kafka_writer_batch_max.........: 1 min=1 max=1
  kafka_writer_batch_size........: 2017000 31306.854525/s
  kafka_writer_batch_timeout.....: 0s min=0s max=0s
✓ kafka_writer_error_count.......: 0 0/s
  kafka_writer_message_bytes.....: 883 MB 14 MB/s
  kafka_writer_message_count.....: 4034000 62613.709051/s
  kafka_writer_read_timeout......: 0s min=0s max=0s
  kafka_writer_retries_count.....: 0 0/s
  kafka_writer_wait_seconds......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
  kafka_writer_write_count.......: 4034000 62613.709051/s
  kafka_writer_write_seconds.....: avg=523.21µs min=4.84µs med=14.48µs max=4.05s p(90)=33.85µs p(95)=42.68µs
  kafka_writer_write_timeout.....: 0s min=0s max=0s
  vus............................: 7 min=7 max=50
  vus_max........................: 50 min=50 max=50
```

How to reproduce the issue?

  1. Run Redpanda in Docker (in WSL2 or otherwise) using this command:
    docker run -d --pull=always --name=redpanda-1 \
    -p 8081:8081 \
    -p 8082:8082 \
    -p 9092:9092 \
    -p 9644:9644 \
    docker.redpanda.com/vectorized/redpanda:v22.3.6 \
    redpanda start \
    --overprovisioned \
    --seeds "redpanda-1:33145" \
    --set redpanda.empty_seed_starts_cluster=false \
    --smp 12  \
    --memory 20G \
    --reserve-memory 0M \
    --check=false \
    --advertise-rpc-addr redpanda-1:33145
  2. Follow the xk6-kafka instructions to build k6 with the xk6-kafka extension using xk6 (you need Go installed and available on your PATH); a sketch of the typical commands is shown after these steps.
  3. Run the test that is mentioned in xk6-kafka's README.
  4. Monitor Docker logs and k6 output.
  5. The Redpanda Docker container crashes with a dump, and k6 throws hundreds of errors to the console and fails.
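
For reference, a minimal sketch of steps 2-4, assuming Go is on your PATH (the pinned xk6-kafka version below is only illustrative):

```bash
# Install the xk6 builder and build a k6 binary with the xk6-kafka extension.
go install go.k6.io/xk6/cmd/xk6@latest
xk6 build --with github.com/mostafa/xk6-kafka@v0.16.1

# Run the JSON test script from the xk6-kafka repository while watching the
# Redpanda container logs in another terminal.
./k6 run scripts/test_json.js
docker logs -f redpanda-1
```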

Additional information

Please attach any relevant logs, backtraces, or metric charts.

JIRA Link: CORE-1152

jcsp commented 1 year ago

Redpanda is hitting a bad_alloc almost immediately after creating a single partition -- that's pretty surprising, as you have >1GiB of memory per core and there's very little else going on in the system.

If you run the Redpanda cluster the same way and produce a few records using some other utility (e.g. rpk), does it also crash? That might tell us whether there is something really broken about the way the Redpanda node is running, or whether this is related to something like the benchmark opening a large number of connections in a short time.
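
For example, a minimal sketch of that check using rpk inside the container (the topic name is arbitrary):

```bash
# Create a topic and produce a few records through the broker's Kafka API.
docker exec -it redpanda-1 rpk topic create test-topic --brokers localhost:9092
echo "hello redpanda" | docker exec -i redpanda-1 rpk topic produce test-topic --brokers localhost:9092

# Read them back to confirm the broker is healthy.
docker exec -it redpanda-1 rpk topic consume test-topic --brokers localhost:9092 --num 1
```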

mostafa commented 1 year ago

Hey @jcsp,

Thanks for the clarification. It doesn't crash when running small tests using my tool. I managed to produce 200 messages and fetch 43 against a single instance with no issues, with an average iteration duration of 13.42ms.

Result output

```bash
./k6 run scripts/test_json.js

  execution: local
     script: scripts/test_json.js
     output: -

  scenarios: (100.00%) 1 scenario, 1 max VUs, 10m30s max duration (incl. graceful stop):
           * default: 1 iterations for each of 1 VUs (maxDuration: 10m0s, gracefulStop: 30s)

running (00m00.0s), 0/1 VUs, 1 complete and 0 interrupted iterations
default ✓ [======================================] 1 VUs  00m00.0s/10m0s  1/1 iters, 1 per VU

✓ 10 messages are received
✓ Topic equals to xk6_kafka_json_topic
✓ Key contains key/value and is JSON
✓ Value contains key/value and is JSON
✓ Header equals {'mykey': 'myvalue'}
✓ Time is past
✓ Partition is zero
✓ Offset is gte zero
✓ High watermark is gte zero

█ teardown

  checks.........................: 100.00% ✓ 9 ✗ 0
  data_received..................: 0 B 0 B/s
  data_sent......................: 0 B 0 B/s
  iteration_duration.............: avg=13.42ms min=1.33ms med=13.42ms max=25.51ms p(90)=23.09ms p(95)=24.3ms
  iterations.....................: 1 32.950186/s
  kafka_reader_dial_count........: 1 32.950186/s
  kafka_reader_dial_seconds......: avg=1.21ms min=1.21ms med=1.21ms max=1.21ms p(90)=1.21ms p(95)=1.21ms
✓ kafka_reader_error_count.......: 0 0/s
  kafka_reader_fetch_bytes.......: 0 B 0 B/s
  kafka_reader_fetch_bytes_max...: 1000000 min=1000000 max=1000000
  kafka_reader_fetch_bytes_min...: 1 min=1 max=1
  kafka_reader_fetch_size........: 0 0/s
  kafka_reader_fetch_wait_max....: 10s min=10s max=10s
  kafka_reader_fetches_count.....: 1 32.950186/s
  kafka_reader_lag...............: 2126029 min=2126029 max=2126029
  kafka_reader_message_bytes.....: 8.4 kB 276 kB/s
  kafka_reader_message_count.....: 43 1416.858014/s
  kafka_reader_offset............: 43 min=43 max=43
  kafka_reader_queue_capacity....: 100 min=100 max=100
  kafka_reader_queue_length......: 33 min=33 max=33
  kafka_reader_read_seconds......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
  kafka_reader_rebalance_count...: 0 0/s
  kafka_reader_timeouts_count....: 0 0/s
  kafka_reader_wait_seconds......: avg=3.41ms min=3.41ms med=3.41ms max=3.41ms p(90)=3.41ms p(95)=3.41ms
  kafka_writer_acks_required.....: 0 min=0 max=0
  kafka_writer_async.............: 0.00% ✓ 0 ✗ 100
  kafka_writer_attempts_max......: 0 min=0 max=0
  kafka_writer_batch_bytes.......: 22 kB 721 kB/s
  kafka_writer_batch_max.........: 1 min=1 max=1
  kafka_writer_batch_size........: 100 3295.018637/s
  kafka_writer_batch_timeout.....: 0s min=0s max=0s
✓ kafka_writer_error_count.......: 0 0/s
  kafka_writer_message_bytes.....: 44 kB 1.4 MB/s
  kafka_writer_message_count.....: 200 6590.037274/s
  kafka_writer_read_timeout......: 0s min=0s max=0s
  kafka_writer_retries_count.....: 0 0/s
  kafka_writer_wait_seconds......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
  kafka_writer_write_count.......: 200 6590.037274/s
  kafka_writer_write_seconds.....: avg=30.96µs min=16.05µs med=25.23µs max=386.43µs p(90)=37.58µs p(95)=43.3µs
  kafka_writer_write_timeout.....: 0s min=0s max=0s
```
emaxerrno commented 1 year ago

@mostafa With the latest 22.3.11 and 2G/core, does it still crash? Just curious.

mostafa commented 1 year ago

@emaxerrno Will try and let you know.

mostafa commented 1 year ago

@emaxerrno This is the result of running it against v22.3.11. Please check whether I set the flags correctly. BTW, I just increased --reserve-memory to 2G; the rest of the flags remained the same.

emaxerrno commented 1 year ago

@mostafa - No, the flags used are incorrect. Use this instead:

--smp 10 --memory 20G 

and remove the --reserve-memory flag.

You can either decrease --smp or increase --memory such that you get 2G/core. --smp is the maximum core count and should be mapped to at most the number of vCPUs attached to the container, so if the container has 2 cores, then smp can be at most 2. If that fails, try 3G/core; it could be a filesystem issue on WSL pushing latency bounds. Let me know once you run it with the right flags, and also remove --reserve-memory 2G. Note that --overprovisioned uses libevent and interrupts, i.e. not a production setup; it is useful because Redpanda starts up a bit faster, but runtime is slower with interrupts due to the IPI cost of the interrupts.
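
A rough sketch of that arithmetic (the 18G/6 combination below is only illustrative):

```bash
# Memory per core is roughly --memory divided by --smp.
#   --smp 10 --memory 20G   ->  20G / 10 = 2G per core (as suggested above)
#   --smp 6  --memory 18G   ->  18G / 6  = 3G per core (if 2G per core still crashes)
```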

mostafa commented 1 year ago

@emaxerrno Tried these flags:

docker run -d --pull=always --name=redpanda-1 \
-p 8081:8081 \
-p 8082:8082 \
-p 9092:9092 \
-p 9644:9644 \
vectorized/redpanda:v22.3.11 \
redpanda start \
--seeds "redpanda-1:33145" \
--set redpanda.empty_seed_starts_cluster=false \
--smp 10  \
--memory 20G \
--check=false \
--advertise-rpc-addr redpanda-1:33145

And the same result (crash at the end of the test):

dotnwat commented 7 months ago

22.x is EOL, and Redpanda has since received significant improvements to its handling of larger memory allocations. We'd certainly be interested to hear if this continues to occur on a current release.
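
For anyone retesting, a sketch of the original single-node repro against a recent image; the image path and tag here are assumptions, so check the current Redpanda docs for the exact ones:

```bash
# Re-run the single-node setup from the original report against a recent image.
docker run -d --pull=always --name=redpanda-1 \
  -p 8081:8081 -p 8082:8082 -p 9092:9092 -p 9644:9644 \
  docker.redpanda.com/redpandadata/redpanda:latest \
  redpanda start \
  --smp 10 \
  --memory 20G \
  --seeds "redpanda-1:33145" \
  --set redpanda.empty_seed_starts_cluster=false \
  --check=false \
  --advertise-rpc-addr redpanda-1:33145
```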