Closed by mostafa 7 months ago
Redpanda is hitting a `bad_alloc` almost immediately after creating a single partition -- that's pretty surprising, as you have >1GiB of memory per core and there's very little else going on in the system.
If you run the Redpanda cluster the same way and produce a few records using some other utility (e.g. `rpk`), does it also crash? That might tell us whether something is really broken about the way the Redpanda node is running, or whether this is related to the benchmark opening a large number of connections in a short time.
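For reference, a minimal sketch of such a check might look like the following (the topic name is made up, and the broker address assumes the 9092 port mapping used elsewhere in this issue):

```bash
# Create a throwaway topic with a single partition (name is illustrative).
rpk topic create smoke-test --brokers localhost:9092

# Produce a handful of records; each line on stdin becomes one record.
echo '{"hello":"world"}' | rpk topic produce smoke-test --brokers localhost:9092

# Read one back to confirm the broker survives a simple produce/fetch cycle.
rpk topic consume smoke-test --num 1 --brokers localhost:9092
```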
Hey @jcsp,
Thanks for the clarification. It doesn't crash when running small tests with my tool: I managed to produce 200 messages and fetch 43 against a single instance with no issues, averaging 13.42ms per iteration.
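For context, a small run like that can be driven with a bounded iteration count instead of the full load scenario; the command below is only an illustration, assuming the `scripts/test_json.js` script from the results further down:

```bash
# Small smoke run: 1 virtual user, 200 iterations, instead of the 50-VU load test.
# Flags and script path are illustrative, not the exact command used.
./k6 run --vus 1 --iterations 200 scripts/test_json.js
```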
@mostafa w/ the latest 22.3.11 and 2G/core, does it still crash? Just curious.
@emaxerrno Will try and let you know.
@emaxerrno This is the result of running it against v22.3.11. Please check whether I set the flags correctly. BTW, I just increased `--reserve-memory` to 2G; the rest of the flags remained the same.
@mostafa - No, the flags used are incorrect. Use this instead:

`--smp 10 --memory 20G`

and remove the `--reserve-memory` flag.

You can either decrease `--smp` or increase `--memory` such that you get 2G/core. `--smp` is the max core count and should be set to at most the number of vCPUs attached to the container, so if the container has 2 cores, then `--smp` can be at most 2. If that fails, try 3G/core - this could be a filesystem issue on WSL2 pushing latency bounds. Let me know once you run it with the right flags, and again, remove `--reserve-memory 2G`. Note that `--overprovisioned` uses libevent and interrupts - i.e. not a production setup. It is useful because Redpanda starts up a bit faster, but runtime is slower with interrupts due to the IPI cost of the interrupts.
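To make the 2G/core arithmetic concrete, here is a small sketch; the 16-vCPU figure is taken from the `docker info` output further down, and the flag combinations are illustrative rather than prescribed:

```bash
# Rule of thumb from the advice above: memory >= 2 GiB per core, and
# --smp must not exceed the vCPUs attached to the container (16 on this host).
SMP=10                      # must be <= 16 vCPUs
MEMORY_GIB=$((2 * SMP))     # 2 GiB per core -> 20 GiB
echo "--smp ${SMP} --memory ${MEMORY_GIB}G"   # prints: --smp 10 --memory 20G

# If 2G/core still crashes, the suggested fallback is 3G/core,
# e.g. --smp 6 --memory 18G.
```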
@emaxerrno Tried these flags:
```bash
docker run -d --pull=always --name=redpanda-1 \
  -p 8081:8081 \
  -p 8082:8082 \
  -p 9092:9092 \
  -p 9644:9644 \
  vectorized/redpanda:v22.3.11 \
  redpanda start \
  --seeds "redpanda-1:33145" \
  --set redpanda.empty_seed_starts_cluster=false \
  --smp 10 \
  --memory 20G \
  --check=false \
  --advertise-rpc-addr redpanda-1:33145
```
And the same result (crash at the end of the test):
22.x is EOL, and Redpanda has since received significant improvements in handling larger memory allocations. We'd certainly be interested to hear if this continues to occur.
Version & Environment
Redpanda version (use `rpk version`): v22.3.6 (rev a266e4f)
Docker image ID: 9ff18ef80836
OS: Debian GNU/Linux 11 (bullseye)
Client: kafka-go v0.4.38
Tool: xk6-kafka v0.16.1

`docker info`:

```bash
Client:
 Context: default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.9.1)
  compose: Docker Compose (Docker Inc., v2.13.0)
  dev: Docker Dev Environments (Docker Inc., v0.0.5)
  extension: Manages Docker extensions (Docker Inc., v0.2.16)
  sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc., 0.6.0)
  scan: Docker Scan (Docker Inc., v0.22.0)

Server:
 Containers: 3
  Running: 1
  Paused: 0
  Stopped: 2
 Images: 14
 Server Version: 22.06.0-beta.0-902-g2708be0db4.m
 Storage Driver: stargz
  driver-type: io.containerd.snapshotter.v1
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 770bd0108c32f3fb5c73ae1264f7e503fe7b2661
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
 Kernel Version: 5.15.79.1-microsoft-standard-WSL2
 Operating System: Docker Desktop
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 23.47GiB
 Name: docker-desktop
 ID: ...
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  hubproxy.docker.internal:5000
  127.0.0.0/8
 Live Restore Enabled: false
```

What went wrong?
Redpanda crashed a few times, and I didn't spend much time with it, apart from increasing the resources of the Docker container, which only prolonged the time before the crash. Not to mention that the same test runs against Apache Kafka just fine on my machine with 4M messages per minute.
What should have happened instead?
Process (up to) 4M messages successfully in a minute, like Apache Kafka does in the same environment, as shown below:
xk6-output
```bash

          /\      |‾‾| /‾‾/   /‾‾/
     /\  /  \     |  |/  /   /  /
    /  \/    \    |     (   /   ‾‾\
   /          \   |  |\  \ |  (‾)  |
  / __________ \  |__| \__\ \_____/ .io

  execution: local
     script: scripts/test_json.js
     output: -

  scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
           * default: 50 looping VUs for 1m0s (gracefulStop: 30s)


running (1m04.4s), 00/50 VUs, 20170 complete and 0 interrupted iterations
default ✓ [======================================] 50 VUs  1m0s

     ✓ 10 messages are received
     ✓ Topic equals to xk6_kafka_json_topic
     ✓ Key contains key/value and is JSON
     ✓ Value contains key/value and is JSON
     ✓ Header equals {'mykey': 'myvalue'}
     ✓ Time is past
     ✓ Partition is zero
     ✓ Offset is gte zero
     ✓ High watermark is gte zero

     █ teardown

     checks.........................: 100.00% ✓ 181530  ✗ 0
     data_received..................: 0 B     0 B/s
     data_sent......................: 0 B     0 B/s
     iteration_duration.............: avg=153.45ms min=6.01ms med=26.8ms max=8.14s p(90)=156.3ms p(95)=206.4ms
     iterations.....................: 20170   313.068545/s
     kafka_reader_dial_count........: 50      0.776075/s
     kafka_reader_dial_seconds......: avg=171.22µs min=0s med=0s max=1.09s p(90)=0s p(95)=0s
   ✓ kafka_reader_error_count.......: 0       0/s
     kafka_reader_fetch_bytes_max...: 1000000 min=1000000 max=1000000
     kafka_reader_fetch_bytes_min...: 1       min=1 max=1
     kafka_reader_fetch_wait_max....: 200ms   min=200ms max=200ms
     kafka_reader_fetch_bytes.......: 58 MB   897 kB/s
     kafka_reader_fetch_size........: 147167  2284.25179/s
     kafka_reader_fetches_count.....: 107     1.6608/s
     kafka_reader_lag...............: 1519055 min=0 max=2436190
     kafka_reader_message_bytes.....: 40 MB   615 kB/s
     kafka_reader_message_count....: 201749  3131.446006/s
     kafka_reader_offset............: 4130    min=11 max=5130
     kafka_reader_queue_capacity....: 1       min=1 max=1
     kafka_reader_queue_length......: 1       min=0 max=1
     kafka_reader_read_seconds......: avg=96.5ms min=0s med=0s max=59.37s p(90)=0s p(95)=0s
     kafka_reader_rebalance_count...: 0       0/s
     kafka_reader_timeouts_count....: 57      0.884725/s
     kafka_reader_wait_seconds......: avg=102.71µs min=0s med=0s max=85.71ms p(90)=0s p(95)=0s
     kafka_writer_acks_required.....: 0       min=0 max=0
     kafka_writer_async.............: 0.00%   ✓ 0 ✗ 2017000
     kafka_writer_attempts_max......: 0       min=0 max=0
     kafka_writer_batch_bytes.......: 441 MB  6.8 MB/s
     kafka_writer_batch_max.........: 1       min=1 max=1
     kafka_writer_batch_size........: 2017000 31306.854525/s
     kafka_writer_batch_timeout.....: 0s      min=0s max=0s
   ✓ kafka_writer_error_count.......: 0       0/s
     kafka_writer_message_bytes.....: 883 MB  14 MB/s
     kafka_writer_message_count.....: 4034000 62613.709051/s
     kafka_writer_read_timeout......: 0s      min=0s max=0s
     kafka_writer_retries_count.....: 0       0/s
     kafka_writer_wait_seconds......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
     kafka_writer_write_count.......: 4034000 62613.709051/s
     kafka_writer_write_seconds.....: avg=523.21µs min=4.84µs med=14.48µs max=4.05s p(90)=33.85µs p(95)=42.68µs
     kafka_writer_write_timeout.....: 0s      min=0s max=0s
     vus............................: 7       min=7 max=50
     vus_max........................: 50      min=50 max=50
```

How to reproduce the issue?
PATH).

Additional information
Please attach any relevant logs, backtraces, or metric charts.
JIRA Link: CORE-1152