redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

rpc::transport can leak background futures/tasks causing OOMs #12682

Open StephanDollberg opened 1 year ago

StephanDollberg commented 1 year ago

Version & Environment

Redpanda version: dev / v23.2.4

What went wrong?

Note that the same can happen in rpc::server on the return path: https://github.com/redpanda-data/redpanda/blob/v23.2.4/src/v/rpc/rpc_server.cc#L263

What should have happened instead?

Don't OOM

How to reproduce the issue?

The following benchrunner config:

environment:
  client:
    provider: aws
    provider_config:
      client_instance_type: c5n.9xlarge
      aws_region: us-west-2
      aws_availability_zone: us-west-2a
  redpanda:
    provider: aws
    provider_config:
      nodes: 3
      instance_type: i3en.6xlarge
      aws_region: us-west-2
      aws_availability_zone: us-west-2a
      enable_monitoring: true
      prometheus_instance_type: r5.8xlarge

deployment:
  prometheus_scrape_interval: 60s
  prometheus_scrape_timeout: 60s
  openmessaging_benchmark_repo: https://github.com/redpanda-data/openmessaging-benchmark
  openmessaging_benchmark_version: main

benchmark:
  provider: client_swarm.ClientSwarm
  client_count: 16
  topics:
    - foobar0
    - foobar1
    - foobar2
    - foobar3
    - foobar4
    - foobar5
    - foobar6
    - foobar7
    - foobar8
    - foobar9
  producers:
    connections: 5000
    message_size: 1000
    message_count: 6000
    messages_per_second: 10
    properties:
      queue.buffering.max.kbytes: 2
  consumers:
    connections: 5000
    message_count: 30000000
    properties:
      queued.max.messages.kbytes: 2
      auto.offset.reset: latest

Run this setup before starting the benchmark:

ansible redpanda[0] -f 20 -m shell -a "rpk cluster config set topic_partitions_per_shard 10000" --become -i workspace/hosts_aws_default.yaml
for i in {0..10} ; do ansible redpanda[0] -f 20 -m shell -a "rpk topic create foobar${i} -p 4000 -r 3" --become -i workspace/hosts_aws_default.yaml ; done
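
To give a sense of the partition density this setup produces, here is a rough back-of-the-envelope sketch. The topic, partition, replication and node counts are copied from the commands and config above. The shard count per broker is not stated in the issue, so `SHARDS_PER_NODE` (default 24) is purely a placeholder assumption, as is the idea that replicas are spread evenly.

```shell
#!/usr/bin/env bash
# Rough replica arithmetic for the setup commands above.
topics=11          # the loop {0..10} creates foobar0 through foobar10
                   # (the benchmark config only lists foobar0..foobar9)
partitions=4000    # rpk topic create ... -p 4000
replication=3      # rpk topic create ... -r 3
nodes=3            # redpanda.provider_config.nodes in the benchrunner config

total_replicas=$(( topics * partitions * replication ))
replicas_per_node=$(( total_replicas / nodes ))

# ASSUMPTION: the shard (core) count per broker is not given in the issue.
# 24 is a placeholder; override it with the real value for your deployment.
shards_per_node=${SHARDS_PER_NODE:-24}

# Ceiling division, assuming replicas are placed evenly across shards.
replicas_per_shard=$(( (replicas_per_node + shards_per_node - 1) / shards_per_node ))

echo "total replicas:     ${total_replicas}"
echo "replicas per node:  ${replicas_per_node}"
echo "replicas per shard: ${replicas_per_shard} (assuming ${shards_per_node} shards per node)"
```

With the numbers from the issue this gives 132000 partition replicas in total, or 44000 per broker. The per-shard figure depends entirely on the assumed shard count and is only meant to be compared against the topic_partitions_per_shard value of 10000 set above.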

Additional information

Memory sampler top sites attached: oom_log.txt

JIRA Link: CORE-1393

dotnwat commented 1 year ago

this is a great find @StephanDollberg nice!

github-actions[bot] commented 8 months ago

This issue hasn't seen activity in 3 months. If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.

piyushredpanda commented 8 months ago

@StephanDollberg this is still valid, yes?