risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.

Meta node can only handle ~600 Kafka sources. #18949

Open ka-weihe opened 3 days ago

ka-weihe commented 3 days ago

Describe the bug

I attempted to create 700 Kafka Plain Avro sources. However, after approximately 600 sources had been created successfully, the meta-node became unresponsive and I encountered the following error:

ERROR librdkafka: librdkafka: THREAD [thrd:main]: Unable to create broker thread
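
For context, each of the ~700 sources was created with a statement roughly of this shape (a sketch only; the source and topic names, broker list, and schema-registry URL are placeholders, and 4566/dev/root are RisingWave's default frontend port, database, and user):

```bash
# Sketch of one of the ~700 Kafka Plain Avro source definitions (placeholders throughout).
psql -h risingwave-frontend -p 4566 -d dev -U root -c "
  CREATE SOURCE kafka_source_001
  WITH (
    connector = 'kafka',
    topic = 'topic-001',
    properties.bootstrap.server = 'broker-0:9092,broker-1:9092'
  )
  FORMAT PLAIN ENCODE AVRO (
    schema.registry = 'http://schema-registry:8081'
  );"
```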

Additionally, when I shell into the meta-node and try to execute any commands, I get:

bash: fork: retry: Resource temporarily unavailable

To investigate, I ran ps -eLf | wc -l to monitor the number of OS threads while creating the sources. The thread count had reached 32,769 after I had created 629 sources, at which point I could no longer execute commands, most likely because the OS had exhausted the resources it needs to create new threads.
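
For reference, the growth can be watched per process as well as node-wide (the pgrep pattern below is an assumption about how the meta process is named inside the container):

```bash
# Node-wide thread count, as used above:
ps -eLf | wc -l

# Threads of the meta-node process specifically (process name pattern is assumed):
META_PID=$(pgrep -f 'risingwave.*meta' | head -n 1)
grep Threads "/proc/${META_PID}/status"
```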

This behavior seems like a bug: I would expect the number of threads to be significantly lower than the number of sources. Even if there were only one OS thread per source, that would be inefficient but would still fit within our use-case requirements; instead, we are seeing more than 50 OS threads per source. It is also worth noting that none of the sources were being actively used at any time.

Error message/log

ERROR librdkafka: librdkafka: THREAD [thrd:main]: Unable to create broker thread

To Reproduce
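
A minimal sketch, assuming the defaults above: create the sources in a loop and watch the meta-node thread count climb (hostnames, topics, broker list, and schema registry are placeholders).

```bash
# Placeholders throughout; each iteration creates one Kafka Plain Avro source.
for i in $(seq -w 1 700); do
  psql -h risingwave-frontend -p 4566 -d dev -U root -c "
    CREATE SOURCE kafka_source_$i WITH (
      connector = 'kafka',
      topic = 'topic-$i',
      properties.bootstrap.server = 'broker-0:9092,broker-1:9092'
    ) FORMAT PLAIN ENCODE AVRO (schema.registry = 'http://schema-registry:8081');"
  # After each statement (or each batch), check inside the meta-node pod:
  #   ps -eLf | wc -l
done
```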

Expected behavior

I expect all of the sources to be created successfully, and the meta node to remain responsive and not consume excessive resources.

How did you deploy RisingWave?

More or less like this: https://github.com/risingwavelabs/risingwave-operator/blob/main/docs/manifests/risingwave/risingwave-postgresql-s3.yaml

But with higher resource limits.

The version of RisingWave

PostgreSQL 13.14.0-RisingWave-2.0.1 (0d15632ac8d18fbd323982b53bc2d508dc7e148a)

Additional context

No response

xxchan commented 2 days ago

According to https://github.com/confluentinc/librdkafka/issues/1600, it seems each Kafka consumer creates roughly num_brokers threads, and threads cannot be shared across client instances.

In Meta, each source currently has a KafkaSplitEnumerator, which holds its own Kafka consumer, so this looks like exactly your problem (1 source ≈ 50 threads). In theory, we could improve Meta by sharing consumers that connect to the same brokers.
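
A rough sanity check of that estimate, assuming each consumer spawns about num_brokers + 2 threads (one per broker plus librdkafka's internal threads), and treating 50 brokers as an inference from the thread counts rather than a confirmed number:

```bash
SOURCES=629      # sources created when the node locked up
BROKERS=50       # assumed; inferred from the ~50 threads observed per source
echo $(( SOURCES * (BROKERS + 2) ))   # 32708, close to the observed 32,769 threads
```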

However, I'm concerned that even after fixing Meta, compute nodes still could not handle it, since they will have even more Kafka consumers (multiplied by the parallelism).
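
For a sense of scale, a sketch with a made-up per-source parallelism of 4 (the actual parallelism isn't stated in this issue):

```bash
SOURCES=700
PARALLELISM=4    # hypothetical per-source parallelism
BROKERS=50       # assumed, as above
# If each parallel reader held its own consumer, across all compute nodes:
echo $(( SOURCES * PARALLELISM * (BROKERS + 2) ))   # 145600 threads
```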

I feel 50 brokers might be too many. Is it possible for you to split the cluster into multiple smaller ones with fewer brokers? 🤔