redpanda-data / deployment-automation

Cluster configuration best practices
https://redpanda.com
Apache License 2.0

inotify file descriptor issue on large instances causes redpanda galaxy module to crash #211

Open WesWWagner opened 8 months ago

WesWWagner commented 8 months ago

When building a 15-node im4gn cluster with TLS and Prometheus monitoring enabled, Redpanda fails to start with the following message:

ubuntu@ip-172-31-16-44:~$ journalctl -f -u redpanda | grep -i error
Jan 23 01:13:08 ip-172-31-16-44 rpk[12253]: ERROR 2024-01-23 01:13:08,030 [shard 0] main - application.cc:388 - Failure during startup: std::__1::system_error (error system:24, could not create inotify instance: Too many open files)

ubuntu@ip-172-31-16-44:~$ ulimit -n
1024

I have not yet looked into the code for the galaxy component, but it appears that something is not raising the relevant limits (most likely inotify instances and open file descriptors) high enough before starting Redpanda for the first time on large instances, which start more threads because they have more cores.
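For anyone reproducing this, here is a minimal sketch for inspecting the limits that can make inotify_init fail with error 24 (EMFILE, "Too many open files"). It assumes a Linux host with the standard /proc/sys/fs/inotify entries. The helper names read_limit and report_limits are made up for illustration and are not part of the deployment-automation repository.

```shell
#!/bin/sh
# Sketch only: print the limits that can trigger EMFILE when creating an
# inotify instance. Helper names are hypothetical, not from the repo.

read_limit() {
    # $1: path to a /proc/sys entry; prints "unknown" if it cannot be read
    if [ -r "$1" ]; then
        cat "$1"
    else
        echo unknown
    fi
}

report_limits() {
    # Per-process open file descriptor limits for the current shell
    echo "open_files_soft=$(ulimit -S -n)"
    echo "open_files_hard=$(ulimit -H -n)"
    # Per-user inotify limits enforced by the kernel
    echo "inotify_max_user_instances=$(read_limit /proc/sys/fs/inotify/max_user_instances)"
    echo "inotify_max_user_watches=$(read_limit /proc/sys/fs/inotify/max_user_watches)"
}

report_limits
```

Note that the limits of an interactive shell can differ from those systemd applies to the service. Assuming the unit is named redpanda, as in the journalctl command above, `systemctl show redpanda -p LimitNOFILE` should report the value the service actually gets.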

I tested this on 23.3.3 and 23.2.10 and saw the same behavior on both, so it is not a recent regression.