networkupstools / nut

The Network UPS Tools repository. UPS management protocol Informational RFC 9271 published by IETF at https://www.rfc-editor.org/info/rfc9271 Please star NUT on GitHub, this helps with sponsorships!
https://networkupstools.org/

Thread count in parallel `nut-scanner` should scale down in case of "Too many open files" #2576

Open jimklimov opened 1 month ago

jimklimov commented 1 month ago

As briefly noted in issue #2575 and in PRs that dealt with parallelized scans in `nut-scanner`, the tool may run out of file descriptors depending on platform defaults, the particular OS deployment, and third-party library specifics, even though it already tries to adapt its maximums to `ulimit` information where available.
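For context, deriving a thread cap from the `ulimit` information could look roughly like the sketch below. This is only an illustration under assumptions, not the actual `nut-scanner` code: the helper names `fd_budget_to_threads()` and `suggested_max_threads()`, the `fds_per_thread` estimate and the `fallback` value are all hypothetical, and only `getrlimit()` with `RLIMIT_NOFILE` is a real POSIX API.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical helper: turn a soft FD limit into a thread cap.
 * reserve_fds     - descriptors kept aside for stdio, libraries, etc.
 * fds_per_thread  - guessed FD cost of one scanning thread (an assumption;
 *                   third-party libraries may use more than we expect)
 * fallback        - cap to use when the limit is unknown or unlimited */
size_t fd_budget_to_threads(rlim_t soft_limit, size_t reserve_fds,
                            size_t fds_per_thread, size_t fallback)
{
    assert(fds_per_thread > 0);

    if (soft_limit == RLIM_INFINITY)
        return fallback;

    /* Nothing left after the reserve: still allow one thread */
    if (soft_limit <= (rlim_t)reserve_fds)
        return 1;

    rlim_t n = (soft_limit - (rlim_t)reserve_fds) / (rlim_t)fds_per_thread;
    return n > 0 ? (size_t)n : 1;
}

/* Hypothetical wrapper: query the current soft limit of this process */
size_t suggested_max_threads(size_t reserve_fds, size_t fds_per_thread,
                             size_t fallback)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return fallback;    /* limit not available on this platform */

    return fd_budget_to_threads(rl.rlim_cur, reserve_fds,
                                fds_per_thread, fallback);
}
```

The weak spot is `fds_per_thread`: it is a guess about third-party behavior, so any static estimate like this can still be wrong at run time, which is why a dynamic reaction to "Too many open files" is worth having as well.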

As seen recently, culminating in commit 2c3a09ef0cbc845d53f603fdf9316c6f0f901979 of PR #2539 (issue #2511), certain libnetsnmp builds can consume FDs for network sockets, for local filesystem lookups of per-host configuration files or MIB files, for directory scanning during those searches, etc. This is a variable beyond our control: different implementations and versions of third-party code can behave as they please. Here is an example staged with that commit reverted and a scan of a large network range:

...
   0.321562     [D5] nutscan_ip_ranges_iter_inc: got IP from range: 172.28.67.254
   0.321597     [D4] nutscan_scan_ip_range_snmp: max_threads_scantype=0 curr_threads=1022 thread_count=1022 stwST=-1 stwS=0 pass=1
   0.321573     [D2] Entering try_SysOID_thready for 172.28.67.253
   0.321667     [D5] nutscan_ip_ranges_iter_inc: got IP from range: 172.28.67.255
   0.321703     [D4] nutscan_scan_ip_range_snmp: max_threads_scantype=0 curr_threads=1023 thread_count=1023 stwST=-1 stwS=0 pass=1
   0.321677     [D2] Entering try_SysOID_thready for 172.28.67.254
   0.321782     [D5] nutscan_ip_ranges_iter_inc: got IP from range: 172.28.68.0
   0.321817     [D4] nutscan_scan_ip_range_snmp: max_threads_scantype=0 curr_threads=1024 thread_count=1024 stwST=-1 stwS=-1 pass=0
   0.321851     [D2] nutscan_scan_ip_range_snmp: Running too many scanning threads (1024), waiting until older ones would finish
   0.321796     [D2] Entering try_SysOID_thready for 172.28.67.255
   0.475060     [D2] Failed to open SNMP session for 172.28.67.147
/var/lib/snmp/hosts/172.28.66.252.local.conf: Too many open files
/var/lib/snmp/hosts/172.28.65.208.local.conf: Too many open files

<blocks on "too many threads" anyway, but skips a number of hosts> 

What we can do is not abort the scans upon any hiccup, but instead check for `errno == EMFILE`, then delay and retry later (or maybe even actively decrease the thread maximum variable of the process). We already have a way to detect "Running too many scanning threads (NUM), waiting until older ones would finish", so this is about detecting the issue and extending the criteria.
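A minimal sketch of that idea, under stated assumptions, might look like the code below. Everything named here is hypothetical (`open_with_emfile_retry()`, the `open_fn` callback type, the `demo_opener()` stand-in, the halving policy and the fixed delay); it is not existing NUT code. It also assumes the opener leaves `errno` set on failure, and it ignores locking, which a real shared thread maximum would need.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: returns a handle, or NULL with errno set */
typedef void *(*open_fn)(void *arg);

/* Retry opener(arg) while it fails with EMFILE/ENFILE, sleeping delay_ms
 * between attempts and halving *max_threads (never below 1) each time.
 * NOT thread-safe: a real shared cap would need a mutex or atomics. */
void *open_with_emfile_retry(open_fn opener, void *arg, int max_attempts,
                             long delay_ms, size_t *max_threads)
{
    int saved_errno = 0;

    assert(max_attempts > 0);

    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        errno = 0;
        void *handle = opener(arg);
        if (handle != NULL)
            return handle;

        saved_errno = errno;
        if (saved_errno != EMFILE && saved_errno != ENFILE)
            break;  /* unrelated failure: retrying would not help */

        /* Scale down so fewer new threads compete for descriptors */
        if (max_threads != NULL && *max_threads > 1)
            *max_threads /= 2;

        if (attempt < max_attempts) {
            struct timespec ts;
            ts.tv_sec = delay_ms / 1000;
            ts.tv_nsec = (delay_ms % 1000) * 1000000L;
            nanosleep(&ts, NULL);   /* give older threads time to finish */
        }
    }

    errno = saved_errno;    /* let the caller log the real cause */
    return NULL;
}

/* Demo-only opener: fails with EMFILE until the counter reaches zero */
static int demo_token;

void *demo_opener(void *arg)
{
    int *remaining_failures = arg;

    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        errno = EMFILE;
        return NULL;
    }
    return &demo_token;
}
```

In the scanner, the callback would wrap whatever actually opens the session, and the caller would still skip the host (with `errno` logged) once the attempts run out. Halving is just one possible policy; decrementing by a fixed step or only delaying without touching the maximum would fit the same shape.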

jimklimov commented 1 month ago

I experimented with a change to log errno, and yes: at the `nut-scanner` level, at least for this use case, we do know the cause of the problem:

diff --git a/tools/nut-scanner/nut-scanner.c b/tools/nut-scanner/nut-scanner.c
index a3d785f5a..711dc3307 100644
--- a/tools/nut-scanner/nut-scanner.c
+++ b/tools/nut-scanner/nut-scanner.c
@@ -84,7 +84,7 @@
  * Another +1 is for NetSNMP which wants to open MIB files,
  * potential per-host configuration files, etc.
  */
-#   define RESERVE_FD_COUNT 4
+#   define RESERVE_FD_COUNT 0
 #  endif /* HAVE_SYS_RESOURCE_H */
 # endif  /* HAVE_PTHREAD_TRYJOIN || HAVE_SEMAPHORE_UNNAMED || HAVE_SEMAPHORE_NAMED */
 #endif   /* HAVE_PTHREAD */
diff --git a/tools/nut-scanner/scan_snmp.c b/tools/nut-scanner/scan_snmp.c
index a8c3b42cb..fc3826454 100644
--- a/tools/nut-scanner/scan_snmp.c
+++ b/tools/nut-scanner/scan_snmp.c
@@ -969,7 +969,7 @@ static void * try_SysOID_thready(void * arg)
        /* Open the session */
        handle = wrap_nut_snmp_sess_open(&snmp_sess); /* establish the session */
        if (handle == NULL) {
-               upsdebugx(2,
+               upsdebug_with_errno(2,
                        "Failed to open SNMP session for %s",
                        sec->peername);
                goto try_SysOID_free;

...leads to:

...
   0.296940     [D2] Entering try_SysOID_thready for 172.28.67.252
   0.297073     [D5] nutscan_ip_ranges_iter_inc: got IP from range: 172.28.67.254
   0.297115     [D4] nutscan_scan_ip_range_snmp: max_threads_scantype=0 curr_threads=1022 thread_count=1022 stwST=-1 stwS=0 pass=1
   0.297190     [D5] nutscan_ip_ranges_iter_inc: got IP from range: 172.28.67.255
   0.297235     [D4] nutscan_scan_ip_range_snmp: max_threads_scantype=0 curr_threads=1023 thread_count=1023 stwST=-1 stwS=0 pass=1
   0.297083     [D2] Entering try_SysOID_thready for 172.28.67.253
   0.297190     [D2] Entering try_SysOID_thready for 172.28.67.254
   0.297351     [D2] Entering try_SysOID_thready for 172.28.67.255
   0.297359     [D5] nutscan_ip_ranges_iter_inc: got IP from range: 172.28.68.0
   0.297396     [D4] nutscan_scan_ip_range_snmp: max_threads_scantype=0 curr_threads=1024 thread_count=1024 stwST=-1 stwS=-1 pass=0
   0.297413     [D2] nutscan_scan_ip_range_snmp: Running too many scanning threads (1024), waiting until older ones would finish
/var/lib/snmp/hosts/172.28.67.165.local.conf: Too many open files
   0.378710     [D2] Failed to open SNMP session for 172.28.65.167: Too many open files
   0.378813     [D2] Failed to open SNMP session for 172.28.65.113: Too many open files
   0.378755     [D2] Failed to open SNMP session for 172.28.67.165: Too many open files
^C