netdata / netdata

Architected for speed. Automated for easy. Monitoring and troubleshooting, transformed!
https://www.netdata.cloud
GNU General Public License v3.0
72.15k stars 5.94k forks source link

[Bug]: Spam from apps_group_file_descriptors_utilization #17332

Closed mhbates closed 7 months ago

mhbates commented 7 months ago

Bug description

Recently, starting around 3/28/24, I started getting a lot of these alerts for the "torrents" app group. What's strange is that I've since increased the open file limits in Ubuntu by many factors, and the particular limit for the individual docker container that I believe is triggering it on my end is even higher than the system-wide limits.

The open files limit for the docker container is 1,048,576. The soft limit system-wide is 8192, which is up from the default 1024. As I increase the soft and hard limits system-wide, I am not really seeing any change with Netdata's alerts; they're still getting triggered and still showing the same percentages as before.

image

Expected behavior

It feels like Netdata might be calculating these limits incorrectly and throwing alerts when it shouldn't be.

Steps to reproduce

  1. Update Netdata to version that released around 3/28/24 (I believe)
  2. Use torrenting software in a docker container
  3. Trigger these limit alerts

Installation method

kickstart.sh

System info

Linux ogego 5.4.0-174-generic #193-Ubuntu SMP Thu Mar 7 14:29:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
/etc/lsb-release:DISTRIB_ID=Ubuntu
/etc/lsb-release:DISTRIB_RELEASE=20.04
/etc/lsb-release:DISTRIB_CODENAME=focal
/etc/lsb-release:DISTRIB_DESCRIPTION="Ubuntu 20.04.6 LTS"
/etc/os-release:NAME="Ubuntu"
/etc/os-release:VERSION="20.04.6 LTS (Focal Fossa)"
/etc/os-release:ID=ubuntu
/etc/os-release:ID_LIKE=debian
/etc/os-release:PRETTY_NAME="Ubuntu 20.04.6 LTS"
/etc/os-release:VERSION_ID="20.04"
/etc/os-release:VERSION_CODENAME=focal
/etc/os-release:UBUNTU_CODENAME=focal

Netdata build info

Packaging:
    Netdata Version ____________________________________________ : v1.45.2
    Installation Type __________________________________________ : binpkg-deb
    Package Architecture _______________________________________ : x86_64
    Package Distro _____________________________________________ :  
    Configure Options __________________________________________ : dummy-configure-command
Default Directories:
    User Configurations ________________________________________ : /etc/netdata
    Stock Configurations _______________________________________ : /usr/lib/netdata/conf.d
    Ephemeral Databases (metrics data, metadata) _______________ : /var/cache/netdata
    Permanent Databases ________________________________________ : /var/lib/netdata
    Plugins ____________________________________________________ : /usr/libexec/netdata/plugins.d
    Static Web Files ___________________________________________ : /var/lib/netdata/www
    Log Files __________________________________________________ : /var/log/netdata
    Lock Files _________________________________________________ : /var/lib/netdata/lock
    Home _______________________________________________________ : /var/lib/netdata
Operating System:
    Kernel _____________________________________________________ : Linux
    Kernel Version _____________________________________________ : 5.4.0-174-generic
    Operating System ___________________________________________ : Ubuntu
    Operating System ID ________________________________________ : ubuntu
    Operating System ID Like ___________________________________ : debian
    Operating System Version ___________________________________ : 20.04.6 LTS (Focal Fossa)
    Operating System Version ID ________________________________ : none
    Detection __________________________________________________ : /etc/os-release
Hardware:
    CPU Cores __________________________________________________ : 16
    CPU Frequency ______________________________________________ : 2900000000
    RAM Bytes __________________________________________________ : 67266920448
    Disk Capacity ______________________________________________ : 92007918870016
    CPU Architecture ___________________________________________ : x86_64
    Virtualization Technology __________________________________ : none
    Virtualization Detection ___________________________________ : systemd-detect-virt
Container:
    Container __________________________________________________ : none
    Container Detection ________________________________________ : systemd-detect-virt
    Container Orchestrator _____________________________________ : none
    Container Operating System _________________________________ : none
    Container Operating System ID ______________________________ : none
    Container Operating System ID Like _________________________ : none
    Container Operating System Version _________________________ : none
    Container Operating System Version ID ______________________ : none
    Container Operating System Detection _______________________ : none
Features:
    Built For __________________________________________________ : Linux
    Netdata Cloud ______________________________________________ : YES
    Health (trigger alerts and send notifications) _____________ : YES
    Streaming (stream metrics to parent Netdata servers) _______ : YES
    Back-filling (of higher database tiers) ____________________ : YES
    Replication (fill the gaps of parent Netdata servers) ______ : YES
    Streaming and Replication Compression ______________________ : YES (zstd lz4 gzip)
    Contexts (index all active and archived metrics) ___________ : YES
    Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
    Machine Learning ___________________________________________ : YES
Database Engines:
    dbengine ___________________________________________________ : YES
    alloc ______________________________________________________ : YES
    ram ________________________________________________________ : YES
    none _______________________________________________________ : YES
Connectivity Capabilities:
    ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : YES
    static (Netdata internal web server) _______________________ : YES
    h2o (web server) ___________________________________________ : YES
    WebRTC (experimental) ______________________________________ : NO
    Native HTTPS (TLS Support) _________________________________ : YES
    TLS Host Verification ______________________________________ : YES
Libraries:
    LZ4 (extremely fast lossless compression algorithm) ________ : YES
    ZSTD (fast, lossless compression algorithm) ________________ : YES
    zlib (lossless data-compression library) ___________________ : YES
    Brotli (generic-purpose lossless compression algorithm) ____ : NO
    protobuf (platform-neutral data serialization protocol) ____ : YES (system)
    OpenSSL (cryptography) _____________________________________ : YES
    libdatachannel (stand-alone WebRTC data channels) __________ : NO
    JSON-C (lightweight JSON manipulation) _____________________ : YES
    libcap (Linux capabilities system operations) ______________ : NO
    libcrypto (cryptographic functions) ________________________ : YES
    libyaml (library for parsing and emitting YAML) ____________ : YES
Plugins:
    apps (monitor processes) ___________________________________ : YES
    cgroups (monitor containers and VMs) _______________________ : YES
    cgroup-network (associate interfaces to CGROUPS) ___________ : YES
    proc (monitor Linux systems) _______________________________ : YES
    tc (monitor Linux network QoS) _____________________________ : YES
    diskspace (monitor Linux mount points) _____________________ : YES
    freebsd (monitor FreeBSD systems) __________________________ : NO
    macos (monitor MacOS systems) ______________________________ : NO
    statsd (collect custom application metrics) ________________ : YES
    timex (check system clock synchronization) _________________ : YES
    idlejitter (check system latency and jitter) _______________ : YES
    bash (support shell data collection jobs - charts.d) _______ : YES
    debugfs (kernel debugging metrics) _________________________ : YES
    cups (monitor printers and print jobs) _____________________ : YES
    ebpf (monitor system calls) ________________________________ : YES
    freeipmi (monitor enterprise server H/W) ___________________ : YES
    nfacct (gather netfilter accounting) _______________________ : YES
    perf (collect kernel performance events) ___________________ : YES
    slabinfo (monitor kernel object caching) ___________________ : YES
    Xen ________________________________________________________ : YES
    Xen VBD Error Tracking _____________________________________ : NO
    Logs Management ____________________________________________ : YES
Exporters:
    AWS Kinesis ________________________________________________ : NO
    GCP PubSub _________________________________________________ : NO
    MongoDB ____________________________________________________ : YES
    Prometheus (OpenMetrics) Exporter __________________________ : YES
    Prometheus Remote Write ____________________________________ : YES
    Graphite ___________________________________________________ : YES
    Graphite HTTP / HTTPS ______________________________________ : YES
    JSON _______________________________________________________ : YES
    JSON HTTP / HTTPS __________________________________________ : YES
    OpenTSDB ___________________________________________________ : YES
    OpenTSDB HTTP / HTTPS ______________________________________ : YES
    All Metrics API ____________________________________________ : YES
    Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
    Trace All Netdata Allocations (with charts) ________________ : NO
    Developer Mode (more runtime checks, slower) _______________ : NO

Additional info

No response

mhbates commented 7 months ago

I've been doing more research and it looks like there's a chance that this is a bug with the Transmission torrent client: https://github.com/transmission/transmission/issues/5385

It might just be coincidental that Netdata started reporting this for me recently; perhaps Netdata recently added this alert back in, or my setup with Transmission started to hit a point where it now triggers this bug. I'm going to try rolling Transmission back to 4.0.2 as mentioned in that linked issue and see if that helps.

ilyam8 commented 7 months ago

@mhbates Did you find which process used 100% of its limit (according to Netdata)? You can do it by :

A screenshot from my server:

Screenshot 2024-04-08 at 10 58 47

See 768 uses 20 FDs and it is 1.95%:

$ cat /proc/768/limits | grep "Max open files"
Max open files            1024                 524288               files

=> 20 ÷ 1024 * 100 = 1.95%

netdata-community-bot commented 7 months ago

This issue has been mentioned on the Netdata Community Forums. There might be relevant details there:

https://community.netdata.cloud/t/warning-alert-for-apps-group-file-descriptors-utilization/5278/3

netdata-community-bot commented 4 months ago

This issue has been mentioned on the Netdata Community Forums. There might be relevant details there:

https://community.netdata.cloud/t/how-is-the-value-of-app-sql-fds-open-limit-calculated/5600/2

netdata-community-bot commented 2 months ago

This issue has been mentioned on the Netdata Community Forums. There might be relevant details there:

https://community.netdata.cloud/t/app-sql-fds-open-limit-calculation-issue/5762/4