netdata / netdata-cloud

The public repository of Netdata Cloud. Contribute with bug reports and feature requests.
GNU General Public License v3.0

[Bug]: Dynamic Configuration Manager: Unable to Assign Existing Monitoring Jobs to New Node #1035

Closed (onion83 closed this issue 1 week ago)

onion83 commented 1 month ago

Bug description

When using the Dynamic Configuration Manager in Netdata Cloud, I get the error message "Unknown config id given" when trying to assign existing monitoring jobs to a new node by clicking "Submit to multiple nodes".

[Screenshots: iShot_2024-07-15_00 23 15, iShot_2024-07-15_00 23 42]

Expected behavior

The existing job is assigned to the new node successfully.

Steps to Reproduce

  1. Perform a Full New Install of a Node (LXC by Proxmox):

    • Complete a fresh installation of a node using LXC in Proxmox.
  2. Install Using Integrations Auto Shell Script:

    • Run the auto-install shell script (kickstart.sh) and wait for the node to appear as active in the Netdata Cloud dashboard.
  3. Submit Tasks to Multiple Nodes:

    • Navigate to "Manage Space / Netdata Space / Configurations".
    • Edit an existing job, such as "ping", and attempt to submit it to multiple nodes.

Installation method

kickstart.sh

System info

Linux netdata-sz-ctc 5.15.149-1-pve #1 SMP PVE 5.15.149-1 (2024-03-29T14:24Z) x86_64 x86_64 x86_64 GNU/Linux
/etc/os-release:NAME="Rocky Linux"
/etc/os-release:VERSION="9.4 (Blue Onyx)"
/etc/os-release:ID="rocky"
/etc/os-release:ID_LIKE="rhel centos fedora"
/etc/os-release:VERSION_ID="9.4"
/etc/os-release:PLATFORM_ID="platform:el9"
/etc/os-release:PRETTY_NAME="Rocky Linux 9.4 (Blue Onyx)"
/etc/os-release:ANSI_COLOR="0;32"
/etc/os-release:LOGO="fedora-logo-icon"
/etc/os-release:CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
/etc/os-release:SUPPORT_END="2032-05-31"
/etc/os-release:ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
/etc/os-release:ROCKY_SUPPORT_PRODUCT_VERSION="9.4"
/etc/os-release:REDHAT_SUPPORT_PRODUCT="Rocky Linux"
/etc/os-release:REDHAT_SUPPORT_PRODUCT_VERSION="9.4"
/etc/redhat-release:Rocky Linux release 9.4 (Blue Onyx)
/etc/rocky-release:Rocky Linux release 9.4 (Blue Onyx)
/etc/system-release:Rocky Linux release 9.4 (Blue Onyx)

Netdata build info

Packaging:
    Netdata Version ____________________________________________ : v1.46.2
    Installation Type __________________________________________ : binpkg-rpm
    Package Architecture _______________________________________ : x86_64
    Package Distro _____________________________________________ :
    Configure Options __________________________________________ : dummy-configure-command
Default Directories:
    User Configurations ________________________________________ : /etc/netdata
    Stock Configurations _______________________________________ : /usr/lib/netdata/conf.d
    Ephemeral Databases (metrics data, metadata) _______________ : /var/cache/netdata
    Permanent Databases ________________________________________ : /var/lib/netdata
    Plugins ____________________________________________________ : /usr/libexec/netdata/plugins.d
    Static Web Files ___________________________________________ : /usr/share/netdata/web
    Log Files __________________________________________________ : /var/log/netdata
    Lock Files _________________________________________________ : /var/lib/netdata/lock
    Home _______________________________________________________ : /var/lib/netdata
Operating System:
    Kernel _____________________________________________________ : Linux
    Kernel Version _____________________________________________ : 5.15.149-1-pve
    Operating System ___________________________________________ : unknown
    Operating System ID ________________________________________ : unknown
    Operating System ID Like ___________________________________ : unknown
    Operating System Version ___________________________________ : unknown
    Operating System Version ID ________________________________ : 9.4
    Detection __________________________________________________ : unknown
Hardware:
    CPU Cores __________________________________________________ : 2
    CPU Frequency ______________________________________________ : 2000000000
    RAM Bytes __________________________________________________ : 2147483648
    Disk Capacity ______________________________________________ : 375141883904
    CPU Architecture ___________________________________________ : x86_64
    Virtualization Technology __________________________________ : none
    Virtualization Detection ___________________________________ : systemd-detect-virt
Container:
    Container __________________________________________________ : lxc
    Container Detection ________________________________________ : systemd-detect-virt
    Container Orchestrator _____________________________________ : none
    Container Operating System _________________________________ : Rocky Linux
    Container Operating System ID ______________________________ : rocky
    Container Operating System ID Like _________________________ : rhel centos fedora
    Container Operating System Version _________________________ : 9.4 (Blue Onyx)
    Container Operating System Version ID ______________________ : 9.4
    Container Operating System Detection _______________________ : /etc/os-release
Features:
    Built For __________________________________________________ : Linux
    Netdata Cloud ______________________________________________ : YES
    Health (trigger alerts and send notifications) _____________ : YES
    Streaming (stream metrics to parent Netdata servers) _______ : YES
    Back-filling (of higher database tiers) ____________________ : YES
    Replication (fill the gaps of parent Netdata servers) ______ : YES
    Streaming and Replication Compression ______________________ : YES (zstd lz4 gzip)
    Contexts (index all active and archived metrics) ___________ : YES
    Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
    Machine Learning ___________________________________________ : YES
Database Engines:
    dbengine (compression) _____________________________________ : YES (zstd lz4)
    alloc ______________________________________________________ : YES
    ram ________________________________________________________ : YES
    none _______________________________________________________ : YES
Connectivity Capabilities:
    ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : YES
    static (Netdata internal web server) _______________________ : YES
    h2o (web server) ___________________________________________ : YES
    WebRTC (experimental) ______________________________________ : NO
    Native HTTPS (TLS Support) _________________________________ : YES
    TLS Host Verification ______________________________________ : YES
Libraries:
    LZ4 (extremely fast lossless compression algorithm) ________ : YES
    ZSTD (fast, lossless compression algorithm) ________________ : YES
    zlib (lossless data-compression library) ___________________ : YES
    Brotli (generic-purpose lossless compression algorithm) ____ : NO
    protobuf (platform-neutral data serialization protocol) ____ : YES (system)
    OpenSSL (cryptography) _____________________________________ : YES
    libdatachannel (stand-alone WebRTC data channels) __________ : NO
    JSON-C (lightweight JSON manipulation) _____________________ : YES
    libcap (Linux capabilities system operations) ______________ : NO
    libcrypto (cryptographic functions) ________________________ : YES
    libyaml (library for parsing and emitting YAML) ____________ : YES
Plugins:
    apps (monitor processes) ___________________________________ : YES
    cgroups (monitor containers and VMs) _______________________ : YES
    cgroup-network (associate interfaces to CGROUPS) ___________ : YES
    proc (monitor Linux systems) _______________________________ : YES
    tc (monitor Linux network QoS) _____________________________ : YES
    diskspace (monitor Linux mount points) _____________________ : YES
    freebsd (monitor FreeBSD systems) __________________________ : NO
    macos (monitor MacOS systems) ______________________________ : NO
    statsd (collect custom application metrics) ________________ : YES
    timex (check system clock synchronization) _________________ : YES
    idlejitter (check system latency and jitter) _______________ : YES
    bash (support shell data collection jobs - charts.d) _______ : YES
    debugfs (kernel debugging metrics) _________________________ : YES
    cups (monitor printers and print jobs) _____________________ : YES
    ebpf (monitor system calls) ________________________________ : YES
    freeipmi (monitor enterprise server H/W) ___________________ : YES
    nfacct (gather netfilter accounting) _______________________ : NO
    perf (collect kernel performance events) ___________________ : YES
    slabinfo (monitor kernel object caching) ___________________ : YES
    Xen ________________________________________________________ : NO
    Xen VBD Error Tracking _____________________________________ : NO
    Logs Management ____________________________________________ : YES
Exporters:
    AWS Kinesis ________________________________________________ : NO
    GCP PubSub _________________________________________________ : NO
    MongoDB ____________________________________________________ : YES
    Prometheus (OpenMetrics) Exporter __________________________ : YES
    Prometheus Remote Write ____________________________________ : YES
    Graphite ___________________________________________________ : YES
    Graphite HTTP / HTTPS ______________________________________ : YES
    JSON _______________________________________________________ : YES
    JSON HTTP / HTTPS __________________________________________ : YES
    OpenTSDB ___________________________________________________ : YES
    OpenTSDB HTTP / HTTPS ______________________________________ : YES
    All Metrics API ____________________________________________ : YES
    Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
    Trace All Netdata Allocations (with charts) ________________ : NO
    Developer Mode (more runtime checks, slower) _______________ : NO

Additional info

No response

kapantzak commented 1 month ago

Thank you @onion83 for reporting! We're investigating in order to fix this soon.

ilyam8 commented 1 month ago

@onion83, hey. Can you try creating a ping job directly on netdata-sz-ctc? You will need to select the node

[Screenshot: Screenshot 2024-07-15 at 10 21 45]

> Unknown config id given

This may indicate that the go.d.plugin (it has ping functionality) is not running on that particular node.
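
One way to check this on the node is to look for the plugin in the process list. The sketch below is only an illustration: the helper names are made up, it assumes a Linux `ps` that accepts `-eo args`, and the sample command line in the usage note is illustrative rather than copied from the affected node.

```python
import subprocess


def plugin_in_ps_output(ps_output, plugin="go.d.plugin"):
    """Return True if any process command line mentions the plugin."""
    return any(plugin in line for line in ps_output.splitlines())


def plugin_is_running(plugin="go.d.plugin"):
    """Ask `ps` for every process command line and search it for the plugin."""
    # `ps -eo args` prints one command line per process (assumes procps on Linux).
    result = subprocess.run(
        ["ps", "-eo", "args"], capture_output=True, text=True, check=True
    )
    return plugin_in_ps_output(result.stdout, plugin)


# Illustrative sample of what `ps -eo args` might print on a node.
sample = "COMMAND\n/usr/sbin/netdata -D\n/usr/libexec/netdata/plugins.d/go.d.plugin 1\n"
found = plugin_in_ps_output(sample)
missing = plugin_in_ps_output("COMMAND\n/usr/sbin/netdata -D\n")
```

Calling `plugin_is_running()` on the node itself runs the same check against the live process list.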

onion83 commented 1 month ago

As shown in the attached video, I recreated a brand-new node, netdata-sz-ctc2, and added it to the dashboard, confirming it is online. The video shows netdata.cloud on the left and the local node on the right.

  1. SSH into netdata-sz-ctc2 and confirm via the ps command that go.d.plugin is running in the system processes.
  2. Create a local task named localtest and confirm its success.
  3. In the netdata.cloud management console, attempt to sync an existing node's (netdata-cmc) apps monitoring task to netdata-sz-ctc2. This failed.
  4. In the local backend of netdata-sz-ctc2, attempt to add a task named apps with the monitoring target 1.1.1.1.
  5. In the netdata-cmc node, use the "Submit to multiple nodes" feature and select netdata-sz-ctc2 as the sync target. This time, the task succeeded.
  6. After refreshing the browser with F5 and editing the apps monitoring task on netdata-sz-ctc2, the monitoring target is now fully synchronized with netdata-cmc.

Therefore, the current bug is: After adding a new node, an empty task with the same name must be created to sync with other nodes (only tested with the ping plugin, other plugins not tested).
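
The behavior observed in the steps above can be modeled with a small sketch. It is not Netdata's code: the `Node` class, its `add_job`/`update_job` methods, and the config layout are hypothetical. Only the node names, the job name `apps`, the target 1.1.1.1, and the error message come from this thread; the second target is made up.

```python
class UnknownConfigId(Exception):
    """Raised when an update targets a job the node does not have."""


class Node:
    """Hypothetical stand-in for an agent's job store (not Netdata's API)."""

    def __init__(self, name):
        self.name = name
        self.jobs = {}  # job name -> config dict

    def add_job(self, job, config):
        self.jobs[job] = dict(config)

    def update_job(self, job, config):
        # An update only succeeds if the job already exists on this node.
        if job not in self.jobs:
            raise UnknownConfigId("Unknown config id given")
        self.jobs[job] = dict(config)


def submit_to_multiple_nodes(job, config, targets):
    """Send the same update to every target and collect per-node results."""
    results = {}
    for node in targets:
        try:
            node.update_job(job, config)
            results[node.name] = "ok"
        except UnknownConfigId as err:
            results[node.name] = str(err)
    return results


# 8.8.8.8 is an illustrative extra target, not taken from the report.
source_config = {"hosts": ["1.1.1.1", "8.8.8.8"]}
cmc = Node("netdata-cmc")
cmc.add_job("apps", source_config)
new_node = Node("netdata-sz-ctc2")

# Step 3: the new node has no "apps" job, so the update is rejected.
first_try = submit_to_multiple_nodes("apps", source_config, [cmc, new_node])

# Steps 4-5: create a same-named job locally, then submit again.
new_node.add_job("apps", {"hosts": ["1.1.1.1"]})
second_try = submit_to_multiple_nodes("apps", source_config, [cmc, new_node])
```

In this model the first submission fails only for netdata-sz-ctc2, and after the placeholder job exists the second submission overwrites it with the configuration from netdata-cmc, matching step 6.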

Expected:

  1. Automatically create non-existent monitoring tasks during synchronization.
  2. Feature: Use an auto-install script to join the same room and automatically sync all monitoring tasks, avoiding manual configuration and improving operational efficiency.

https://github.com/user-attachments/assets/107211c2-fd1c-45b8-9d9b-e25392fdf517

ilyam8 commented 1 month ago

@onion83, hey. Not related to the issue, but: performance in the "privileged" mode can become less efficient as the number of targets grows. This is because CPU usage scales disproportionately, meaning it increases much faster than the number of targets. That is a bug in the upstream library we use for go.d/ping. See netdata/netdata#15410.

onion83 commented 1 month ago

Hey @ilyam8, please take a look at the title, issue, and video description. This is specifically about the task distribution issue with the Dynamic Configuration Manager and is not related to ping values, system permissions, CPU, etc.

ilyam8 commented 1 month ago

I know; that is why I started with "Not related to the issue".

sashwathn commented 2 weeks ago

@ilyam8: Is this a bug on the agent side? I don't see why the user needs to create a local job (on the local Agent dashboard) before submitting it to multiple nodes. Or @kapantzak, have you identified an issue on the FE side for this?

kapantzak commented 2 weeks ago

@sashwathn I don't see any FE issue here

ilyam8 commented 2 weeks ago

> Is this a bug at the agent side?

What is happening:

We need to provide another way to copy dyncfg items from Node to Node, or treat "update" as "add" if there is no existing job.

ilyam8 commented 2 weeks ago

> or treat "update" as "add" if there is no existing job.

I think we need this, I will discuss it with @ktsaou when he returns.

ktsaou commented 2 weeks ago

So for all nodes it is an update, but for the new nodes it has to be an add.

The solution is to convert an update to an add if the item is not already there?

ilyam8 commented 2 weeks ago

> The solution is to convert an update to an add if the item is not already there?

Yes.
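
A minimal sketch of that idea, assuming a node's jobs can be represented as a plain dictionary. The function name `update_or_add` and the data layout are hypothetical, and the sketch does not say whether the conversion should happen in the agent or in the frontend.

```python
def update_or_add(jobs, job_id, config):
    """Apply `config` to `job_id`, adding the job if it is missing.

    `jobs` maps job id -> config dict for one node. Returns the action
    that was actually performed: "update" or "add".
    """
    action = "update" if job_id in jobs else "add"
    jobs[job_id] = dict(config)  # store a copy so nodes do not share state
    return action


config = {"hosts": ["1.1.1.1"]}
nodes = {
    "netdata-cmc": {"apps": {"hosts": []}},  # job already present
    "netdata-sz-ctc2": {},                   # freshly installed node
}

# The same submission yields a different action depending on the node.
performed = {name: update_or_add(jobs, "apps", config) for name, jobs in nodes.items()}
```

With this behavior a submission to multiple nodes no longer depends on a same-named job having been created on the new node beforehand.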


cc @onion83

An alternative is to use this workflow:

This will result in "add" - no issues.

https://github.com/user-attachments/assets/d4c102cd-8f38-4ae3-8c73-eb8f45e4e964

ilyam8 commented 1 week ago

@kapantzak hey 👋 We discussed the issue with @ktsaou and suggest the following changes to the frontend:

kapantzak commented 1 week ago

Hi @onion83, we released some changes for this that hopefully fix the issue.

onion83 commented 1 week ago

It works! Thank you.