netdata / netdata-cloud

The public repository of Netdata Cloud. Contribute with bug reports and feature requests.
GNU General Public License v3.0

[Feat]: ability to "expand by" or similar for charts (e.g. expand by container, mount point etc) #894

Open andrewm4894 opened 1 year ago

andrewm4894 commented 1 year ago

Problem

Some users have expressed a preference for the old agent dashboard approach of one chart per container etc., so that by default they can easily see metrics split out by something that makes sense, e.g. by container or mount point.


Description

Some options:

A. some sort of "expand by" or "split by" that easily just breaks the chart back out into its constituent pieces
B. specific point solutions around custom dashboards, for example a custom dashboard specifically for each container etc. that is also in some way dynamic if you add more container views.

Importance

really want

Value proposition

  1. better and easier visualizations for users

Proposed implementation

TBD

andrewm4894 commented 1 year ago

ability to set and override chart defaults at space and room level could be a partial solution here too

e.g. a user could just set the default group-by for various container charts to be "by cgroup"
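Purely as a strawman of what such a room-level default could look like (nothing like this exists yet; every key below is invented for illustration, not a real API):

```jsonc
// hypothetical room-level chart defaults -- all keys invented for illustration
{
  "room": "production",
  "chart_defaults": {
    "cgroup.cpu":  { "group_by": "cgroup" },
    "cgroup.mem":  { "group_by": "cgroup" },
    "disk.space":  { "group_by": "mount_point" }
  }
}
```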

https://github.com/netdata/netdata-cloud/issues/789

netdata-community-bot commented 10 months ago

This issue has been mentioned on the Netdata Community Forums. There might be relevant details there:

https://community.netdata.cloud/t/6-years-experience-but-can-not-use-netdata/4994/2

hugovalente-pm commented 10 months ago

@Pingger I saw you discussing this feature in some places, are you willing to jump on a call with me (PM from Netdata) and our Product Designer to better discuss your use-case and expectations?

Pingger commented 10 months ago

@hugovalente-pm While getting on a call might be difficult, here are a few relevant things:

The following mainly boils down to:

Containers/Cgroups:

  • the nodes are specialised to run specific stuff:
    • Webservers
      • (expected to be) low cpu
      • low ram
      • high network
    • Databases
      • low cpu
      • medium to high ram
      • low network
    • Git/CI
      • low cpu with high spikes
      • low ram with high spikes
      • low network
    • Game-Servers
      • cpu/ram depending on the game
      • medium network
    • tor-nodes
      • high cpu
      • medium ram
      • high network
    • DNS-Server (can be wrapped into webservers)
      • low cpu
      • low ram
      • low network
    • Backup infrastructure
      • low cpu
      • low ram
      • SHITLOAD of network
  • Everything in those groups is inside a linux container/cgroup
  • Those types/groups I would like to be displayed distinct from each other, so I can, without having to change anything upon loading the page, compare databases to databases, webservers to webservers and definitely NOT webservers to game-servers.
    • For that it would be nice to be able to flag containers/cgroups (like you can add custom netdata flags for netdata instances); see the host-labels sketch after this list.
    • Alternatively, to just be able to use the WebUI, WITHOUT A CLOUD ACCOUNT!, to configure how to split groups apart.
  • Also I'd like to have a similar graph for networking and not a "Total" gauge that just sums the traffic. (screenshots)
  • There was already some improvement on this graph, but it is still somewhat confusing ... note the "13 more values" on two of the graphs and "11 more values" on the other one. That makes no sense.
  • Adding netdata to each and every container is not feasible, as that overhead would add up. netdata is very resource friendly, but the idle netdata load I observed is 10-50% of a core (25% on average). Multiply that by sometimes more than 20 containers and you see quite an impact. (screenshot)
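On the "flag containers/cgroups" point: the agent's host labels can already tag a whole node, which might cover part of this when nodes are specialised per workload. A minimal netdata.conf sketch (the `[host labels]` section is real; the keys and values below are made-up examples):

```
# /etc/netdata/netdata.conf -- [host labels] is a real section,
# but the label keys/values below are invented examples
[host labels]
    type = webserver
    environment = production
```

As far as I know this only tags the node as a whole, not individual containers/cgroups, so per-container flagging would still need something extra.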

Similar issues:

  • Systemd services are currently ALL merged into a singular graph. The usefulness of that singular graph is exactly 0. (screenshot)
    • How many services are active? (in the screenshot 4? Of the few hundred that are actually running across all nodes?!) (screenshot of a single node, one of many feeding that graph)
    • The CPU/RAM/...-graphs for the systemd units have the same issue the cgroups have:
      • systemd base unit files should all be low cpu/ram/net...
      • some services, e.g. the vpn-client, are instead low cpu/ram but high net
    • Some graphs just don't show some information for no apparent reason. (screenshot)
    • I would like to group the systemd units in a similar manner to the containers; see the cgroups-pattern sketch after this list.
  • (Hard-)Drives and mount-points: a summary graph is fine, but I'd also like to have each drive/partition/volume by itself.
  • Network interfaces: same, but in addition an up and down listing in the summary graphs would be nice, like the cgroups have for CPU and RAM.
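A partially related knob that exists today is the cgroups plugin's simple-pattern matching, which decides which cgroups get their own charts and which are treated as systemd services. A sketch (the section and option names exist in netdata.conf, but the pattern values below are illustrative and the shipped defaults differ between versions):

```
# /etc/netdata/netdata.conf -- illustrative patterns, not the shipped defaults
[plugin:cgroups]
    enable by default cgroups matching = *docker* *lxc* *.service
    cgroups to match as systemd services = !/system.slice/*/*.* /system.slice/*.service
```

Note this only filters and classifies cgroups; it does not group them into named sets, so it would not by itself solve the "webservers vs. databases" grouping described above.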

Other issues I have noticed:

  • the health notifications sent to root via the system mail command ignore the delay rules and fire instantly instead, sometimes causing quite a flood of mails; see the delay sketch after this list.
  • dbengine tiers VERY often lose data for entire weeks or even months! (which is why I disabled those)
  • A way to configure the time selector at the top to always default to "force play".
  • The dashboard pausing while hovering over a graph is just plain annoying and should also be configurable.
  • health configs can't be properly debugged. There is no apparent log or method to find out why a specific alarm doesn't register with a chart, or whether there are syntax errors.
  • plugin configs should all be by themselves! (e.g. cgroups is configured in netdata.conf, while go has its own config); netdata.conf should be responsible only for netdata, not for every tiny subsetting of the plugins! On a clean installation it is 741 lines ... most of it being the proc-plugin, with commented-out settings that should be put into a proc.d folder instead.
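For reference on the delay rules mentioned above: health config files attach a `delay:` line to an alarm, and that is what the mail notifications reportedly ignore. A minimal sketch (the alarm name, chart, and thresholds are invented; the `delay:` syntax follows the stock health configs):

```
# health.d/example.conf -- illustrative alarm, values invented
 alarm: ram_usage_high
    on: system.ram
lookup: average -1m percentage of used
 every: 10s
  warn: $this > 80
  crit: $this > 95
 delay: down 15m multiplier 1.5 max 1h
    to: sysadmin
```

Here `delay: down 15m multiplier 1.5 max 1h` should postpone notifications for status changes (15 minutes when the status goes down, growing 1.5x on repeated changes, capped at 1 hour); per the report above, the mail path fires instantly regardless.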

I'll try to keep this comment updated, with a changelog, as issues/ideas arise over the coming week or so.

Changelog: