netdata / netdata-cloud

The public repository of Netdata Cloud. Contribute with bug reports and feature requests.
GNU General Public License v3.0

[Feat]: ability to "expand by" or similar for charts (e.g. expand by container, mount point etc) #894

Open andrewm4894 opened 1 year ago

andrewm4894 commented 1 year ago

Problem

Some users have expressed a preference for the old agent dashboard approach of one chart per container etc., so that by default they can easily see metrics split out by something that makes sense, e.g. by container or mount point.


Description

Some options:

A. some sort of "expand by" or "split by" that easily just breaks the chart back out into its constituent pieces
B. specific point solutions around custom dashboards, for example a custom dashboard specifically for each container etc. that is also in some way dynamic if you add more container views.

Importance

really want

Value proposition

  1. better and easier visualizations for users

Proposed implementation

TBD

andrewm4894 commented 1 year ago

ability to set and override chart defaults at space and room level could be a partial solution here too

e.g. a user could just set the default group-by for various container charts to be "by cgroup"
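Purely as a strawman of what such a room-level default could look like (nothing like this exists yet; every key below is invented for illustration, not a real API):

```jsonc
// hypothetical room-level chart defaults -- all keys invented for illustration
{
  "room": "production",
  "chart_defaults": {
    "cgroup.cpu":  { "group_by": "cgroup" },
    "cgroup.mem":  { "group_by": "cgroup" },
    "disk.space":  { "group_by": "mount_point" }
  }
}
```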

https://github.com/netdata/netdata-cloud/issues/789

netdata-community-bot commented 10 months ago

This issue has been mentioned on the Netdata Community Forums. There might be relevant details there:

https://community.netdata.cloud/t/6-years-experience-but-can-not-use-netdata/4994/2

hugovalente-pm commented 10 months ago

@Pingger I saw you discussing this feature in some places, are you willing to jump on a call with me (PM from Netdata) and our Product Designer to better discuss your use-case and expectations?

Pingger commented 10 months ago

@hugovalente-pm While getting on a call might be difficult, here are a few relevant things:

The following mainly boils down to:

Containers/Cgroups:

  • the nodes are specialised to run specific stuff:
    • Webservers
      • (expected to be) low cpu
      • low ram
      • high network
    • Databases
      • low cpu
      • medium to high ram
      • low network
    • Git/CI
      • low cpu with high spikes
      • low ram with high spikes
      • low network
    • Game-Servers
      • cpu/ram depending on the game
      • medium network
    • tor-nodes
      • high cpu
      • medium ram
      • high network
    • DNS-Server (can be wrapped into webservers)
      • low cpu
      • low ram
      • low network
    • Backup infrastructure
      • low cpu
      • low ram
      • SHITLOAD of network
  • Everything in those groups is inside a linux container/cgroup
  • Those types/groups I would like to be displayed distinct from each other, so I can, without having to change anything upon loading the page, compare databases to databases, webservers to webservers and definitely NOT webservers to game-servers.
    • For that it would be nice to be able to flag containers/cgroups (like you can add custom netdata flags for netdata instances); see the host-labels sketch after this list.
    • Alternatively, to just be able to use the WebUI, WITHOUT A CLOUD ACCOUNT!, to configure how to split groups apart.
  • Also I'd like to have a similar graph for networking and not a "Total" gauge that just sums the traffic. (screenshots)
  • There was already some improvement on this graph, but it is still somewhat confusing ... note the "13 more values" on two of the graphs and "11 more values" on the other one. That makes no sense.
  • Adding netdata to each and every container is not feasible, as that overhead would add up. netdata is very resource friendly, but the idle netdata load I observed is 10-50% of a core (25% on average). Multiply that by sometimes more than 20 containers and you see quite an impact. (screenshot)
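On the "flag containers/cgroups" point: the agent's host labels can already tag a whole node, which might cover part of this when nodes are specialised per workload. A minimal netdata.conf sketch (the `[host labels]` section is real; the keys and values below are made-up examples):

```
# /etc/netdata/netdata.conf -- [host labels] is a real section,
# but the label keys/values below are invented examples
[host labels]
    type = webserver
    environment = production
```

As far as I know this only tags the node as a whole, not individual containers/cgroups, so per-container flagging would still need something extra.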

Similar issues:

  • Systemd services are currently ALL merged into a singular graph. The usefulness of that singular graph is exactly 0. (screenshot)
    • How many services are active? (in the screenshot 4? Of the few hundred that are actually running across all nodes?!) (screenshot of a single node, one of many feeding that graph)
    • The CPU/RAM/...-graphs for the systemd units have the same issue the cgroups have:
      • systemd base unit files should all be low cpu/ram/net...
      • some services, e.g. the vpn-client, are instead low cpu/ram but high net
    • Some graphs just don't show some information for no apparent reason. (screenshot)
    • I would like to group the systemd units in a similar manner to the containers; see the cgroups-pattern sketch after this list.
  • (Hard-)Drives and mount-points: a summary graph is fine, but I'd also like to have each drive/partition/volume by itself.
  • Network interfaces: same, but in addition an up and down listing in the summary graphs would be nice, like the cgroups have for CPU and RAM.
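A partially related knob that exists today is the cgroups plugin's simple-pattern matching, which decides which cgroups get their own charts and which are treated as systemd services. A sketch (the section and option names exist in netdata.conf, but the pattern values below are illustrative and the shipped defaults differ between versions):

```
# /etc/netdata/netdata.conf -- illustrative patterns, not the shipped defaults
[plugin:cgroups]
    enable by default cgroups matching = *docker* *lxc* *.service
    cgroups to match as systemd services = !/system.slice/*/*.* /system.slice/*.service
```

Note this only filters and classifies cgroups; it does not group them into named sets, so it would not by itself solve the "webservers vs. databases" grouping described above.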

Other issues I have noticed:

  • the health notifications sent to root via the system mail command ignore the delay rules and fire instantly instead, sometimes causing quite a flood of mails; see the delay sketch after this list.
  • dbengine tiers VERY often lose data for entire weeks or even months! (which is why I disabled those)
  • A way to configure the time selector at the top to always default to "force play".
  • The dashboard pausing while hovering over a graph is just plain annoying and should also be configurable.
  • health configs can't be properly debugged. There is no apparent log or method to find out why a specific alarm doesn't register with a chart, or whether there are syntax errors.
  • plugin configs should all be by themselves! (e.g. cgroups is configured in netdata.conf, while go has its own config); netdata.conf should be responsible only for netdata, not for every tiny subsetting of the plugins! On a clean installation it is 741 lines ... most of it being the proc-plugin, with commented-out settings that should be put into a proc.d folder instead.
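For reference on the delay rules mentioned above: health config files attach a `delay:` line to an alarm, and that is what the mail notifications reportedly ignore. A minimal sketch (the alarm name, chart, and thresholds are invented; the `delay:` syntax follows the stock health configs):

```
# health.d/example.conf -- illustrative alarm, values invented
 alarm: ram_usage_high
    on: system.ram
lookup: average -1m percentage of used
 every: 10s
  warn: $this > 80
  crit: $this > 95
 delay: down 15m multiplier 1.5 max 1h
    to: sysadmin
```

Here `delay: down 15m multiplier 1.5 max 1h` should postpone notifications for status changes (15 minutes when the status goes down, growing 1.5x on repeated changes, capped at 1 hour); per the report above, the mail path fires instantly regardless.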

I'll try to keep this comment updated, with a changelog, as issues/ideas arise over the coming week or so.

Changelog: