prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

ZFS collector uses different names for the same metric on FreeBSD vs Linux #1945

Open paxswill opened 3 years ago

paxswill commented 3 years ago

Host operating system: output of uname -a

node_exporter version: output of node_exporter --version

I'm using the packaged versions for FreeBSD 12 (latest) and Ubuntu 20.10

node_exporter, version 1.0.1 (branch: release-1.0, revision: 0)
  build user:       root
  build date:
  go version:       go1.15.6
node_exporter, version 1.0.1+ds (branch: debian/sid, revision: 1.0.1+ds-1)
  build user:       pkg-go-maintainers@lists.alioth.debian.org
  build date:       20200618-22:00:01
  go version:       go1.14.4

node_exporter command line flags

FreeBSD: --web.listen-address=172.17.3.5:9100 --collector.textfile.directory=/var/tmp/node_exporter --collector.zfs
Linux: (none, using defaults)

Are you running node_exporter in Docker?

No

What did you do that produced an error?

Ran node_exporter, collecting ZFS metrics on both Linux and FreeBSD

What did you expect to see?

Metrics would have the same name (I think that's the right term?) from both nodes.

What did you see instead?

The names of metrics exported from FreeBSD are different from the ones from Linux.

Extra Comments

The FreeBSD system is more precisely running TrueNAS, but it's close enough to stock FreeBSD that the difference shouldn't matter. It's running OpenZFS 2.0, while the Linux system is running ZFSonLinux 0.8.4 (the same project; 1.0 was skipped while the different OSs were unified onto a single codebase), so the ZFS versions are also close enough that they shouldn't make a difference.

The FreeBSD ZFS collector modifies the names it reads from sysctl, while the Linux ZFS collector seems to basically dump the contents of /proc/spl/kstat/zfs/*, only munging the names enough to fit the Prometheus naming scheme.
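
For illustration, here's a minimal sketch in Go of how two collectors that start from different sources (a FreeBSD sysctl OID vs. a Linux /proc/spl/kstat/zfs file) can end up exposing the same statistic under different metric names. This is not node_exporter's actual code, and the specific names are invented for the example.

  package main

  import (
      "fmt"
      "strings"
  )

  // freebsdName mimics a collector that renames sysctl kstats, e.g.
  // "kstat.zfs.misc.arcstats.hits" -> "node_zfs_arc_hits".
  func freebsdName(oid string) string {
      parts := strings.Split(oid, ".") // ["kstat" "zfs" "misc" "arcstats" "hits"]
      module := strings.TrimSuffix(parts[3], "stats")
      return fmt.Sprintf("node_zfs_%s_%s", module, parts[4])
  }

  // linuxName mimics a collector that dumps /proc/spl/kstat/zfs/<file> and only
  // sanitizes the raw names, e.g. file "arcstats", row "hits" ->
  // "node_zfs_arcstats_hits".
  func linuxName(file, stat string) string {
      return fmt.Sprintf("node_zfs_%s_%s", file, stat)
  }

  func main() {
      fmt.Println(freebsdName("kstat.zfs.misc.arcstats.hits")) // node_zfs_arc_hits
      fmt.Println(linuxName("arcstats", "hits"))                // node_zfs_arcstats_hits
  }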

discordianfish commented 3 years ago

I agree, that's quite unfortunate. Initially we always just dumped pretty much raw names from proc, but we changed that eventually (https://github.com/prometheus/node_exporter/issues/150). The ZFS collector wasn't changed. Unfortunately, changing it now will be a breaking change. Still, it's something we should probably do. @SuperQ wdyt?

SuperQ commented 3 years ago

Yes, I never liked the dumping of random proc contents with no filtering. This leads to random breakage.

conallob commented 1 year ago

To minimise this breaking change, how about cleaning up the metric names, but also providing a flag to control whether the old names should still be published?

--publish-legacy-names can default to true, allowing the cleanup to be a no-op immediately. In a future release, the flag can be disabled and eventually retired.
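
A minimal sketch of what such a guard could look like, assuming a kingpin-style boolean flag (node_exporter already uses kingpin for its flags). The flag name, helper, and metric names below are hypothetical, not an existing API:

  package main

  import (
      "fmt"

      "github.com/alecthomas/kingpin/v2"
  )

  // Hypothetical compatibility flag, defaulting to true so the rename is a
  // no-op until the default is flipped in a later release.
  var publishLegacyNames = kingpin.Flag(
      "collector.zfs.publish-legacy-names",
      "Expose ZFS metrics under their legacy names (proposed flag, for illustration only).",
  ).Default("true").Bool()

  // nameFor returns the name a sample is published under: the legacy name while
  // the flag is on (so existing dashboards keep working), the cleaned-up name
  // once it has been turned off.
  func nameFor(cleaned, legacy string) string {
      if *publishLegacyNames {
          return legacy
      }
      return cleaned
  }

  func main() {
      kingpin.Parse()
      fmt.Println(nameFor("node_zfs_arc_hits", "node_zfs_arcstats_hits"))
  }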

discordianfish commented 1 year ago

Yeah, I was thinking about some sort of 'translation' layer to support that for all collectors, so if you come up with a clean pattern for doing this, that would make sense. Besides that, we have provided recording rules for breaking changes in the past.
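
One possible shape for such a translation layer, sketched as a shared lookup table that any collector could consult before emitting a metric. All names and helpers below are invented for the example, not an existing node_exporter API:

  package main

  import "fmt"

  // legacyNames maps cleaned-up metric names to the names they were previously
  // published under. A shared table like this could be consulted by every
  // collector during the compatibility period.
  var legacyNames = map[string]string{
      "node_zfs_arc_hits":   "node_zfs_arcstats_hits",
      "node_zfs_arc_misses": "node_zfs_arcstats_misses",
  }

  // publishNames returns the set of names to expose for a metric, optionally
  // including its legacy alias.
  func publishNames(name string, includeLegacy bool) []string {
      names := []string{name}
      if legacy, ok := legacyNames[name]; ok && includeLegacy {
          names = append(names, legacy)
      }
      return names
  }

  func main() {
      fmt.Println(publishNames("node_zfs_arc_hits", true))
      // -> [node_zfs_arc_hits node_zfs_arcstats_hits]
  }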

conallob commented 1 year ago

How about:

  1. Flag-guard the logic that enables the new, cleaned-up names. Default it to false.
  2. Provide rewrite recording rules and declare a future date, to give users enough notice.
  3. Once that future date arrives, flip the flag to true.
  4. Future date + X months: remove the feature flag and the legacy naming.

Although it takes a while to complete all the stages, I've used this pattern quite a lot professionally.

SuperQ commented 1 year ago

@conallob I don't like maintaining those recording rules. IMO it's better to handle the change in PromQL downstream, with queries like sum(new_metric_name) or sum(old_metric_name). That way the user doesn't have to store two copies of the data.

Simplified timeline:

  1. Flag-guard the logic that enables the new, cleaned-up names. Default it to false.
  2. In the next major release (currently 2.0.0), we flip the flag to true and drop the old metrics.