Btw I noticed that compared to the btrfs device stats output, the device read/write/flush errors in python-btrfs are missing the _io_ in the error name. While I personally don't care much, this could nevertheless throw off some people and monitoring systems. Would it be worth making the names consistent in python-btrfs?
Hi! I try to keep naming as close/consistent as possible to the actual kernel code. I hadn't noticed yet that the progs code has labels that are inconsistent with the naming of the enum in the kernel code. The progs code is from 2012, and the commit has no explanation of why the _io was inserted into the labels.
The DevStats object is an "Object representation of struct btrfs_ioctl_get_dev_stats", which uses btrfs_dev_stat_values with the names that we also see in the progs code.
enum btrfs_dev_stat_values {
        /* disk I/O failure stats */
        BTRFS_DEV_STAT_WRITE_ERRS, /* EIO or EREMOTEIO from lower layers */
        BTRFS_DEV_STAT_READ_ERRS, /* EIO or EREMOTEIO from lower layers */
        BTRFS_DEV_STAT_FLUSH_ERRS, /* EIO or EREMOTEIO from lower layers */

        /* stats for indirect indications for I/O failures */
        BTRFS_DEV_STAT_CORRUPTION_ERRS, /* checksum error, bytenr error or
                                         * contents is illegal: this is an
                                         * indication that the block was damaged
                                         * during read or write, or written to
                                         * wrong location or read from wrong
                                         * location */
        BTRFS_DEV_STAT_GENERATION_ERRS, /* an indication that blocks have not
                                         * been written */

        BTRFS_DEV_STAT_VALUES_MAX
};

/* Reset statistics after reading; needs SYS_ADMIN capability */
#define BTRFS_DEV_STATS_RESET           (1ULL << 0)

struct btrfs_ioctl_get_dev_stats {
        __u64 devid;                            /* in */
        __u64 nr_items;                         /* in/out */
        __u64 flags;                            /* in/out */

        /* out values: */
        __u64 values[BTRFS_DEV_STAT_VALUES_MAX];

        /*
         * This pads the struct to 1032 bytes. It was originally meant to pad to
         * 1024 bytes, but when adding the flags field, the padding calculation
         * was not adjusted.
         */
        __u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX];
};
So, since it's e.g. BTRFS_DEV_STAT_WRITE_ERRS and not BTRFS_DEV_STAT_WRITE_IO_ERRS, it ended up without _io in here.
So, since it's e.g. BTRFS_DEV_STAT_WRITE_ERRS and not BTRFS_DEV_STAT_WRITE_IO_ERRS, it ended up without _io in here.
Interesting! In the case of Prometheus this means a previous series would be mismatched and no longer continue with the new data. Essentially the old timeline would end and a new one would begin, since a single series is uniquely identified by all of its tuple identifiers (called labels). However, other than this lack of continuity it's not a big deal; the labels are not standardized and mismatches frequently happen between different exporters. Since Prometheus metrics are mostly short-lived and expire quickly it really shouldn't cause any problems; after all you'd only alert on a series if the value changed to >0 anyway.
Yeah. Well, brrr, the exporter could also include a hard coded map of the counters and have the field names printed in the same way as they are in the progs stats output. However, using the counters convenience thing (which was added for the nagios plugin) of course automatically picks up new counters if they're implemented... Or have some ugly fixup_flapsie lookup table which filters/replaces the known different names and leaves the rest unchanged.
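For illustration, such a lookup table would be tiny; something like this sketch (it only covers the three counters we know differ, everything else passes through unchanged):

# Hypothetical fixup table (sketch only): rename the counters whose progs
# labels carry the extra _io_, leave all other names as they are.
PROGS_LABELS = {
    'write_errs': 'write_io_errs',
    'read_errs': 'read_io_errs',
    'flush_errs': 'flush_io_errs',
}

def progs_label(name):
    return PROGS_LABELS.get(name, name)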
So, pros and cons.
Since there's already a current script at the location you linked, don't you think we should prepare and test a PR against prometheus-community/node-exporter-textfile-collector-scripts then? In that case it should probably behave as similarly as possible to the existing one.
Or, adopt it and provide an alternative here, and then make more changes and improvements, without caring about breaking all the output format compatibility with the old one. I mean, things like not indexing statistics under a device name, but under fsid uuid and deduplicate them, and not have them jumping around to another device name if you do btrfs replace etc etc... (The get_btrfs_mount_points function in the old one does not return mount points, it returns device names O_o)
I think it's a great idea to adopt it and provide a better alternative in here.
Also, the old script has an allocation reporting part, but as you know, reporting on usage is a lot more hairy than that. It totally depends on what the use case is of course, but I think it is highly likely that the use case is knowing whether you're running out of space or not. When having a better look at doing that for the nagios plugin, I ended up writing the whole fs_usage module to get more extensive reporting, also taking unallocatable ('soft' and 'hard') raw disk space into account. So, I'd rather use that and have the whole thing be very much like the nagios plugin, just with a different output format.
How often are the Prometheus exporter things called? Because creating an FsUsage object means running a chunk allocation simulation to try to fill up all disks, in order to find out about unallocatable space. This can be a pretty heavy operation on a multi-TiB filesystem. For nagios and munin, where it happens only once every 5 minutes or so, it's not a big deal, but it can't be done every 10 seconds.
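To make that concrete, this is roughly what I mean (just a sketch; the FsUsage constructor and the attribute names below are written from memory and may not match the fs_usage module exactly):

import btrfs
import btrfs.fs_usage

fs = btrfs.FileSystem('/mnt/backup')
# Building the FsUsage object runs the chunk allocation simulation, which
# is the potentially expensive part on a large multi-device filesystem.
usage = btrfs.fs_usage.FsUsage(fs)
# 'soft' and 'hard' unallocatable raw bytes as mentioned above; the exact
# attribute names here are an assumption.
print(usage.unallocatable_soft, usage.unallocatable_hard)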
So much thinking out loud. Thanks for starting this discussion.
Yeah. Well, brrr, the exporter could also include a hard coded map of the counters [...]
Yeah..no? This PR was supposed to be a) an example of how elegant python-btrfs is and b) to do the one thing that's missing from the existing native node_exporter and its btrfs collector, namely collecting the device errors.
Ideally this script shouldn't even exist and be part of node_exporter, but there are problems with that (the ioctl tree search permissions vs. the fact that node_exporter mandates unprivileged permissions at runtime vs. my ~hate~ lack of patience for Go). None of this would exist if the device stats were properly exposed in sysfs and readable, which they somehow aren't. I have this thread bookmarked for that.
Since there's already a current script at the location you linked, don't you think we should prepare and test a PR against prometheus-community/node-exporter-textfile-collector-scripts then? In that case it should probably behave as similarly as possible to the existing one.
No..ish? The old script was contributed as a stopgap measure when the native integration in node_exporter didn't exist yet. It's also been moved to the "community" farm, where it can play with the other outdated and broken scripts. Also the allocation stats are no longer needed (see above), which is why I rewrote the error part only.
Or, adopt it and provide an alternative here, and then make more changes and improvements without caring about breaking all the output format compatibility with the old one. I mean, things like not indexing statistics under a device name, but under fsid uuid and deduplicate them, and not have them jumping around to another device name if you do btrfs replace etc etc... (The get_btrfs_mount_points function in the old one does not return mount points, it returns device names O_o)
What I did so far was just my usual curiosity to see if I (not knowing much Python) could rewrite the thing to not fork/grep/regexp and whatnot. (I learned how to yield..yay.) That worked beautifully - the new code is almost self-explanatory! - and the metrics so far are as close to the original version as possible. You're completely correct that we could change that (like device -> uuid), but IMHO that would create more confusion than it solves. Prometheus is used for alerting, so if a counter for sdX goes up, you get an alert, replace it (or something) and then no longer care. Whether we skip the mountpoint and/or include the fsid is something I'll have to think about. Excluding the device seems weird since you'd then have to find it out manually..?
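(For reference, the core of it boils down to something like the following sketch; the python-btrfs method names - devices(), dev_stats(), the counters dict - are written from my understanding of the library and may be slightly off.)

import btrfs

fs = btrfs.FileSystem('/mnt/data2')
for device in fs.devices():
    stats = fs.dev_stats(device.devid)
    # counters is the convenience mapping mentioned earlier, e.g.
    # {'write_errs': 0, 'read_errs': 0, ...}; the real script also resolves
    # the devid to a device name for the label.
    for name, value in stats.counters.items():
        print('node_btrfs_errors_total{{devid="{}",type="{}"}} {}'.format(
            device.devid, name, value))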
I think it's a great idea to adopt it and provide a better alternative in here.
But I don't want to expand it - the point was really only to collect errors and nothing else, since all the other statistics are provided by node_exporter already. Now whether those values are correct is a very different discussion (the Go code just scrapes sysfs, nothing fancy, and it's certainly not aware of edge cases). But that's just .. ¯\_(ツ)_/¯ for now. Two years ago I even started writing a btrfs_exporter in .. brace yourself .. C++ simply to see how far that would go, and it runs and detects btrfs mountpoints coming/going and serves http and whatnot, but due to $THINGS I got overtaken by the basic stats in node_exporter. With this helper script for errors I'll probably just archive my project since it's kind of pointless.
How often are the Prometheus exporter things called? Because creating an FsUsage object means running a chunk allocation simulation to try to fill up all disks, in order to find out about unallocatable space. This can be a pretty heavy operation on a multi-TiB filesystem. For nagios and munin, where it happens only once every 5 minutes or so, it's not a big deal, but it can't be done every 10 seconds.
Yeah I don't think we should do any of that. :-) Prometheus allows rather nice dynamic queries (I even thought about a btrfs heat map display but somehow didn't get to it yet), including edge/level-triggered window functions which one could use to calculate e.g. the velocity of an impending ENOSPC doom. Based on my observations of how often people run out of space even without btrfs, I've concluded that I'm overthinking all this. :cry: How often exporters are called depends on how you configure them (for granularity). I poll most of my exporters every 1m and some cron-based things (like this one) every 5m. The scripts just dump their output into a tmpfs directory where it is picked up & sent out when the node_exporter is polled. It depends on the velocity of the data and how quickly you want to react to changes for alerting. I'm not a big fan of Prometheus' polling-based architecture (or its bizarre notions of target 'discovery'), but that ship has sailed.
Soo..I can haz no new features plz? :smile_cat:
Just to give you an example of the output of the Go-based node_exporter, here's a subset (without generic filesystem data) of what my Prometheus stores, with the output of the btrfs_error_exporter mixed in. You can see that it's a happy mix of uuids, devices and whatnot. Positively noteworthy (but cut for brevity here): python-btrfs/btrfs_error_exporter correctly finds only real filesystems, whereas the sysfs-scraping node_exporter cannot distinguish between real mounts and bind mounts, generating a lot more redundant output than necessary.
Looking at it now I think I'll add the uuid to the error series as well, just to be consistent with e.g. node_btrfs_device_size_bytes. Whether the mountpoint really belongs there or not..hmpf. IMHO it doesn't hurt.
$ curl -s localhost:9100/metrics | grep -v "^#" | grep btrfs
node_btrfs_allocation_ratio{block_group_type="data",mode="single",uuid="14831af0-32d2-40b7-98c9-ca5467910a8c"} 1
node_btrfs_allocation_ratio{block_group_type="data",mode="single",uuid="97629840-2c9a-4d0e-8c71-dca33a63f6ab"} 1
node_btrfs_allocation_ratio{block_group_type="data",mode="single",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 1
node_btrfs_allocation_ratio{block_group_type="metadata",mode="dup",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 2
node_btrfs_allocation_ratio{block_group_type="metadata",mode="single",uuid="14831af0-32d2-40b7-98c9-ca5467910a8c"} 1
node_btrfs_allocation_ratio{block_group_type="metadata",mode="single",uuid="97629840-2c9a-4d0e-8c71-dca33a63f6ab"} 1
node_btrfs_allocation_ratio{block_group_type="system",mode="dup",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 2
node_btrfs_allocation_ratio{block_group_type="system",mode="single",uuid="14831af0-32d2-40b7-98c9-ca5467910a8c"} 1
node_btrfs_allocation_ratio{block_group_type="system",mode="single",uuid="97629840-2c9a-4d0e-8c71-dca33a63f6ab"} 1
node_btrfs_device_size_bytes{device="sdb1",uuid="97629840-2c9a-4d0e-8c71-dca33a63f6ab"} 2.000397868544e+12
node_btrfs_device_size_bytes{device="sdc1",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 4.000785964544e+12
node_btrfs_device_size_bytes{device="sdd1",uuid="14831af0-32d2-40b7-98c9-ca5467910a8c"} 1.00020383744e+12
node_btrfs_errors_total{device="sdb1",mountpoint="/mnt/data2",type="corruption_errs"} 0
node_btrfs_errors_total{device="sdb1",mountpoint="/mnt/data2",type="flush_errs"} 0
node_btrfs_errors_total{device="sdb1",mountpoint="/mnt/data2",type="generation_errs"} 0
node_btrfs_errors_total{device="sdb1",mountpoint="/mnt/data2",type="read_errs"} 0
node_btrfs_errors_total{device="sdb1",mountpoint="/mnt/data2",type="write_errs"} 0
node_btrfs_errors_total{device="sdc1",mountpoint="/mnt/backup",type="corruption_errs"} 0
node_btrfs_errors_total{device="sdc1",mountpoint="/mnt/backup",type="flush_errs"} 0
node_btrfs_errors_total{device="sdc1",mountpoint="/mnt/backup",type="generation_errs"} 0
node_btrfs_errors_total{device="sdc1",mountpoint="/mnt/backup",type="read_errs"} 0
node_btrfs_errors_total{device="sdc1",mountpoint="/mnt/backup",type="write_errs"} 0
node_btrfs_errors_total{device="sdd1",mountpoint="/mnt/data1",type="corruption_errs"} 0
node_btrfs_errors_total{device="sdd1",mountpoint="/mnt/data1",type="flush_errs"} 0
node_btrfs_errors_total{device="sdd1",mountpoint="/mnt/data1",type="generation_errs"} 0
node_btrfs_errors_total{device="sdd1",mountpoint="/mnt/data1",type="read_errs"} 0
node_btrfs_errors_total{device="sdd1",mountpoint="/mnt/data1",type="write_errs"} 0
node_btrfs_global_rsv_size_bytes{uuid="14831af0-32d2-40b7-98c9-ca5467910a8c"} 4.15137792e+08
node_btrfs_global_rsv_size_bytes{uuid="97629840-2c9a-4d0e-8c71-dca33a63f6ab"} 5.36870912e+08
node_btrfs_global_rsv_size_bytes{uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 5.36870912e+08
node_btrfs_info{label="Backups",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 1
node_btrfs_info{label="Library",uuid="14831af0-32d2-40b7-98c9-ca5467910a8c"} 1
node_btrfs_info{label="Archive",uuid="97629840-2c9a-4d0e-8c71-dca33a63f6ab"} 1
node_btrfs_reserved_bytes{block_group_type="data",uuid="14831af0-32d2-40b7-98c9-ca5467910a8c"} 0
node_btrfs_reserved_bytes{block_group_type="data",uuid="97629840-2c9a-4d0e-8c71-dca33a63f6ab"} 0
node_btrfs_reserved_bytes{block_group_type="data",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 0
node_btrfs_reserved_bytes{block_group_type="metadata",uuid="14831af0-32d2-40b7-98c9-ca5467910a8c"} 0
node_btrfs_reserved_bytes{block_group_type="metadata",uuid="97629840-2c9a-4d0e-8c71-dca33a63f6ab"} 0
node_btrfs_reserved_bytes{block_group_type="metadata",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 0
node_btrfs_reserved_bytes{block_group_type="system",uuid="14831af0-32d2-40b7-98c9-ca5467910a8c"} 0
node_btrfs_reserved_bytes{block_group_type="system",uuid="97629840-2c9a-4d0e-8c71-dca33a63f6ab"} 0
node_btrfs_reserved_bytes{block_group_type="system",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 0
node_btrfs_size_bytes{block_group_type="data",mode="single",uuid="14831af0-32d2-40b7-98c9-ca5467910a8c"} 3.24278419456e+11
node_btrfs_size_bytes{block_group_type="data",mode="single",uuid="97629840-2c9a-4d0e-8c71-dca33a63f6ab"} 1.779190202368e+12
node_btrfs_size_bytes{block_group_type="data",mode="single",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 2.063731785728e+12
node_btrfs_size_bytes{block_group_type="metadata",mode="dup",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 4.294967296e+09
node_btrfs_size_bytes{block_group_type="metadata",mode="single",uuid="14831af0-32d2-40b7-98c9-ca5467910a8c"} 2.147483648e+09
node_btrfs_size_bytes{block_group_type="metadata",mode="single",uuid="97629840-2c9a-4d0e-8c71-dca33a63f6ab"} 2.147483648e+09
node_btrfs_size_bytes{block_group_type="system",mode="dup",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 3.3554432e+07
node_btrfs_size_bytes{block_group_type="system",mode="single",uuid="14831af0-32d2-40b7-98c9-ca5467910a8c"} 3.3554432e+07
node_btrfs_size_bytes{block_group_type="system",mode="single",uuid="97629840-2c9a-4d0e-8c71-dca33a63f6ab"} 3.3554432e+07
node_btrfs_used_bytes{block_group_type="data",mode="single",uuid="14831af0-32d2-40b7-98c9-ca5467910a8c"} 3.23371597824e+11
node_btrfs_used_bytes{block_group_type="data",mode="single",uuid="97629840-2c9a-4d0e-8c71-dca33a63f6ab"} 1.71242448896e+12
node_btrfs_used_bytes{block_group_type="data",mode="single",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 2.056659902464e+12
node_btrfs_used_bytes{block_group_type="metadata",mode="dup",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 3.12737792e+09
node_btrfs_used_bytes{block_group_type="metadata",mode="single",uuid="14831af0-32d2-40b7-98c9-ca5467910a8c"} 7.44390656e+08
node_btrfs_used_bytes{block_group_type="metadata",mode="single",uuid="97629840-2c9a-4d0e-8c71-dca33a63f6ab"} 1.794310144e+09
node_btrfs_used_bytes{block_group_type="system",mode="dup",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 262144
node_btrfs_used_bytes{block_group_type="system",mode="single",uuid="14831af0-32d2-40b7-98c9-ca5467910a8c"} 65536
node_btrfs_used_bytes{block_group_type="system",mode="single",uuid="97629840-2c9a-4d0e-8c71-dca33a63f6ab"} 212992
node_scrape_collector_duration_seconds{collector="btrfs"} 0.002448091
node_scrape_collector_success{collector="btrfs"} 1
node_textfile_mtime_seconds{file="btrfs_errors.prom"} 1.605051001e+09
Yeah. Well, brrr, the exporter could also include a hard coded map of the counters [...]
Yeah..no? This PR was supposed to be a) an example of how elegant python-btrfs is and b) to do the one thing that's missing from the existing native node_exporter and its btrfs collector, namely collecting the device errors.
Aha, I see. I did not know/realize. This totally makes sense.
Ideally this script shouldn't even exist and be part of node_exporter, but there are problems with that (the ioctl tree search permissions vs. the fact that node_exporter mandates unprivileged permissions at runtime vs. my ~hate~ lack of patience for Go). None of this would exist if the device stats were properly exposed in sysfs and readable, which they somehow aren't. I have this thread bookmarked for that.
Aha. I see. I did not know that it requires non-root. And yes, most of the stuff now needs root because of searches and things.
I haven't used Prometheus yet; it's on my wishlist, though part of my team at work uses it. So, you mean that it's possible to run this thing as root, but the official exporter would not accept such a change, even if you programmed it in Go, as long as it still requires root?
Since there's already a current script at the location you linked, don't you think we should prepare and test a PR against prometheus-community/node-exporter-textfile-collector-scripts then? In that case it should probably behave as similarly as possible to the existing one.
No..ish? The old script was contributed as a stopgap measure when the native integration in node_exporter didn't exist yet. It's also been moved to the "community" farm, where it can play with the other outdated and broken scripts. Also the allocation stats are no longer needed (see above), which is why I rewrote the error part only.
Aha. I did not know that.
Or, adopt it and provide an alternative here, and then make more changes and improvements without caring about breaking all the output format compatibility with the old one. I mean, things like not indexing statistics under a device name, but under fsid uuid and deduplicate them, and not have them jumping around to another device name if you do btrfs replace etc etc... (The get_btrfs_mount_points function in the old one does not return mount points, it returns device names O_o)
What I did so far was just my usual curiosity to see if I (not knowing much Python) could rewrite the thing to not fork/grep/regexp and whatnot. (I learned how to yield..yay.) That worked beautifully - the new code is almost self-explanatory! - and the metrics so far are as close to the original version as possible. You're completely correct that we could change that (like device -> uuid), but IMHO that would create more confusion than it solves. Prometheus is used for alerting, so if a counter for sdX goes up, you get an alert, replace it (or something) and then no longer care. Whether we skip the mountpoint and/or include the fsid is something I'll have to think about. Excluding the device seems weird since you'd then have to find it out manually..?
Hooray. Yes, it is very easy to read, and the yield is a nice trick. The python-btrfs lib also uses generators and yield a lot, but mostly because it makes it possible to process an endless stream of data with minimal buffering. In this case, the yield simply saves you from building a list, with an extra level of indentation for a loop and then returning the list, which makes it a bit easier to read. The calling code can just iterate over it and remains the same.
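Roughly this difference, as a toy example (not actual python-btrfs code):

# Without yield: build up a list and return it.
def btrfs_mount_points_list():
    result = []
    with open('/proc/self/mounts') as mounts:
        for line in mounts:
            fields = line.split()
            if fields[2] == 'btrfs':
                result.append(fields[1])
    return result

# With yield: the caller loops over it in exactly the same way, but there
# is no intermediate list to build and return.
def btrfs_mount_points():
    with open('/proc/self/mounts') as mounts:
        for line in mounts:
            fields = line.split()
            if fields[2] == 'btrfs':
                yield fields[1]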
And yes, device names indeed, so the error stats link to the individual devices. And a device is always part of exactly 1 filesystem, so that's fine.
I think it's a great idea to adopt it and provide a better alternative in here.
But I don't want to expand it - the point was really only to collect errors and nothing else, since all the other statistics are provided by node_exporter already. Now whether those values are correct is a very different discussion (the Go code just scrapes sysfs, nothing fancy, and it's certainly not aware of edge cases). But that's just .. ¯\_(ツ)_/¯ for now.
Yes, I did not know.
Two years ago I even started writing a btrfs_exporter in .. brace yourself .. C++ simply to see how far that would go, and it runs and detects btrfs mountpoints coming/going and serves http and whatnot, but due to $THINGS I got overtaken by the basic stats in node_exporter. With this helper script for errors I'll probably just archive my project since it's kind of pointless.
Depends on what your goal is. :-) Learn more C++ or just get it over with, haha.
How often are the Prometheus exporter things called? Because creating an FsUsage object means running a chunk allocation simulation to try to fill up all disks, in order to find out about unallocatable space. This can be a pretty heavy operation on a multi-TiB filesystem. For nagios and munin, where it happens only once every 5 minutes or so, it's not a big deal, but it can't be done every 10 seconds.
Yeah I don't think we should do any of that. :-)
Ok. And if it's limited to the error counters, none of that matters. If we add more functionality later, then it probably makes sense to have multiple little exporter scripts that each do one thing and do it well, instead of cramming all of it into one.
Soo..I can haz no new features plz? :smile_cat:
Sure. This all makes sense. I'll add some code review. Nothing really spectacular. And now I have to set up some Prometheus to test it as well.
I haven't used Prometheus yet; it's on my wishlist, though part of my team at work uses it. So, you mean that it's possible to run this thing as root, but the official exporter would not accept such a change, even if you programmed it in Go, as long as it still requires root?
Yup, correct. The (now closed) issue for native btrfs usage stats is here (feat. usual suspects ;). The problem was getting the list of devices (requiring root when using btrfs-progs but not via ioctl) and the obvious workaround would be what you once called the "yolo method", aka trying devids in increasing order until you get an error. I think looking into /sys/fs/btrfs/<fsid>/devices and using those device names (if possible?) could be an acceptable solution in Go; alternatively one would have to implement the ioctl properly. Again, simply having the stats in sysfs under the device would be the easiest way for all of this. I'll probably post that to the list again in the next few days.
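(For what it's worth, reading that directory needs no privileges at all; a sketch, assuming the sysfs layout of one entry per member device, named after the block device:)

import os

# Hypothetical helper: list the member devices of one btrfs filesystem by
# looking at /sys/fs/btrfs/<fsid>/devices - no root required.
def fs_device_names(fsid):
    return sorted(os.listdir('/sys/fs/btrfs/{}/devices'.format(fsid)))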
Non-root for node_exporter (in fact for all official exporters) is mandatory because they are accessed over http, providing a potential security hole/attack vector with the possibility of RCEs. I understand the rationale but don't fully buy that argument; whatever. Running the script as root via cron into a shared directory is not exploitable over the network, and allows for host-local administrative control.
Sure. This all makes sense. I'll add some code review. Nothing really spectacular. And now I have to set up some Prometheus to test it as well.
I'll add an author notice and some licensing header like in the other examples. Thanks & have fun :)
I don't mean to be pushy, but if you wait much longer the Go-based exporter is going to steal your thunder, and we can't have that.. ;)
The latest version of the btrfs collector in node_exporter no longer requires elevated privileges and has working error stats, so this is no longer necessary.
This is an extension for Prometheus' native node_exporter with a "text collector" (to be called e.g. from cron) that collects btrfs device errors. Unlike its original version (https://git.io/JkTIi) this is written with python-btrfs and therefore does not shell out to btrfs device stats or rely on its output. The original script was written before the native node_exporter had btrfs stats, which it has now - but device errors are still missing, so here we are. I might need to add some licensing and a copyright but wasn't sure if you care..let me know what you think. =)