kdave / btrfsmaintenance

Scripts for btrfs maintenance tasks like periodic scrub, balance, trim or defrag on selected mountpoints or directories.
GNU General Public License v2.0

Add script to send mail in case btrfs issues were detected #107

Open ximion opened 2 years ago

ximion commented 2 years ago

Hi! This PR adds an extremely basic script that just runs btrfs device stats --check on all btrfs filesystems every hour and sends an email to a user-defined address (most likely root in 90% of all cases) in case any issues were found. This should very much work like the mdadm daemon feature that also sends mail in case one of the RAID members is about to fail.

A feature like this can be very useful for smaller setups where the admin still would like to receive an email in case a disk in a btrfs RAID array fails. This also is likely the billionth time someone has written such a script, so putting a version in one place where it can be shared and improved seemed like a good idea, and btrfsmaintenance seems to be the perfect place to add such a feature.

Thanks for considering this PR!

eku commented 2 years ago

I suggest a cron job, cause cron knows how to send mails.

MAILTO=admin@myserver.com
@hourly /sbin/btrfs device stats /data | grep -vE ' 0$'
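To make the effect of that grep filter concrete, here is a minimal sketch run against sample output. The device name and counter values are invented for illustration; only the line format follows what btrfs device stats prints.

```shell
#!/bin/sh
# Hypothetical sample of `btrfs device stats` output; the device name and
# the counter values are made up for this illustration.
sample_stats() {
    cat <<'EOF'
[/dev/sda1].write_io_errs    0
[/dev/sda1].read_io_errs     3
[/dev/sda1].flush_io_errs    0
[/dev/sda1].corruption_errs  10
[/dev/sda1].generation_errs  0
EOF
}

# Same filter as in the cron line: drop every line whose counter is exactly 0.
# A value such as 10 is kept, because the pattern requires a space before the 0.
nonzero_counters() {
    grep -vE ' 0$'
}

sample_stats | nonzero_counters
```

With this sample, only the read_io_errs and corruption_errs lines are printed. On a healthy filesystem the filter prints nothing, and a cron daemon that mails job output would then have nothing to send.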

ximion commented 2 years ago

I suggest a cron job, cause cron knows how to send mails.

MAILTO=admin@myserver.com
@hourly /sbin/btrfs device stats /data | grep -vE ' 0$'

Doing that would result in:

So, I still see good reasons to have the extra script for this :-)

sten0 commented 2 years ago

I suggest a cron job, cause cron knows how to send mails.

Which cron implementation can do this without an MTA? When I investigated this, I discovered that Fedora now appears to log cron output to syslog (and by now, maybe journald rather than syslog) rather than piping output to the MTA; using journald might also be problematic, because not all systems have adequate persistent journal retention policies. I like the idea of using a file (/run/btrfs-issue-mail-sent), and I wonder if this idea could be extended. @ximion, what do you think about the following approach (pros, cons, etc):

Poll btrfs stats on an hourly basis, and dump it to a file. Limit notification emails similarly to the logic you've proposed, but send a follow up email if the rate of errors rapidly increases.

The reason I wonder about this approach is the following case: one disk begins to fail rapidly, and the rate of failed reads (or failed writes) is increasing hour by hour. Meanwhile, the firmware lies about SMART data while claiming everything is fine.

It also seems like having a file with regularly updated stats could be used to enable desktop notifications, albeit in another project, since this seems out of scope for btrfsmaintenance. Btrfs dev stats are "updated during filesystem [mount] lifetime" in addition to "from a scrub run" (btrfs-device(8)), which is why I think this approach may have value :-)
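The polling idea above could be sketched roughly as follows. This is only an illustration, not code from the PR: the function names, the four result words, and the default threshold of 10 new errors per poll are all assumptions, and the snapshot files are expected to hold plain btrfs device stats output saved by an hourly job.

```shell
#!/bin/sh
# Sum all error counters in a saved `btrfs device stats` snapshot.
# Each line looks like "[/dev/sda1].read_io_errs   3"; the last field is the count.
total_errors() {
    awk '{ sum += $NF } END { print sum + 0 }' "$1"
}

# Decide what to do, given the previous and the current snapshot file.
#   $1 previous snapshot (may not exist yet)
#   $2 current snapshot
#   $3 marker file that records "a mail was already sent"
#   $4 hypothetical threshold: increase per poll that counts as "rapid"
# Prints one of: ok, notify, escalate, quiet.
decide_action() {
    prev_file=$1
    cur_file=$2
    marker=$3
    threshold=${4:-10}

    cur=$(total_errors "$cur_file")
    prev=0
    if [ -f "$prev_file" ]; then
        prev=$(total_errors "$prev_file")
    fi

    if [ "$cur" -eq 0 ]; then
        echo ok          # nothing wrong
    elif [ ! -e "$marker" ]; then
        echo notify      # first time errors are seen
    elif [ $((cur - prev)) -ge "$threshold" ]; then
        echo escalate    # errors grew quickly since the last poll
    else
        echo quiet       # already notified, no rapid growth
    fi
}
```

A caller would create the marker file (for example /run/btrfs-issue-mail-sent, as in the PR) after a "notify" result and send a follow-up mail on "escalate".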

sten0 commented 2 years ago

Oh, and here are the citations for the Fedora case:

https://fedoraproject.org/wiki/Changes/NoDefaultSendmail#Detailed_Description

https://fedoraproject.org/wiki/Changes/NoDefaultSendmail#Release_Notes

ximion commented 2 years ago

In general I think those are good ideas, and the case of errors rapidly increasing on a disk actually appears to be relatively common - on our systems once a disk is starting to fail, I can pretty much bet on this behavior. This would need a script that's a lot more complex than the proposal here though, and I have to say that the idea of just writing a btrfs maintenance daemon that's lightweight and running all the time did cross my mind :-D The btrfs commands pretty much all have nice JSON output that such a daemon could parse to perform the appropriate actions, be it sending an email, writing a log message or sending a message to a desktop environment (but for that case, having a feature like that in udisks is likely the better spot). Major drawback of this is that such a tool would have to be written and maintained in the first place ^^

karlmistelberger commented 2 years ago

Why not use mail instead of sendmail? See the following fragment from unit packagekit-background.service

# this is when something useful was done
if [ $PKCON_RETVAL -ne 5 ]; then
    # send email
    if [ -n "$MAILTO" ]; then
        mail -Ssendwait -s "System updates available: $SYSTEM_NAME" $MAILTO < $PKTMP
    else
        # default behavior is to use cron's internal mailing of output from cron-script
        cat $PKTMP
    fi
fi
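One way to avoid hard-wiring either program is to probe for whichever is installed. The sketch below is an assumption-laden illustration, not part of the PR or of the packagekit unit: pick_mailer and send_report are hypothetical names, and preferring mail over sendmail is an arbitrary choice.

```shell
#!/bin/sh
# Print the name of the first available mail program, or fail if none exists.
# The preference order (mail before sendmail) is an arbitrary choice here.
pick_mailer() {
    for candidate in mail sendmail; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Send the contents of a file as a mail.
#   $1 recipient, $2 subject, $3 file holding the message body
send_report() {
    to=$1
    subject=$2
    body_file=$3

    case $(pick_mailer) in
        mail)
            mail -s "$subject" "$to" < "$body_file"
            ;;
        sendmail)
            # sendmail -t reads the recipient from the headers we write here.
            {
                printf 'To: %s\nSubject: %s\n\n' "$to" "$subject"
                cat "$body_file"
            } | sendmail -t
            ;;
        *)
            echo "no mail program found, cannot send report" >&2
            return 1
            ;;
    esac
}
```

A call could look like send_report root "btrfs errors detected" /tmp/report.txt, where the path is just a placeholder. If neither program is installed, the function prints a warning to stderr and returns non-zero.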

AuHau commented 2 years ago

Small suggestion: it would be good to have a test path to validate that everything is set up correctly and that I will indeed get the email notification when something goes wrong, similar to the -M test directive in smartd.

But otherwise, this is very much needed for me so thanks a lot for this PR! Hopefully this will be merged 👍
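The test path suggested above might look something like the following sketch. The --test argument and the should_notify name are hypothetical and not part of the PR; the only thing taken from btrfs itself is that btrfs device stats --check exits non-zero when any counter is non-zero.

```shell
#!/bin/sh
# Decide whether a notification should be sent.
#   $1 exit status of `btrfs device stats --check` for one filesystem
#   $2 optional mode; the hypothetical value "--test" forces a notification
# Returns 0 (true) if a mail should go out, non-zero otherwise.
should_notify() {
    check_status=$1
    mode=${2:-}

    if [ "$mode" = "--test" ]; then
        return 0    # test run: always send, so the mail path can be verified
    fi
    [ "$check_status" -ne 0 ]
}
```

Running the script once with the test argument after installation would then confirm that mail actually arrives, without waiting for a real error.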

sten0 commented 2 years ago

In general I think those are good ideas, and the case of errors rapidly increasing on a disk actually appears to be relatively common - on our systems once a disk is starting to fail, I can pretty much bet on this behavior.

Thanks. I imagine it's stuff you've already thought of, of course ;) I'm encouraged to hear that this failure mode is common, because common problems of sufficient severity make work towards a solution pragmatically useful.

This would need a script that's a lot more complex than the proposal here though, and I have to say that the idea of just writing a btrfs maintenance daemon that's lightweight and running all the time did cross my mind :-D The btrfs commands pretty much all have nice JSON output that such a daemon could parse to perform the appropriate actions, be it sending an email, writing a log message or sending a message to a desktop environment (but for that case, having a feature like that in udisks is likely the better spot). Major drawback of this is that such a tool would have to be written and maintained in the first place ^^

Yes, definitely, and there was an upstream thread that indicates a need for it:

Zygo Blaxell proposes an autodefrag daemon here: https://www.spinics.net/lists/linux-btrfs/msg122168.html

Qu Wenruo supports the idea here: https://www.spinics.net/lists/linux-btrfs/msg122170.html

And a user (Ghislain Adnet) requests what this PR solves here: https://www.spinics.net/lists/linux-btrfs/msg110798.html

I find Adnet's request interesting because this would be where a future btrfsd could initiate a replace from hot spare, or rebalance to higher raid1c$redundancy level to defend against the rapidly increasing errors failure mode (ie: it's probable that two disks in the volume are from the same batch, and if one is failing, another may soon begin to fail).

sten0 commented 2 years ago

/\ @ximion

rjlasko commented 2 years ago

Agree that an email-on-error service should be added. ZFS supports this behavior, for any preinstalled mail service, via zed configuration.

clickwir commented 1 year ago

FWIW, we've been using 'sendemail' for many years. It's still a dependency, but a much lighter one.

The actual mail server runs elsewhere, no need to have every system be its own mail server.

On Sat, Mar 19, 2022 at 1:40 PM Matthias Klumpp @.***> wrote:

@.**** commented on this pull request.

In btrfs-errmail.sh https://github.com/kdave/btrfsmaintenance/pull/107#discussion_r830519911 :

+then
+    # no email set, nothing to do for us
+    exit 0
+fi
+
+BTRFS_STATS_MOUNTPOINTS=$(expand_auto_mountpoint "auto")
+OIFS="$IFS"
+IFS=:
+for MM in $BTRFS_STATS_MOUNTPOINTS; do
+    if ! is_btrfs "$MM"; then
+        echo "Path $MM is not btrfs, skipping"
+        continue
+    fi
+    devstats=$(btrfs device stats --check $MM 2>&1)
+    if [ $? -ne 0 ]; then
+        mail_body="$(sendmail -t <<EOF

Sendmail would obviously have to be a dependency of this. I changed the code so in case an email location was set but sendmail wasn't installed, the script will fail and print a warning to stderr.


ximion commented 1 year ago

/\ @ximion

Do you know if any progress has been made on the "btrfsd" front?

sten0 commented 1 year ago

Matthias Klumpp @.***> writes:

/\ @ximion

Do you know if any progress has been made on the "btrfsd" front?

I haven't heard anything further. If boot environment handling is within the ideal scope of "btrfsd", then maybe grub-btrfsd could be grown into a general-purpose maintenance btrfsd? But maybe that's too much of a stretch...

https://github.com/Antynea/grub-btrfs

If a future btrfsd does boot environment handling, then it will probably need to support systemd-boot. I wonder if this chicken/egg problem isn't going to be solved until someone from Fedora implements something, and then it becomes the de facto standard.

ximion commented 1 year ago

I'm working on a thing (called btrfsd for now because I don't have a better name...) which will basically be a small binary called by a systemd timer to perform actions like btrfsmaintenance does, but likely a bit more basic, and scratch my particular itch about mail sending and syslog-message-writing, because this patch apparently won't be merged anytime soon. No ETA on this thing yet though, as I am drowning in work a bit and this will be a "when time permits" kind of project. grub-btrfs looks super cool! Probably does make sense being its own project though (consolidating all tools would ease maintenance a bit, but would also require the maintainers to be familiar with every aspect of the software...)

sten0 commented 1 year ago

I'm working on a thing (called btrfsd for now because I don't have a better name...) which will basically be a small binary called by a systemd timer to perform actions like btrfsmaintenance does, but likely a bit more basic, and scratch my particular itch about mail sending and syslog-message-writing, because this patch apparently won't be merged anytime soon. No ETA on this thing yet though, as I am drowning in work a bit and this will be a "when time permits" kind of project.

Thank you, much appreciated! Please CC me news.

grub-btrfs looks super cool! Probably does make sense being its own project though (consolidating all tools would ease maintenance a bit, but would also require the maintainers to be familiar with every aspect of the software...)

🙂 and fair point; I guess that means there's still a need for distribution maintainers to do this work themselves!

ximion commented 1 year ago

Thank you, much appreciated! Please CC me news.

I actually had some time to work on this, and tiny Btrfsd is born :-) I am currently testing it on my computer and a server, and if things work out well, I will make the tool available in Debian as well. It is not as extensive as btrfsmaintenance and will probably only ever support stats/scrub/balance, but it has some nice features (like sending mail on errors, and more mails if errors increase, or only running scrub/balance if the system is not running on battery power). Maybe you'll like it, and others will find it useful too :-)