psy0rz / zfs_autobackup

ZFS autobackup is used to periodically back up ZFS filesystems to other locations. Easy to use and very reliable.
https://github.com/psy0rz/zfs_autobackup
GNU General Public License v3.0

Command "ssh <example.com> 'zfs list <dataset>'" returned exit code 255 (valid codes: [0, 1]) #165

Closed by faern 2 years ago

faern commented 2 years ago

I recently upgraded zfs-autobackup from version ~3.0 (I can't remember exactly which version, but it was a release candidate) to 3.1.3. And I also upgraded zfs on my source machine at the same time. But since this error is about the target, I think that's irrelevant.

Now I get this error fairly frequently. Not on every run of the script, but a few times per day or something (I run it every hour). I have substituted the domain and name of the dataset:

! [Target] Command "ssh <example.com> 'zfs list <dataset>'" returned exit code 255 (valid codes: [0, 1])
! Exception: Last command returned error

I don't think my remote suddenly started doing anything differently, so I assume zfs-autobackup is now checking the exit code more strictly. The man page for zfs list does not specify exit codes, but ssh exits with code 255 if an error occurred, so I assume that's where it comes from. Could zfs-autobackup maybe handle that more gracefully? An error did indeed happen, but I need a way to avoid cron daemon error emails caused by occasional network errors :thinking:
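
To illustrate the idea of handling this case more gracefully, here is a minimal sketch of a wrapper that retries a command only when it exits with 255. It is not how zfs-autobackup works internally: `run_with_retry`, its parameters, and the retry policy are hypothetical names chosen for this example. It also assumes that 255 always means an ssh-level failure, which is not guaranteed, since the remote command itself could return 255.

```python
import subprocess
import time

# Exit code ssh uses when ssh itself fails (connection refused, timeout, ...).
SSH_TRANSPORT_ERROR = 255


def run_with_retry(cmd, valid_codes=(0, 1), retries=3, delay=5.0):
    """Run cmd and return its exit code, retrying only on exit code 255.

    Hypothetical helper for illustration; not part of zfs-autobackup.
    Raises RuntimeError if the final exit code is not in valid_codes.
    """
    for attempt in range(1, retries + 1):
        code = subprocess.run(cmd).returncode
        if code in valid_codes:
            return code
        # Give up immediately on non-ssh errors, or once retries are used up.
        if code != SSH_TRANSPORT_ERROR or attempt == retries:
            raise RuntimeError(
                f"Command {cmd!r} returned exit code {code} "
                f"(valid codes: {list(valid_codes)})"
            )
        time.sleep(delay)


# Example call (placeholders as in the error message above):
# run_with_retry(["ssh", "<example.com>", "zfs list <dataset>"])
```

A wrapper like this only hides short outages. A persistent failure still raises after the last attempt, so cron would still send an email in that case.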

psy0rz commented 2 years ago

Not getting errors because of occasional network errors might be difficult. :)

However, you can use --ignore-transfer-errors to make it not check the exit code during transfer. It will still verify if the snapshot actually exists on the target.

If you want better monitoring check out https://github.com/psy0rz/zfs_autobackup/wiki/Monitoring#monitoring-example-with-zabbix-jobs

Also: Try this: https://github.com/psy0rz/zfs_autobackup/wiki/Performance#speeding-up-ssh

It usually makes things more reliable and faster.

faern commented 2 years ago

Thank you for the input and ideas! I'll read up on that.

faern commented 2 years ago

Interestingly enough, setting up ControlPath/ControlMaster silenced the errors completely. Knock on wood, but previously I got 3-4 per day, now I have not seen a single such error email in a few days.
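
For reference, a connection-sharing setup of this kind might look roughly like the following in `~/.ssh/config` on the machine that runs zfs-autobackup. This is a sketch, not necessarily the exact configuration used here: the host name is the placeholder from the error message, and the socket path and the `ControlPersist` timeout are assumptions to adjust.

```
# Reuse one ssh connection for repeated commands to the backup target.
Host <example.com>
    ControlMaster auto
    # Assumed socket location; the directory must exist and be writable.
    ControlPath ~/.ssh/control-master-%r@%h:%p
    # Assumed value: keep the master connection open for an hour after last use.
    ControlPersist 3600
```

With this in place, later `ssh` invocations to the same host go through the existing master connection instead of opening a new one each time, so far fewer new connections reach sshd.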

psy0rz commented 2 years ago

Yes that was probably it. This also happens if an ssh port is exposed to the internet and is getting hammered by all kinds of bots and scripts. Every so often sshd will then refuse a connection.

(or if sshd or a firewall just thinks you're reconnecting to ssh too often)