grufocom / vee-mail

Simple Script to get Mails from Free Veeam Agent for Linux
GNU Affero General Public License v3.0

Allow for a few retries? #36

Closed. sevimo123 closed this issue 1 year ago.

sevimo123 commented 2 years ago

Hi, I find vee-mail to be a great tool to augment Veeam on Linux. I noticed, though, that once in a while the script fails for some reason. Then I get a warning in the Veeam log (because the post-job script failed to execute) and no email, so I don't even know about it. It is most likely some intermittent issue, either with the SMTP server or something else, since I get the email just fine if I rerun the script manually later.

So I thought it would be great if the script supported retries to deal with intermittent issues: something like specifying the number of retries and the delay between retries in the config. The script would then do exactly what it does now, but if it fails for any reason it would retry a few times as configured.
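Something along these lines (just a sketch; MAIL_RETRIES, MAIL_RETRY_DELAY and send_mail are made-up names, the last one standing in for whatever curl/sendmail call the script already does):

# hypothetical config values, e.g. set in vee-mail.config
MAIL_RETRIES=3        # total number of attempts
MAIL_RETRY_DELAY=60   # seconds to wait between attempts

attempt=1
until send_mail; do                      # send_mail = the existing curl/sendmail step
  if [ "$attempt" -ge "$MAIL_RETRIES" ]; then
    echo "giving up after $attempt attempts" >&2
    break
  fi
  attempt=$((attempt+1))
  sleep "$MAIL_RETRY_DELAY"
done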

grufocom commented 2 years ago

Hi sevimo123, do you use curl or sendmail for sending emails? Maybe sending the email is not the problem at this point; I think that if you get the warning in the Veeam log, the script is failing at another stage.

The script forks itself, and the forked copy waits 30 seconds for the Veeam job to finish; if the Veeam job has not finished during this time, it will fail.

Please try setting the sleep time in line 77 of the vee-mail.sh script to 60 or more!
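Roughly, this is how that part is meant to work (simplified sketch, not the exact code):

if [ "$1" != "--bg" ]; then
  # restart ourselves in the background and report back to Veeam immediately
  nohup $0 --bg >/dev/null 2>/dev/null &
  exit 0
fi

# background copy: give Veeam time to write the finished session into its database
sleep 30   # line 77 - raise this to 60 or more
# ...then read the session data and send the mail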

sevimo123 commented 2 years ago

I use curl. Why does the script need to wait for the Veeam job to finish? Isn't it supposed to be called when the job is already finished? Veeam unfortunately doesn't provide any additional info on the failure; this is all I'm getting in the job log:

2022-07-26 09:22:34 UTC [warn] Failed to execute post-job script
2022-07-26 09:22:34 UTC [warn] Processing finished with warnings at 2022-07-26 09:22:34 UTC

I'll try increasing the sleep time in the script itself; we'll see how it goes. It doesn't happen very often (once a month?), so it could be hard to tell whether it actually worked.

grufocom commented 2 years ago

You would have to ask Veeam about that. The fact is that the post-job script is executed while the session data has not yet landed in the database, so the script still has to wait for it. With a delay of 30 seconds it has always worked well so far, but I could imagine that exactly this is the problem in your case, and 60 seconds could therefore help.

sevimo123 commented 2 years ago

Interesting, thanks for the info. I changed the delay to 120s even; there is no rush there. We'll see if I still get these intermittent failures. I'll close the issue for now, thanks again!

sevimo123 commented 2 years ago

Well, that didn't take long. Today I got the same issue with the post-job script failing, even though I put in a delay of 120s. Any ideas as to what could be happening there?

grufocom commented 2 years ago

The question is how Veeam recognises that the script is failing. The mail script always finishes with a plain "exit", and only if it is not running as root does it return an "exit 1".

Since the script restarts itself in the background right after starting and then does its waiting there, the post-job script should report back to Veeam immediately and Veeam should finish its backup.

So vee-mail should always return an OK!

So the post-job script error message can't really depend on the exit status of the vee-mail script, unless it is sometimes not started as the root user, which is rather unlikely.

sevimo123 commented 2 years ago

That's what makes it so weird. I am going to put in a rather ridiculous delay (starting with 3600s) and see if the failures disappear.

EDIT: Did some more digging, and it seems that the script times out, even though the foreground part is only supposed to fork into the background. Veeam seems to wait 600s for post-job scripts and then considers them failed:

[01.08.2022 05:13:18.163] <140201739749120> lpbcomm| Executing custom script: [/opt/vee-mail/vee-mail.sh]. Arguments: []
[01.08.2022 05:13:18.163] <140201739749120> lpbcomm| Wait child process for finish.
[01.08.2022 05:13:18.164] <140201719928576> | Thread started. Role: 'script output redirector to buffer', thread id: 140201719928576, parent id: 140201739749120.
[01.08.2022 05:13:18.164] <140201711535872> | Thread started. Role: 'script output redirector to log', thread id: 140201711535872, parent id: 140201739749120.
[01.08.2022 05:13:18.164] <140201703143168> | Thread started. Role: 'Pid waiter. timeout: 600', thread id: 140201703143168, parent id: 140201739749120.
[01.08.2022 05:23:18.164] <140201739749120> lpbcomm| Waiting of PID [18696] is timed out.
[01.08.2022 05:23:18.164] <140201739749120> lpbcomm| Process group [18696] was killed.
[01.08.2022 05:23:18.165] <140201703143168> | Thread finished. Role: 'Pid waiter. timeout: 600'.
[01.08.2022 05:23:18.165] <140201711535872> | Thread finished. Role: 'script output redirector to log'.
[01.08.2022 05:23:18.395] <140201719928576> | Thread finished. Role: 'script output redirector to buffer'.
[01.08.2022 05:23:18.396] <140201739749120> vmb | [SessionLog][warn] Failed to execute post-job script.
[01.08.2022 05:23:18.402] <140201739749120> lpbcore| ERR |Failed to execute post job script with status [success].

sevimo123 commented 2 years ago

OK, I think I've figured out what the issue is, or rather issues ;)

First of all, if the idea is that vee-mail should never fail as far as Veeam is concerned (i.e. the parent only forks into the background and always succeeds), then this is not currently guaranteed. There is a bunch of code that runs before the fork, and it can fail or hang (which is what is happening in my case). Before the actual fork

nohup $0 --bg >/dev/null 2>/dev/null &

there is a bunch of code that sets up environment variables. This is non-trivial code (including sqlite calls and free-space calculations), and I don't see any reason for it to run before the fork; it doesn't do anything useful there, since it gets re-executed in the child process anyway. So the fork needs to be moved way up (pretty much the first thing in the script), unless I am missing something.
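In other words, something like this at the very top (just a sketch of the ordering, not the actual script):

#!/bin/bash
# fork into the background before doing any real work
if [ "$1" != "--bg" ]; then
  nohup $0 --bg >/dev/null 2>/dev/null &
  exit 0   # the parent returns to Veeam right away, nothing up here can time out
fi
# everything else (sleep, sqlite queries, free-space calculations, sending the
# mail) only runs in the background copy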

Now, the part of the pre-fork code that is actually failing is the free-space calculation. It is currently done with a global df -hP whose output is then grepped for the TARGET or MPOINT. This doesn't always work in my case ;) I have a couple of dozen FUSE mounts (e.g. rclone), and a global df can take a very long time, especially since it is repeated several times. So Veeam times out the script after 600 seconds because it is stuck in df. But there is no reason to do df -hP | grep "$TARGET$" when you can simply do df -hP "$TARGET". That returns the same result, and does it much faster because it doesn't have to query all the other mounts that are not relevant.
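For illustration (TARGET here is whatever backup path the script holds in that variable):

# old approach: stats every mounted filesystem (including slow FUSE mounts), then filters
df -hP | grep "$TARGET$"

# targeted approach: stats only the filesystem that $TARGET lives on
df -hP "$TARGET"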

grufocom commented 2 years ago

I integrated what you suggested; please check whether that works for you!

sevimo123 commented 2 years ago

Almost ;) df doesn't take regular expressions, just plain paths, so instead of df -hP "$TARGET$" it should be just df -hP "$TARGET" (same for $MPOINT). Otherwise df looks for /mnt/path/to/dir$ instead of /mnt/path/to/dir, and that path doesn't exist. Also, the df header line needs to be skipped, which tail -1 takes care of.

Bottom line: df -hP "$TARGET$"|awk... needs to change to df -hP "$TARGET"|tail -1|awk... (and the same for $MPOINT).
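For example, to pull out the available space (the awk column here is just an illustration, the real script may read other fields):

# df -hP prints a header plus exactly one data line for the filesystem
# that $TARGET belongs to; tail -1 drops the header
FREE=$(df -hP "$TARGET" | tail -1 | awk '{print $4}')   # $4 = Avail, e.g. "231G"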

Apart from that, forking early and not getting stuck on all these dfs seems to work now!

grufocom commented 2 years ago

Without the regex, the target would match more than once if there is more than one mountpoint with the target in its path.

sevimo123 commented 2 years ago

Not sure what you mean: you pass the target (a path) to df, and df reports free space for the filesystem that the target path belongs to. A single path can only belong to one filesystem, so df PATH can only return one entry. This is different from grepping the full df output (how it was implemented before), where you had to differentiate /path/to from /path/to/dir. That problem doesn't arise if you give /path/to to df directly; it always returns the right result, of which there can be only one.
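To illustrate with made-up mountpoints:

# grepping the full listing needs the anchor, otherwise it can match several lines:
df -hP | grep "/mnt/backup"     # could match both /mnt/backup and /mnt/backup2
# asking df directly is unambiguous: one path, one filesystem, one data line
df -hP /mnt/backup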

sevimo123 commented 1 year ago

Unfortunately, commit 773583fb5fda5a60dad97cbad141021b0cd5b334 undid the fixes for the issue of running df on everything (which can potentially take an extremely long time with many mount points), so no, this is not resolved.