romainsi / zabbix-VB-R-SQL

Monitore VB&R with SQL query

Copy Job is idle in Veeam, but data in Zabbix show 0% progress and last end time can not be calculated #20

Open thmcon opened 2 years ago

thmcon commented 2 years ago

Hello,

We have several copy jobs in Veeam. For all but one, the data in Zabbix matches the information shown in the Veeam GUI. I compared the erroneous job "Copy SQL Server" with the correct job "Copy Linux" and collected the following information:

Veeam status information: [image]

Job session status as in the Veeam SQL database: [images: 2022-07-04_09h49_46, 2022-07-04_09h51_43]

JSON which Zabbix receives in both cases: [images: 2022-07-04_09h52_00, 2022-07-04_09h50_12, 2022-07-04_09h50_47]

Current data of jobs "Copy SQL Server" and "Copy Linux": [images: 2022-07-04_09h53_31, 2022-07-04_09h53_15]

=> I am convinced that the issue is not within Zabbix. I can see the different job status information in the Veeam database, but I do not understand why Veeam stores different information in the database for those two jobs. In the Veeam GUI they look exactly the same to me: the last run completed without errors and both jobs are now idle.

Are there certain conditions and scenarios that need to be covered in the script that is executed by the Zabbix agent 2 on the Veeam server? As an administrator I check the Veeam job status via the Veeam GUI and all is fine. I would like to see the same status in Zabbix :-)

aholiveira commented 2 years ago

Hello @thmcon,

The script pulls its information from the last job session in the database, which is why you are also seeing different information in Zabbix. I don't think we should modify the results that the script reads, because there could be other situations where the information is actually correct.

From your other issue I see that your Veeam installation is not running the latest version of Veeam (11.0.1.1261 P20220302). Can you upgrade to that version and check again? It could be an issue with Veeam itself.

I'm managing three different infrastructures where Veeam is installed (in completely different companies). In all of those there are Copy jobs (type 63) which, when idle, show the correct information (100% completed, last end time correctly populated). It is also strange that you get different results for idle jobs of the same job type (51). If that were consistent, then maybe we could modify the script to accommodate it, but as it stands we can't really know which one is right and which one is wrong.

My suggested action plan:

  1. Upgrade Veeam if possible and let the jobs run on the new version. Check if the results are consistent.

  2. If the problem persists, then, if possible, delete the job "Copy SQLServer" and recreate it. Check the results again.

  3. If there are still differences, we could maybe ask Veeam directly about this and about what the expected behaviour would be. It could be because one is a Windows machine and the other is Linux. Not sure if they will reply to this, though.

I can't really test this out since I'm only using Hyper-V (you are using VMware), and I don't have any Sync jobs. I'll see if I can create a test job for that, but I'm not sure about it.

Regards.

thmcon commented 2 years ago

Hello @aholiveira,

I will follow your suggestions: first update Veeam to the latest version and then proceed with the next steps you mentioned. Hopefully I will find a time slot for that tomorrow. I'll let you know the results after the update.

Regards,

thmcon commented 2 years ago

First Update:

  1. Veeam updated to the latest available version. [image]

  2. Copy job "Copy SQLServer" restarted with the option "Active Full" => result: no changes to this job's database record.

  3. Deleted the copy job and created a new copy job with the same settings as before: the job type ID is again 51. After completion of the first run: job progress = 100 (before it was always 0), job state = 9, job end time = real time (before it was always 01-01-1900).

[image]

[image]

In zabbix I do not get a problem any more and the progress is also correct.

If it remains like this, it's fine for me.

Regards,

thmcon commented 2 years ago

Update 2:

What was quite strange: the Veeam GUI did not show any information about the last result of that job (you can see this in the screenshot above).

I started a new sync period manually ("Sync now"), which executed successfully. After that I checked the database records again: [image]

[image]

We now have the same data as before the Veeam update.

I compared the different copy jobs and found that "Copy Linux" (which was created in a previous version of Veeam) shows the same behaviour: [image]

If you need further information, I'll try to provide it. Currently I am out of ideas where to continue...

thmcon commented 2 years ago

Update:

  1. Technically those jobs are sync jobs, not copy jobs (at least when I compare with the job type array from zabbix_vbr_job.ps1): 0 = "Job"; 1 = "Replication"; 2 = "File"; 28 = "Tape"; 51 = "Sync"; 63 = "Copy"; 4030 = "RMAN"; 12002 = "Agent backup policy"; 12003 = "Agent backup job".
  2. To prevent "false problems" caused by the behaviour of the database entries for Veeam sync jobs, I changed the following in my local zabbix-vbr template:
     a. Added an item prototype inside "Veeam jobs discovery" to capture and store the job type. [image]
     b. Added a macro to exclude certain job types from the "too long duration" warning. [image]
     c. Adjusted the trigger that warns if a job does not complete within a certain time, so that it excludes the job types from the macro created in point 2b. [image]
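
The exclusion idea from point 2 can be sketched in PowerShell roughly as follows. The job type table is the one quoted above from zabbix_vbr_job.ps1; the variable names, the excluded list, and the sample jobs are hypothetical and only stand in for the template macro and the discovered jobs. This is not how the Zabbix trigger itself is written.

```powershell
# Job type IDs and names as quoted above from zabbix_vbr_job.ps1.
$JobTypes = @{
    0     = 'Job'
    1     = 'Replication'
    2     = 'File'
    28    = 'Tape'
    51    = 'Sync'
    63    = 'Copy'
    4030  = 'RMAN'
    12002 = 'Agent backup policy'
    12003 = 'Agent backup job'
}

# Hypothetical stand-in for the template macro: job types that should
# not raise the "too long duration" warning.
$ExcludedTypes = @(51)

# Hypothetical sample jobs; real values would come from the Veeam database.
$Jobs = @(
    [pscustomobject]@{ name = 'Copy SQLServer'; job_type = 51 }
    [pscustomobject]@{ name = 'Nightly backup'; job_type = 0 }
)

foreach ($Job in $Jobs) {
    # Resolve the numeric type to its display name.
    $TypeName = $JobTypes[[int]$Job.job_type]
    # Only check the duration when the type is not on the excluded list.
    $CheckDuration = $ExcludedTypes -notcontains $Job.job_type
    '{0} ({1}): duration check = {2}' -f $Job.name, $TypeName, $CheckDuration
}
```

Keeping the excluded types in one list mirrors the macro approach: the duration check itself stays unchanged, and only the list needs editing when another job type turns out to behave like sync jobs.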

Maybe that is a proper handling for the template in general?

aholiveira commented 2 years ago

Hello @thmcon,

Thank you for your update on this. Last week I was rather busy, and this coming week I'll also be quite busy, so I won't have much time to look into this. I had also thought of excluding some of the job types from the trigger, but hadn't got around to doing that yet. On the infrastructures I'm managing there are also some "sync" jobs, but not of the same type. I'll try to investigate this further some time next week. Also, I wanted to try and make this template compliant with the official Zabbix template guidelines. Right now most of it is already compliant, but a few minor changes are still necessary. I'll try to open a pull request with some of those changes, taking your input and some additional tests into consideration.

Let me also take this opportunity to clarify something that might not be clear: @romainsi is the author of this template. I'm simply another user who has found this template useful, and I'm contributing my personal input in the spirit of open source, giving changes which could be useful to others back to the project.

Regards.

romainsi commented 2 years ago

Hello,

Indeed, with BackupSync in continuous mode, the last entry in the history is the current session (it always shows success; a failure creates a new history line) rather than the last result. I modified the script (not yet in the latest release) to query the last 2 results and added this condition:

##line 319
                # Exception BackupSync continuous state
                if ($LastJobSession.job_type -like '51' -and $lastsession.state -like '9') {
                    $LastJobSession = $LastJobSessions | Sort-Object end_time -Descending | Select-Object -Last 1
                }
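
To make the effect of that condition easier to follow, here is a minimal self-contained sketch with made-up session objects. The field names job_type, state, and end_time come from the snippet above; the label field, the sample values, and the assumption that the script first picks the session with the newest end_time are hypothetical and may not match the real script. The snippet above tests $lastsession.state, while this sketch uses $LastJobSession for both checks, which is also an assumption.

```powershell
# Hypothetical sample: the two most recent sessions of one sync job.
$LastJobSessions = @(
    [pscustomobject]@{ label = 'newer'; job_type = 51; state = 9; end_time = [datetime]'2022-07-05 10:00' }
    [pscustomobject]@{ label = 'older'; job_type = 51; state = 9; end_time = [datetime]'2022-07-04 10:00' }
)

# Assumption: the initial pick is the session with the newest end_time.
$LastJobSession = $LastJobSessions | Sort-Object end_time -Descending | Select-Object -First 1

# Same test as in the snippet above, written against a single variable.
if ($LastJobSession.job_type -like '51' -and $LastJobSession.state -like '9') {
    # Switch to the other of the two sessions (the one sorted last).
    $LastJobSession = $LastJobSessions | Sort-Object end_time -Descending | Select-Object -Last 1
}

# Show which of the two sample sessions ended up selected.
$LastJobSession.label
```

With only two sessions in $LastJobSessions, "Select-Object -Last 1" after a descending sort simply returns the one with the earlier end_time, so the condition swaps between the two queried results for type 51 jobs in state 9.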

Thank you for testing, Regards.