vierbergenlars / bareos_exporter

MIT License
12 stars 7 forks source link

BareOS exporter stops working when there multiple filesets with the same name #10

Closed karlism closed 3 years ago

karlism commented 3 years ago

BareOS exporter (version 0.5.6) stops working when there are failed backup jobs:

$ curl -s localhost:9625/metrics
An error has occurred while serving metrics:

480 error(s) occurred:
* collected metric "bareos_jobs_run" { label:<name:"client" value:"devdb2" > label:<name:"fileset" value:"PostgreSQL" > label:<name:"jobname" value:"backup-devdb2-postgresql" > label:<name:"jobtype" value:"B" > counter:<value:8 > } was collected before with the same name and label values
* collected metric "bareos_bytes_saved" { label:<name:"client" value:"devdb2" > label:<name:"fileset" value:"PostgreSQL" > label:<name:"jobname" value:"backup-devdb2-postgresql" > label:<name:"jobtype" value:"B" > counter:<value:0 > } was collected before with the same name and label values
* collected metric "bareos_files_saved" { label:<name:"client" value:"devdb2" > label:<name:"fileset" value:"PostgreSQL" > label:<name:"jobname" value:"backup-devdb2-postgresql" > label:<name:"jobtype" value:"B" > counter:<value:0 > } was collected before with the same name and label values
* collected metric "bareos_last_job_bytes_saved" { label:<name:"client" value:"devdb2" > label:<name:"fileset" value:"PostgreSQL" > label:<name:"jobname" value:"backup-devdb2-postgresql" > label:<name:"jobtype" value:"B" > label:<name:"last_selector" value:"last" > counter:<value:0 > } was collected before with the same name and label values
* collected metric "bareos_last_job_files_saved" { label:<name:"client" value:"devdb2" > label:<name:"fileset" value:"PostgreSQL" > label:<name:"jobname" value:"backup-devdb2-postgresql" > label:<name:"jobtype" value:"B" > label:<name:"last_selector" value:"last" > counter:<value:0 > } was collected before with the same name and label values
* collected metric "bareos_last_job_errors" { label:<name:"client" value:"devdb2" > label:<name:"fileset" value:"PostgreSQL" > label:<name:"jobname" value:"backup-devdb2-postgresql" > label:<name:"jobtype" value:"B" > label:<name:"last_selector" value:"last" > counter:<value:1 > } was collected before with the same name and label values
* collected metric "bareos_last_job_start_unix_timestamp" { label:<name:"client" value:"devdb2" > label:<name:"fileset" value:"PostgreSQL" > label:<name:"jobname" value:"backup-devdb2-postgresql" > label:<name:"jobtype" value:"B" > label:<name:"last_selector" value:"last" > counter:<value:1.626365576e+09 > } was collected before with the same name and label values
* collected metric "bareos_last_job_end_unix_timestamp" { label:<name:"client" value:"devdb2" > label:<name:"fileset" value:"PostgreSQL" > label:<name:"jobname" value:"backup-devdb2-postgresql" > label:<name:"jobtype" value:"B" > label:<name:"last_selector" value:"last" > counter:<value:1.626365577e+09 > } was collected before with the same name and label values
* collected metric "bareos_last_job_status" { label:<name:"client" value:"devdb2" > label:<name:"fileset" value:"PostgreSQL" > label:<name:"jobname" value:"backup-devdb2-postgresql" > label:<name:"jobtype" value:"B" > label:<name:"last_selector" value:"last" > label:<name:"status" value:"C" > counter:<value:0 > } was collected before with the same name and label values
* collected metric "bareos_last_job_status" { label:<name:"client" value:"devdb2" > label:<name:"fileset" value:"PostgreSQL" > label:<name:"jobname" value:"backup-devdb2-postgresql" > label:<name:"jobtype" value:"B" > label:<name:"last_selector" value:"last" > label:<name:"status" value:"R" > counter:<value:0 > } was collected before with the same name and label values
* collected metric "bareos_last_job_status" { label:<name:"client" value:"devdb2" > label:<name:"fileset" value:"PostgreSQL" > label:<name:"jobname" value:"backup-devdb2-postgresql" > label:<name:"jobtype" value:"B" > label:<name:"last_selector" value:"last" > label:<name:"status" value:"B" > counter:<value:0 > } was collected before with the same name and label values
* collected metric "bareos_last_job_status" { label:<name:"client" value:"devdb2" > label:<name:"fileset" value:"PostgreSQL" > label:<name:"jobname" value:"backup-devdb2-postgresql" > label:<name:"jobtype" value:"B" > label:<name:"last_selector" value:"last" > label:<name:"status" value:"T" > counter:<value:0 > } was collected before with the same name and label values
* collected metric "bareos_last_job_status" { label:<name:"client" value:"devdb2" > label:<name:"fileset" value:"PostgreSQL" > label:<name:"jobname" value:"backup-devdb2-postgresql" > label:<name:"jobtype" value:"B" > label:<name:"last_selector" value:"last" > label:<name:"status" value:"E" > counter:<value:0 > } was collected before with the same name and label values
* collected metric "bareos_last_job_status" { label:<name:"client" value:"devdb2" > label:<name:"fileset" value:"PostgreSQL" > label:<name:"jobname" value:"backup-devdb2-postgresql" > label:<name:"jobtype" value:"B" > label:<name:"last_selector" value:"last" > label:<name:"status" value:"e" > counter:<value:0 > } was collected before with the same name and label values

Here's database entries related to this backup job:

$ echo "select * from job where name='backup-devdb2-postgresql';" | psql bareos -x
-[ RECORD 1 ]---+---------------------------------------------------
jobid           | 253
job             | backup-devdb2-postgresql.2021-07-15_15.29.01_08
name            | backup-devdb2-postgresql
type            | B
level           | F
clientid        | 4
jobstatus       | f
schedtime       | 2021-07-15 15:29:01
starttime       | 2021-07-15 15:29:03
endtime         | 2021-07-15 15:29:03
realendtime     | 2021-07-15 15:29:03
jobtdate        | 1626362943
volsessionid    | 215
volsessiontime  | 1625060238
jobfiles        | 0
jobbytes        | 0
readbytes       | 0
joberrors       | 1
jobmissingfiles | 0
poolid          | 3
filesetid       | 17
priorjobid      | 0
purgedfiles     | 0
hasbase         | 0
hascache        | 1
reviewed        | 0
comment         | 
-[ RECORD 2 ]---+---------------------------------------------------
jobid           | 247
job             | backup-devdb2-postgresql.2021-07-15_14.09.47_53
name            | backup-devdb2-postgresql
type            | B
level           | F
clientid        | 4
jobstatus       | f
schedtime       | 2021-07-15 14:09:47
starttime       | 2021-07-15 14:09:49
endtime         | 2021-07-15 14:09:50
realendtime     | 2021-07-15 14:09:50
jobtdate        | 1626358190
volsessionid    | 209
volsessiontime  | 1625060238
jobfiles        | 0
jobbytes        | 0
readbytes       | 0
joberrors       | 2
jobmissingfiles | 0
poolid          | 3
filesetid       | 14
priorjobid      | 0
purgedfiles     | 0
hasbase         | 0
hascache        | 1
reviewed        | 0
comment         | 
-[ RECORD 3 ]---+---------------------------------------------------
jobid           | 244
job             | backup-devdb2-postgresql.2021-07-15_10.13.56_17
name            | backup-devdb2-postgresql
type            | B
level           | F
clientid        | 4
jobstatus       | f
schedtime       | 2021-07-15 10:13:56
starttime       | 2021-07-15 10:13:59
endtime         | 2021-07-15 10:13:59
realendtime     | 2021-07-15 10:13:59
jobtdate        | 1626344039
volsessionid    | 206
volsessiontime  | 1625060238
jobfiles        | 0
jobbytes        | 0
readbytes       | 0
joberrors       | 1
jobmissingfiles | 0
poolid          | 3
filesetid       | 15
priorjobid      | 0
purgedfiles     | 0
hasbase         | 0
hascache        | 1
reviewed        | 0
comment         | 
-[ RECORD 4 ]---+---------------------------------------------------
jobid           | 250
job             | backup-devdb2-postgresql.2021-07-15_14.28.29_23
name            | backup-devdb2-postgresql
type            | B
level           | F
clientid        | 4
jobstatus       | f
schedtime       | 2021-07-15 14:28:29
starttime       | 2021-07-15 14:28:31
endtime         | 2021-07-15 14:28:32
realendtime     | 2021-07-15 14:28:32
jobtdate        | 1626359312
volsessionid    | 212
volsessiontime  | 1625060238
jobfiles        | 0
jobbytes        | 0
readbytes       | 0
joberrors       | 2
jobmissingfiles | 0
poolid          | 3
filesetid       | 14
priorjobid      | 0
purgedfiles     | 0
hasbase         | 0
hascache        | 1
reviewed        | 0
comment         | 
-[ RECORD 5 ]---+---------------------------------------------------
jobid           | 242
job             | backup-devdb2-postgresql.2021-07-15_10.00.18_21
name            | backup-devdb2-postgresql
type            | B
level           | F
clientid        | 4
jobstatus       | f
schedtime       | 2021-07-15 10:00:18
starttime       | 2021-07-15 10:00:21
endtime         | 2021-07-15 10:00:21
realendtime     | 2021-07-15 10:00:21
jobtdate        | 1626343221
volsessionid    | 204
volsessiontime  | 1625060238
jobfiles        | 0
jobbytes        | 0
readbytes       | 0
joberrors       | 1
jobmissingfiles | 0
poolid          | 3
filesetid       | 14
priorjobid      | 0
purgedfiles     | 0
hasbase         | 0
hascache        | 1
reviewed        | 0
comment         | 
-[ RECORD 6 ]---+---------------------------------------------------
jobid           | 248
job             | backup-devdb2-postgresql.2021-07-15_14.25.03_05
name            | backup-devdb2-postgresql
type            | B
level           | F
clientid        | 4
jobstatus       | f
schedtime       | 2021-07-15 14:25:03
starttime       | 2021-07-15 14:25:06
endtime         | 2021-07-15 14:25:06
realendtime     | 2021-07-15 14:25:06
jobtdate        | 1626359106
volsessionid    | 210
volsessiontime  | 1625060238
jobfiles        | 0
jobbytes        | 0
readbytes       | 0
joberrors       | 1
jobmissingfiles | 0
poolid          | 3
filesetid       | 14
priorjobid      | 0
purgedfiles     | 0
hasbase         | 0
hascache        | 1
reviewed        | 0
comment         | 
-[ RECORD 7 ]---+---------------------------------------------------
jobid           | 243
job             | backup-devdb2-postgresql.2021-07-15_10.07.54_05
name            | backup-devdb2-postgresql
type            | B
level           | F
clientid        | 4
jobstatus       | f
schedtime       | 2021-07-15 10:07:54
starttime       | 2021-07-15 10:07:57
endtime         | 2021-07-15 10:07:57
realendtime     | 2021-07-15 10:07:57
jobtdate        | 1626343677
volsessionid    | 205
volsessiontime  | 1625060238
jobfiles        | 0
jobbytes        | 0
readbytes       | 0
joberrors       | 1
jobmissingfiles | 0
poolid          | 3
filesetid       | 15
priorjobid      | 0
purgedfiles     | 0
hasbase         | 0
hascache        | 1
reviewed        | 0
comment         | 
-[ RECORD 8 ]---+---------------------------------------------------
jobid           | 252
job             | backup-devdb2-postgresql.2021-07-15_15.17.23_04
name            | backup-devdb2-postgresql
type            | B
level           | F
clientid        | 4
jobstatus       | f
schedtime       | 2021-07-15 15:17:23
starttime       | 2021-07-15 15:17:26
endtime         | 2021-07-15 15:17:26
realendtime     | 2021-07-15 15:17:26
jobtdate        | 1626362246
volsessionid    | 214
volsessiontime  | 1625060238
jobfiles        | 0
jobbytes        | 0
readbytes       | 0
joberrors       | 1
jobmissingfiles | 0
poolid          | 3
filesetid       | 16
priorjobid      | 0
purgedfiles     | 0
hasbase         | 0
hascache        | 1
reviewed        | 0
comment         | 
-[ RECORD 9 ]---+---------------------------------------------------
jobid           | 246
job             | backup-devdb2-postgresql.2021-07-15_14.06.32_39
name            | backup-devdb2-postgresql
type            | B
level           | F
clientid        | 4
jobstatus       | f
schedtime       | 2021-07-15 14:06:32
starttime       | 2021-07-15 14:06:34
endtime         | 2021-07-15 14:06:35
realendtime     | 2021-07-15 14:06:35
jobtdate        | 1626357995
volsessionid    | 208
volsessiontime  | 1625060238
jobfiles        | 0
jobbytes        | 0
readbytes       | 0
joberrors       | 1
jobmissingfiles | 0
poolid          | 3
filesetid       | 14
priorjobid      | 0
purgedfiles     | 0
hasbase         | 0
hascache        | 1
reviewed        | 0
comment         | 

Judging by Prometheus up metric, I can see that exporter stopped working around 10:08, thus I conclude that this failure was triggered by jobid 243. I'm using BareOS 20.0.1 and PostgreSQL 13 backend database.

vierbergenlars commented 3 years ago

Can you check the results for the following query: SELECT j.Name, j.Type, j.ClientId, j.FileSetId, COALESCE(c.Name, ''), COALESCE(f.FileSet, ''), COUNT(*), SUM(j.JobBytes), SUM(j.JobFiles) FROM job j LEFT JOIN client c ON c.ClientId = j.ClientId LEFT JOIN fileset f ON f.FileSetId = j.FileSetId GROUP BY j.Name, j.Type, j.ClientId, j.FileSetId, c.Name, f.FileSet HAVING MAX(j.SchedTime) >= '2021-07-15 00:00:00'

For my case, the first column (j.Name) (the job ID) is unique, but from the query output you posted, it looks like you have multiple different filesets for the same job name, which I did not know could happen during normal operations.

To determine how bareos_exporter can handle this situation, it would be helpful to know how this situation occured. It looks like you were trying to get a job working, and trying to get it working. So I wonder if you modified the fileset at som e point? In that case, the failed jobs are just not the actual problem, but multiple rows for a job name being returned. As a quick fix, you can delete the failed jobs with delete jobid=? in bconsole. Or you can wait for 7 days (default, or the number specified with the -job-discovery-days argument) until the jobs with an obsolete fileset ID are outside the window considered by bareos_exporter.

vierbergenlars commented 3 years ago

Hmm, I tried to reproduce this case, where a job has different filesets, and that appears to work.

I have no clue what exactly goes wrong here.

@karlism if you can run the query from my previous comment, that would be very helpful.

I don't know what to look for exactly yet, but the combination (j.Name, j.Type, c.Name, f.FileSet) should be unique in the results.

Although it is not enforced in the database schema, I believe j.ClientId and c.Name (client name) map one-to-one to each other, as well as j.FileSetId and f.FileSet. It might be interesting to look at your client and fileset tables to see if my assumption there is correct.

karlism commented 3 years ago

@vierbergenlars, thanks for looking into this!

echo "SELECT j.Name, j.Type, j.ClientId, j.FileSetId, COALESCE(c.Name, ''), COALESCE(f.FileSet, ''), COUNT(*), SUM(j.JobBytes), SUM(j.JobFiles) FROM job j LEFT JOIN client c ON c.ClientId = j.ClientId LEFT JOIN fileset f ON f.FileSetId = j.FileSetId GROUP BY j.Name, j.Type, j.ClientId, j.FileSetId, c.Name, f.FileSet HAVING MAX(j.SchedTime) >= '2021-07-15 00:00:00';" | psql bareos
                  name                   | type | clientid | filesetid |    coalesce    |         coalesce         | count |     sum     |   sum   
-----------------------------------------+------+----------+-----------+----------------+--------------------------+-------+-------------+---------
 backup-xxxdeploytest1-vm-xxxdeploytest1 | B    |       11 |        21 | xxxdeploytest1 | vm-xxxdeploytest1        |     6 |  8274234063 |      10
 backup-xxxdevketcd1-etc                 | B    |        2 |         2 | xxxdevketcd1   | LinuxEtc                 |    23 |    11123529 |    1682
 backup-xxxawx1-postgresql               | B    |       13 |        28 | xxxawx1        | PostgreSQL_13_RedHat     |     4 |           0 |       0
 backup-xxxdevdb2-postgresql             | B    |        4 |        17 | xxxdevdb2      | PostgreSQL               |     2 |           0 |       0
 backup-xxxdevdb2-postgresql             | B    |        4 |        27 | xxxdevdb2      | PostgreSQL               |     1 |           0 |       0
 backup-yyydevketcd1-etc                 | B    |        3 |         2 | yyydevketcd1   | LinuxEtc                 |    22 |    11122419 |    1681
 backup-xxxdevdb2-postgresqlbasebackup   | B    |        4 |         3 | xxxdevdb2      | PostgreSQL_pg_basebackup |    29 |   160831503 |   36801
 backup-xxxdevdb2-postgresql             | B    |        4 |        26 | xxxdevdb2      | PostgreSQL               |     1 |           0 |       0
 backup-xxxdevdb2-postgresql             | B    |        4 |        14 | xxxdevdb2      | PostgreSQL               |     9 |     5516504 |    1287
 backup-xxxdevdb2-postgresql             | B    |        4 |        24 | xxxdevdb2      | PostgreSQL               |     1 |           0 |       0
 backup-xxxexchange1-exchangeconf        | B    |        8 |        10 | xxxexchange1   | Exchange_Conf            |     7 | 10254884770 |  133607
 backup-yyyexchange1-windowsc            | B    |        5 |         9 | yyyexchange1   | WindowsC                 |    14 | 61807025613 | 1053778
 backup-zzzdeploytest3-etc               | B    |        6 |         2 | zzzdeploytest3 | LinuxEtc                 |    17 |      539737 |    1315
 backup-xxxexchange1-windowsc            | B    |        8 |         9 | xxxexchange1   | WindowsC                 |     9 | 28717961249 |  526179
 backup-xxxdevdb2-postgresql             | B    |        4 |        16 | xxxdevdb2      | PostgreSQL               |     1 |           0 |       0
 backup-yyydeploytest1-etc               | B    |       12 |         2 | yyydeploytest1 | LinuxEtc                 |     3 |     5424299 |     795
 backup-xxxdeploytest1-etc               | B    |       11 |         2 | xxxdeploytest1 | LinuxEtc                 |     5 |     5436981 |     829
 backup-xxxdevdb2-postgresql             | B    |        4 |        25 | xxxdevdb2      | PostgreSQL               |     1 |           0 |       0
 backup-xxxdevdb2-postgresql             | B    |        4 |        19 | xxxdevdb2      | PostgreSQL               |     8 |    27681914 |    6439
 backup-xxxad1-windowsc                  | B    |        9 |         9 | xxxad1         | WindowsC                 |     7 | 17714683477 |  397606
 RestoreFiles                            | R    |        4 |         0 | xxxdevdb2      |                          |     2 |   107559404 |    2559
 backup-yyydevketcd1-etcd                | B    |        3 |        13 | yyydevketcd1   | etcd_all                 |    13 |   132666174 |      76
 backup-yyygit1-etc                      | B    |        7 |         2 | yyygit1        | LinuxEtc                 |    17 |     7283331 |    5728
 backup-xxxdevdb2-postgresql             | B    |        4 |        18 | xxxdevdb2      | PostgreSQL               |     3 |           0 |       0
 backup-yyyexchange1-exchangedb          | B    |        5 |        11 | yyyexchange1   | Exchange_DB              |    38 |  3148882950 |   67267
 backup-yyyexchange1-exchangeconf        | B    |        5 |        10 | yyyexchange1   | Exchange_Conf            |    15 | 21539936089 |  251007
 backup-yyydeploytest1-vm-yyydeploytest1 | B    |       12 |        23 | yyydeploytest1 | vm-yyydeploytest1        |     8 |  8725194082 |      10
 backup-xxxexchange1-exchangedb          | B    |        8 |        11 | xxxexchange1   | Exchange_DB              |    10 |  1954248567 |   42427
 backup-yyygit1-gitlab_backup            | B    |        7 |         6 | yyygit1        | GitLab_backup            |    17 | 42480888703 |      37
 backup-yyyad1-windowsc                  | B    |       10 |         9 | yyyad1         | WindowsC                 |     7 | 16640927093 |  391981
 RestoreFiles                            | R    |       12 |         0 | yyydeploytest1 |                          |     9 | 13031844378 |       6
 backup-yyybackup1-mycatalog             | B    |        1 |         1 | yyybackup1     | Catalog                  |    22 |  1469221104 |    3303
 backup-xxxdeploytest1-vm-xxxdeploytest1 | B    |       11 |        20 | xxxdeploytest1 | vm-xxxdeploytest1        |     1 |           0 |       0
 backup-xxxdevdb2-postgresql             | B    |        4 |        28 | xxxdevdb2      | PostgreSQL_13_RedHat     |     4 |           0 |       0
 backup-yyydeploytest1-vm-yyydeploytest1 | B    |       12 |        22 | yyydeploytest1 | vm-yyydeploytest1        |     1 |           0 |       0
 backup-xxxdevdb2-postgresql             | B    |        4 |        15 | xxxdevdb2      | PostgreSQL               |     2 |           0 |       0
(36 rows)

Yes, it is possible that fileset has been changed at some point as we are now in process of switching from different backup solution to BareOS so me and my colleagues are doing lots of tests and adjustments to existing BareOS configuration.

vierbergenlars commented 3 years ago

According to this, bareos can have multiple entries with the same fileset name. Bareos will select the newest one.

When creating a new fileset record, bareos will first check if one with the same name and MD5 sum exists (https://github.com/bareos/bareos/blob/a89617abc0f40425a1a9faa00c1825200be6777a/core/src/cats/sql_create.cc#L702-L705), if it does not, it will insert a new record (https://github.com/bareos/bareos/blob/a89617abc0f40425a1a9faa00c1825200be6777a/core/src/cats/sql_create.cc#L747-L750).

So, multiple filesets with the same name and different ids should be handled. Ideally, they are handled as the single fileset they are. The difference between these is the contents of include/exclude rules, which we don't care about as a metrics exporter.

karlism commented 3 years ago

@vierbergenlars, thanks a lot for fixing this! I will try it out as soon as 0.5.7 binaries are published.