wal-e / wal-e

Continuous Archiving for Postgres
BSD 3-Clause "New" or "Revised" License

FEATURE REQUEST: Method of monitoring completeness of WALs #309

Open bdurrow opened 7 years ago

bdurrow commented 7 years ago

We use Nagios to monitor our infrastructure. We have deployed https://github.com/APSL/postgresql-wal-e-nagios to confirm that we have current backups, but we would also like to know if WALs are late or if any are missing from the sequence. check-barman checks that the most recent WAL file is new enough and then checks for the presence of all WAL segments between the last backup and this most recent WAL file. I started adapting this logic to GCS (where we point WAL-E for storage), but the logic is nontrivial. It seems like we could leverage the logic that currently supports WAL deletion to do this. I'm willing to do this work (and test it in GCS), and I would like to contribute it back to this project. Do you have any pointers to offer someone in my position?
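For illustration, the gap check described above can be sketched in a few lines of Python. This is not WAL-E or check-barman code: the function names are hypothetical, the input is assumed to be a plain list of 24-hex-digit segment names (any storage prefix or compression suffix already stripped), and it assumes PostgreSQL 9.3 or later with the default 16 MB segment size, so that each log number holds segments 0x00 through 0xFF.

```python
import re

# Assumption: PostgreSQL 9.3+ with 16 MB segments, i.e. 0x100 segments
# per log number. Older versions and other segment sizes differ.
SEGMENTS_PER_LOG = 0x100
WAL_NAME = re.compile(r"[0-9A-F]{24}")


def parse_segment(name):
    """Split a WAL segment name into (timeline, linear segment number)."""
    if not WAL_NAME.fullmatch(name):
        raise ValueError("not a WAL segment name: %r" % name)
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return timeline, log * SEGMENTS_PER_LOG + seg


def format_segment(timeline, number):
    """Inverse of parse_segment."""
    log, seg = divmod(number, SEGMENTS_PER_LOG)
    return "%08X%08X%08X" % (timeline, log, seg)


def find_missing(names, first):
    """Return names absent between `first` and the newest archived segment.

    Only the timeline of `first` is considered; timeline switches are
    out of scope for this sketch.
    """
    timeline, start = parse_segment(first)
    present = set()
    for name in names:
        tl, number = parse_segment(name)
        if tl == timeline and number >= start:
            present.add(number)
    if not present:
        return [first]
    newest = max(present)
    return [format_segment(timeline, n)
            for n in range(start, newest + 1) if n not in present]
```

Here `first` would be the segment at which the last base backup starts, and `names` would come from listing the archive bucket. A non-empty result is a candidate for a CRITICAL alert.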

fdr commented 7 years ago

So, interestingly, I thought Near The Beginning Years Ago that this feature would be necessary for WAL-E. Since then I have encountered WAL gaps literally zero times, in which time WAL-E has archived easily a number of petabytes.

That said, I think it is still a good idea to have a feature to work with listing WAL. It would be quite useful to locate the last WAL segment, for example.

So, my pointer is: I might choose to get a bit more bang out of my implementation buck by writing such a feature with an eye towards performance and size statistics rather than being fixated on checking for gaps.

I welcome your changes. Consider writing a prototype first.

bdurrow commented 7 years ago

@fdr, thank you for your feedback. When you suggest writing a prototype first, what do you mean?

fdr commented 7 years ago

Basically, submit something that touches the features you want, but don't dwell too much on testing, naming things, or refactoring. I would estimate such a prototype might take 25% of the time of the entire project.

pcarranza commented 7 years ago

I have an interest in this from the Prometheus perspective. I'm not sure how that would fit with Nagios.

I was thinking that the easiest way would be to signal completion of a WAL segment to something else, maybe by using some notifier or even just dumping progress to a file that can be picked up by a node exporter.

That would mean having the ability to register this notifier, or just writing to a file on each WAL archive group completion.

Would something like this make sense? @fdr
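As a rough sketch of the "write to a file" variant: the node exporter's textfile collector reads `*.prom` files from a configured directory, so a hook that runs after each successful archive could write one small file there. The metric name, file name, and function below are hypothetical and not part of WAL-E.

```python
import os
import tempfile
import time

# Hypothetical metric name, chosen for this sketch only.
METRIC = "wal_archive_last_success_timestamp_seconds"


def write_archive_metric(directory, filename="wal_archive.prom", now=None):
    """Atomically record the time of the last completed WAL archive."""
    now = time.time() if now is None else now
    body = (
        "# HELP %s Unix time of the last completed WAL archive.\n"
        "# TYPE %s gauge\n"
        "%s %d\n" % (METRIC, METRIC, METRIC, int(now))
    )
    target = os.path.join(directory, filename)
    # Write to a temporary file in the same directory, then rename it,
    # so a scrape never sees a half-written file.
    tmp = tempfile.NamedTemporaryFile(
        mode="w", dir=directory, suffix=".tmp", delete=False)
    try:
        with tmp:
            tmp.write(body)
        # Temporary files are owner-only by default; the exporter may
        # run as a different user.
        os.chmod(tmp.name, 0o644)
        os.replace(tmp.name, target)
    except BaseException:
        os.unlink(tmp.name)
        raise
    return target
```

An alert can then fire when `time() - wal_archive_last_success_timestamp_seconds` exceeds whatever archive delay is acceptable for the cluster.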

fdr commented 7 years ago

There are a couple of ways to do this, probably more roundabout than you'd like but also compatible with more archivers.

One is to monitor the postgres archive_status directory, e.g. with inotify. It has a simple naming scheme around "$NAME.ready" and "$NAME.done" segments.

That's a bit roundabout, but would it work?
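A minimal sketch of that idea, using periodic polling instead of inotify since the Python standard library has no inotify binding. The function name is hypothetical; `status_dir` is assumed to be the cluster's `archive_status` directory (under `pg_xlog` before PostgreSQL 10, `pg_wal` from 10 onward).

```python
import os
import time


def archive_backlog(status_dir, now=None):
    """Return (pending_count, oldest_pending_age_seconds).

    A "$NAME.ready" file marks a segment still waiting to be archived;
    the archiver renames it to "$NAME.done" on success.
    """
    now = time.time() if now is None else now
    count = 0
    oldest = None
    with os.scandir(status_dir) as entries:
        for entry in entries:
            if not entry.name.endswith(".ready"):
                continue
            try:
                mtime = entry.stat().st_mtime
            except FileNotFoundError:
                # Renamed to .done between listing and stat: archived.
                continue
            count += 1
            if oldest is None or mtime < oldest:
                oldest = mtime
    age = 0.0 if oldest is None else max(0.0, now - oldest)
    return count, age
```

A growing count, or an oldest `.ready` file that keeps aging, suggests the archiver is falling behind or failing. This works with any archiver because it only looks at what Postgres itself records.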

SuperQ commented 7 years ago

@pcarranza If there is Prometheus metrics output, it is easy enough to grab the current value with whatever tool or language you like and create a Nagios check from that.
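A sketch of such a check, assuming the metrics are available as Prometheus text exposition output and contain an unlabelled gauge holding the Unix time of the last successful archive. The default metric name and both function names are hypothetical; the return codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def read_metric(text, name):
    """Return the value of an unlabelled metric, or None if absent."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


def check_archive_age(text, warn, crit, now=None,
                      name="wal_archive_last_success_timestamp_seconds"):
    """Map the age of the last archive onto a Nagios status and message."""
    now = time.time() if now is None else now
    last = read_metric(text, name)
    if last is None:
        return UNKNOWN, "UNKNOWN - metric %s not found" % name
    age = now - last
    if age >= crit:
        return CRITICAL, "CRITICAL - last WAL archived %ds ago" % age
    if age >= warn:
        return WARNING, "WARNING - last WAL archived %ds ago" % age
    return OK, "OK - last WAL archived %ds ago" % age
```

A plugin wrapper would print the message and pass the status to `sys.exit`.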