openstreetmap / osmdbt

OSM Database Replication Tools
GNU General Public License v3.0
24 stars 6 forks source link

Osm-logical: sanity checks #16

Closed mmd-osm closed 4 years ago

mmd-osm commented 4 years ago

I want to change "sanity check" / "paranoia checks" in osm-logical to emit WARNING (+ ERROR?) log messages. This is in line with how wal2json handles unexpected critical situations: https://github.com/eulerto/wal2json/blob/master/wal2json.c#L1373

Previously, failed paranoia checks would simply add a string like "X NULL id or version" to the output. Once a critical error occurs, osmdbt-get-log should at least report this somehow, and depending on the severity of the issue instead return an error code to stop further processing.

At the moment, osmdbt-get-log would only display:

[ 0:00] Reading replication log...
[ 0:00] No changes found.
[ 0:00] Did not write log file

WIth the proposed changes in #17, this turns into:

[ 0:00] Reading replication log...
[ 0:00] Logical decoding output plugin 'osm-logical' log messages:
[ 0:00] WARNING:  NULL id, version or changeset id in table "relations" (10001 times)
[ 0:00] 
[ 0:00] There are 1 entries in the replication log.

Alternative as ERROR:

[ 0:00] Reading replication log...
ERROR:  NULL id, version or changeset id in table "relations"
CONTEXT:  slot "rs2", output plugin "osm-logical", in the change callback, associated LSN AB/5F7E9390

A bit of fine tuning might be required here to decide which log messages should be raised as warning (hence not aborting the process), and which are ERRORs that abort the process. One of the downsides of "ERROR" is that catching up via "pg_logical_slot_get_changes" will not be possible anymore, as Postgresql has to process the faulty entries first in order to get rid of them (which is kind of futile). WARNING seems like a better choice here.


To simulate this error I used the following steps:

psql -d openstretmap
alter table relations rename column changeset_id to changeset_id2;

Insert a new row into table relations

Run osmdbt-get-log

Cleanup: Change column name back after test

There's probably a number of ways to trigger an issue. Something like alter table relations alter column changeset_id drop not null;, and inserting NULL values for the changeset_id would of course also do

Originally posted by @mmd-osm in https://github.com/openstreetmap/osmdbt/pull/12#issuecomment-656873548

joto commented 4 years ago

I don't think the added complexity is worth it here. Either these are errors which can actually happen in real life in which case we must handle them. Or they are not, which I think is the case here. But then doing the simplest thing is enough to get at least something when testing or so. And that's what we have. We can spend forever coming up with scenarios how this can all fail, but at some point we have to ask ourselves whether it is worth it.

mmd-osm commented 4 years ago

I see that those errors are currently printed to stderr, so they're no totally lost:

        } else if (parts[2] == "X") {
            std::cerr << "Error found in logfile: " << line << '\n';

I think the basic question here is, what would a sysadmin do with those messages? Would they be part of some monitoring, or disappear into void?

At least the elog warnings would also appear as Warning in the Postgresql log files, where they might be easier to consume for an automated monitoring. Unfortunately, I'm not aware of any guidelines here, so this may all a bit theoretical after all.