mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0
278 stars 87 forks

3rd party service monitoring #639

Open pypt opened 4 years ago

pypt commented 4 years ago

Now that we consume 3rd party APIs as part of our core pipeline, we need some system for monitoring them to find out when they are down. This should be a sanity test to make sure the APIs we consume are still working, and email us or alert us in some way when they aren't.

@pypt suggested a Munin plugin could work for this. Should we queue this up as a dev task on the main MC repo, or is there some other solution that would be better?

pypt commented 4 years ago

I now think that having a separate Munin plugin monitor something like the Facebook Graph API would not be the optimal solution, because 1) we would have to write and maintain a separate implementation of the same 3rd party API client code in the plugin; and 2) the plugin couldn't possibly test for all of the border cases.

I would therefore like to propose a possibly better approach: distinguishing between soft and hard failures, and quitting the process on the latter.

In any part of our processing chain, two types of errors can occur:

- Soft failures: errors that affect only the current job (e.g. one particular URL can't be processed); other jobs might still succeed.
- Hard failures: errors that make any further processing pointless; the worker can't usefully continue.

Right now, we treat both kinds of errors the same: we throw them as the same type of generic exception, and then the caller either catches them or lets the process crash. I propose that after segregating those two kinds of errors (by throwing two different types of exceptions), we also treat them differently:

- On soft failures, the caller retries the job (or fails just that one job) and keeps the worker running.
- On hard failures, the caller logs the error and exits the process, so that standard process-level monitoring notices the dead worker and alerts us.

The simplest way to report the different kinds of errors is with exceptions:

class McResearchException(Exception):
    """Problems in research()."""
    pass

class McResearchSoftFailureException(McResearchException):
    """Soft errors in research()."""
    pass

class McResearchHardFailureException(McResearchException):
    """Hard errors in research()."""
    pass
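A nice side effect of deriving both from a common base class is that callers which don't care about the soft/hard distinction can still handle everything with a single `except McResearchException` clause. A minimal self-contained sketch (the `describe()` helper is hypothetical, added just for illustration):

```python
class McResearchException(Exception):
    """Base class for all research() problems."""

class McResearchSoftFailureException(McResearchException):
    """Transient, retriable errors."""

class McResearchHardFailureException(McResearchException):
    """Fatal errors after which the worker should quit."""

def describe(ex: Exception) -> str:
    """Catch soft and hard failures alike via the shared base class."""
    try:
        raise ex
    except McResearchException as caught:
        return f"caught a research problem: {caught}"

result = describe(McResearchSoftFailureException("transient hiccup"))
print(result)  # prints: caught a research problem: transient hiccup
```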

While processing, the code should throw a different exception depending on the type of error:

def research_url(url):
    """Do important researchy stuff."""
    if url.is_invalid():
        raise McResearchSoftFailureException(
            "This particular URL didn't work, but maybe the other URL go through."
        )

    if funding.is_gone():
        raise McResearchHardFailureException(
            "Can't do nothing without money in this world so this is a hard error."
        )
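For a real HTTP-based API such as the Graph API, the soft/hard decision could be driven by the status code of the response. The mapping below is purely my assumption for illustration; a real client would need to consult the API's error documentation:

```python
class McResearchException(Exception): pass
class McResearchSoftFailureException(McResearchException): pass
class McResearchHardFailureException(McResearchException): pass

def exception_for_status(status_code: int) -> McResearchException:
    """Map an HTTP status code to a soft or hard failure (assumed mapping)."""
    if status_code in (429, 500, 502, 503, 504):
        # Rate limiting / transient server trouble: worth retrying later.
        return McResearchSoftFailureException(f"transient API error: HTTP {status_code}")
    if status_code in (401, 403):
        # Bad or expired credentials: retrying the same call won't help.
        return McResearchHardFailureException(f"permanent API error: HTTP {status_code}")
    # Anything we didn't anticipate: be conservative and treat it as hard.
    return McResearchHardFailureException(f"unexpected API error: HTTP {status_code}")
```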

The caller code (e.g. a Celery worker) then catches those different kinds of exceptions. On soft errors it either retries the operation (if that's viable) or rethrows the exception (failing the current Celery job while keeping the worker itself running); on hard errors it exits the worker:

import sys

retries = 3
for retry in range(retries):
    try:
        research_url(url)
    except McResearchSoftFailureException as ex:
        l.info(f"Something has failed while doing research, but I'll retry: {ex}")
        continue
    except McResearchHardFailureException as ex:
        l.error(f"Critical error while doing research, can't continue: {ex}")
        sys.exit(1)
    except Exception as ex:
        # We haven't anticipated this kind of exception, so assume it's a hard error too
        l.error(f"Some other kind of exception happened that we haven't planned for: {ex}")
        sys.exit(1)

    l.info("Research completed, time to get published!")
    break

In my opinion, the soft-hard error approach has the following advantages:

- We don't have to write and maintain a second implementation of each 3rd party API client just for monitoring; the production code itself reports when an API is broken.
- The check exercises the exact code paths (including the border cases) that production uses.
- A worker that has exited on a hard failure is trivially detectable by standard process monitoring, which can then email or alert us.

I rewrote facebook-fetch-story-stats in Python with soft-hard error separation added, to serve as a demo of how this would work:

https://github.com/berkmancenter/mediacloud/pull/640

Please let me know if this approach looks reasonable to you or if you see any downsides to it that I didn't think of.