mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0
278 stars 87 forks

3rd party service monitoring #639

Open pypt opened 4 years ago

pypt commented 4 years ago

Now that we consume 3rd party APIs as part of our core pipeline, we need some system for monitoring them to find out when they are down. This should be a sanity test to make sure the APIs we consume are still working, and email us or alert us in some way when they aren't.

@pypt suggested a Munin plugin could work for this. Should we queue this up as a dev task on the main MC repo, or is there some other solution that would be better?

pypt commented 4 years ago

I now think that having a separate Munin plugin monitor something like the Facebook Graph API would not be the optimal solution, because 1) we would have to write and maintain a separate implementation of the same 3rd party API client code in the plugin; and 2) the plugin couldn't possibly test for all of the border cases.

I would therefore like to propose a possibly better approach: distinguishing between soft and hard failures, and quitting the process on the latter.

In any part of our processing chain, two types of errors can occur:

- Soft failures: errors that affect only the current job (e.g. one particular URL can't be processed); other jobs might still succeed.
- Hard failures: errors that make any further processing pointless; the worker can't usefully continue.

Right now, we treat both kinds of errors the same: we throw them as the same type of generic exception, and then the caller either catches them or lets the process crash. I propose that after segregating those two kinds of errors (by throwing two different types of exceptions), we also treat them differently:

- On soft failures, the caller retries the job (or fails just that one job) and keeps the worker running.
- On hard failures, the caller logs the error and exits the process, so that standard process-level monitoring notices the dead worker and alerts us.

The simplest way to report the different kinds of errors is with exceptions:

class McResearchException(Exception):
    """Problems in research()."""
    pass

class McResearchSoftFailureException(McResearchException):
    """Soft errors in research()."""
    pass

class McResearchHardFailureException(McResearchException):
    """Hard errors in research()."""
    pass
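A nice side effect of deriving both from a common base class is that callers which don't care about the soft/hard distinction can still handle everything with a single `except McResearchException` clause. A minimal self-contained sketch (the `describe()` helper is hypothetical, added just for illustration):

```python
class McResearchException(Exception):
    """Base class for all research() problems."""

class McResearchSoftFailureException(McResearchException):
    """Transient, retriable errors."""

class McResearchHardFailureException(McResearchException):
    """Fatal errors after which the worker should quit."""

def describe(ex: Exception) -> str:
    """Catch soft and hard failures alike via the shared base class."""
    try:
        raise ex
    except McResearchException as caught:
        return f"caught a research problem: {caught}"

result = describe(McResearchSoftFailureException("transient hiccup"))
print(result)  # prints: caught a research problem: transient hiccup
```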

While processing, the code should throw a different exception depending on the type of error:

def research_url(url):
    """Do important researchy stuff."""
    if url.is_invalid():
        raise McResearchSoftFailureException(
            "This particular URL didn't work, but maybe the other URL go through."
        )

    if funding.is_gone():
        raise McResearchHardFailureException(
            "Can't do nothing without money in this world so this is a hard error."
        )
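For a real HTTP-based API such as the Graph API, the soft/hard decision could be driven by the status code of the response. The mapping below is purely my assumption for illustration; a real client would need to consult the API's error documentation:

```python
class McResearchException(Exception): pass
class McResearchSoftFailureException(McResearchException): pass
class McResearchHardFailureException(McResearchException): pass

def exception_for_status(status_code: int) -> McResearchException:
    """Map an HTTP status code to a soft or hard failure (assumed mapping)."""
    if status_code in (429, 500, 502, 503, 504):
        # Rate limiting / transient server trouble: worth retrying later.
        return McResearchSoftFailureException(f"transient API error: HTTP {status_code}")
    if status_code in (401, 403):
        # Bad or expired credentials: retrying the same call won't help.
        return McResearchHardFailureException(f"permanent API error: HTTP {status_code}")
    # Anything we didn't anticipate: be conservative and treat it as hard.
    return McResearchHardFailureException(f"unexpected API error: HTTP {status_code}")
```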

The caller code (e.g. a Celery worker) then catches those different kinds of exceptions. On soft errors it either retries the operation (if that's viable) or rethrows the exception (failing the current Celery job while keeping the worker itself running); on hard errors it exits the worker:

import sys

retries = 3
for retry in range(retries):
    try:
        research_url(url)
    except McResearchSoftFailureException as ex:
        l.info(f"Something has failed while doing research, but I'll retry: {ex}")
        continue
    except McResearchHardFailureException as ex:
        l.error(f"Critical error while doing research, can't continue: {ex}")
        sys.exit(1)
    except Exception as ex:
        # We haven't anticipated this kind of exception, so assume it's a hard error too
        l.error(f"Some other kind of exception happened that we haven't planned for: {ex}")
        sys.exit(1)

    l.info("Research completed, time to get published!")
    break

In my opinion, the soft-hard error approach has the following advantages:

- We don't have to write and maintain a second implementation of each 3rd party API client just for monitoring; the production code itself reports when an API is broken.
- The check exercises the exact code paths (including the border cases) that production uses.
- A worker that has exited on a hard failure is trivially detectable by standard process monitoring, which can then email or alert us.

I rewrote facebook-fetch-story-stats in Python with soft-hard error separation added, to serve as a demo of how this would work:

https://github.com/berkmancenter/mediacloud/pull/640

Please let me know if this approach looks reasonable to you or if you see any downsides to it that I didn't think of.