scrapinghub / spidermon

Scrapy Extension for monitoring spiders execution.
https://spidermon.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Base built-in monitor for stats validation #321

Closed: rennerocha closed this issue 2 years ago

rennerocha commented 2 years ago

Problem

A huge number of monitors follow the same pattern: we have one spider execution statistic and want to validate it against a desired threshold. If we look at the code of some built-in monitors (for example ErrorCountMonitor, ItemCountMonitor, WarningCountMonitor and some others), we can see that they are almost identical.

Adding a new monitor is not too complicated, but we end up with a lot of nearly identical code, which reduces readability and makes it easy to introduce errors (because we often copy-paste to create our monitors).
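
For illustration, the repeated pattern looks roughly like this (a simplified sketch of the current built-ins, not their exact code):

from spidermon import Monitor

class ItemCountMonitor(Monitor):
    def test_minimum_item_count(self):
        # Read a setting, read a stat, assert: the same three steps each time.
        minimum = self.data.spider.settings.getint("SPIDERMON_MIN_ITEMS", 0)
        count = self.data.stats.get("item_scraped_count", 0)
        self.assertGreaterEqual(count, minimum)

class ErrorCountMonitor(Monitor):
    def test_maximum_error_count(self):
        # Same structure again, with a different setting, stat and assertion.
        maximum = self.data.spider.settings.getint("SPIDERMON_MAX_ERRORS", 0)
        errors = self.data.stats.get("log_count/ERROR", 0)
        self.assertLessEqual(errors, maximum)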

Solution

Add a new base monitor class that lets us create new monitors based on a value in the job statistics just by defining a new class with a few class attributes. Doing that, we can reduce the amount of code needed to add a simple monitor like the ones mentioned above.

Draft of solution

Creating a base stats monitor that allows us to define a new custom monitor with five lines of code:

from spidermon import Monitor


class BaseStatsMonitor(Monitor):

    def test_validate_stats(self):
        # Read the threshold from the spider settings (10 is an arbitrary default).
        threshold = self.data.spider.settings.getint(self.threshold_setting, 10)
        value = self.data.stats.get(self.stats_key)
        error_msg = f"Invalid {self.stats_key} value. Expected: {threshold}. Obtained: {value}"
        # Map each supported comparison name to a unittest assertion method.
        assertions = {
            "GT": self.assertGreater,
            "GTE": self.assertGreaterEqual,
            "LT": self.assertLess,
            "LTE": self.assertLessEqual,
            "EQ": self.assertEqual,
            "NEQ": self.assertNotEqual,
        }
        assertion_method = assertions.get(self.assert_type)
        assertion_method(value, threshold, error_msg)

class ItemCountMonitor(BaseStatsMonitor):
    stats_key = "item_scraped_count"
    threshold_setting = "SPIDERMON_MIN_ITEMS"
    assert_type = "GTE"
    description = "Should extract the minimum amount of items"

class ValidationErrorsMonitor(BaseStatsMonitor):
    stats_key = "spidermon/validation/fields/errors"
    threshold_setting = "SPIDERMON_MAX_VALIDATION_ERRORS"
    assert_type = "LTE"
    description = "Limit the number of validation errors"
Gallaecio commented 2 years ago

I like the idea a lot. I do have some feedback on the implementation:

rennerocha commented 2 years ago

Thanks for the feedback, @Gallaecio!

> Do not provide a default threshold, that seems too arbitrary; instead, fail if the setting is missing.

I am considering doing something similar to what Django REST Framework does to get serializers, and using it for the threshold, so we have more flexibility and can support dynamic thresholds if the user needs them:

class BaseStatsMonitor(Monitor):

    def get_threshold(self):
        # Default behavior: read the threshold from the spider settings.
        # Subclasses can override this method to compute it dynamically.
        return self.data.spider.settings.getint(self.threshold_setting)

    def test_validate_stats(self):
        threshold = self.get_threshold()
        ...

That way, we can set the threshold as a setting, or override get_threshold to obtain it in a more dynamic way (for example, using the number of items scraped in the last job).
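
For example, a dynamic threshold could look like this (a sketch; get_last_job_item_count is a hypothetical helper, e.g. one that queries your job storage or the Scrapy Cloud API):

class LastJobItemCountMonitor(BaseStatsMonitor):
    stats_key = "item_scraped_count"
    assert_type = "GTE"
    description = "Should extract at least as many items as the previous job"

    def get_threshold(self):
        # Derive the threshold from the previous execution of this spider
        # instead of reading a static setting (hypothetical helper).
        return get_last_job_item_count(self.data.spider.name)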

I agree about the default value: we should raise NotConfigured if the threshold is not set.
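
A minimal sketch of that behavior, assuming Scrapy's NotConfigured exception is raised from get_threshold:

from scrapy.exceptions import NotConfigured

class BaseStatsMonitor(Monitor):

    def get_threshold(self):
        # Fail loudly when the required setting is missing instead of
        # falling back to an arbitrary default.
        if self.threshold_setting not in self.data.spider.settings:
            raise NotConfigured(
                f"The setting {self.threshold_setting} is required by this monitor"
            )
        return self.data.spider.settings.getint(self.threshold_setting)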

rennerocha commented 2 years ago

Closed by https://github.com/scrapinghub/spidermon/pull/325