scrapinghub / spidermon

Scrapy Extension for monitoring spiders execution.
https://spidermon.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Periodic item count monitor crashes if it runs before item_scraped_count exists in stats #427

Closed: rvandam closed this 7 months ago

rvandam commented 10 months ago

Encountered this error while playing with PeriodicItemCountMonitor

======================================================================
ERROR: Periodic Item Count Increase Monitor/test_stat_monitor
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/vandamr/.venv/46205CF6-02DD-499A-9EB8-71D8CB0FEA73/lib/python3.11/site-packages/spidermon/contrib/scrapy/monitors/base.py", line 213, in test_stat_monitor
    threshold = self._get_threshold_value()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vandamr/.venv/46205CF6-02DD-499A-9EB8-71D8CB0FEA73/lib/python3.11/site-packages/spidermon/contrib/scrapy/monitors/base.py", line 201, in _get_threshold_value
    return self.get_threshold()
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/vandamr/.venv/46205CF6-02DD-499A-9EB8-71D8CB0FEA73/lib/python3.11/site-packages/spidermon/contrib/scrapy/monitors/monitors.py", line 639, in get_threshold
    return prev_item_scraped_count + threshold_increase
           ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

This happens in `get_threshold`: if `item_scraped_count` doesn't exist yet on the first run, `prev_item_scraped_count` gets set to `None`, so on the second run the error above is raised.

    def get_threshold(self):
        crawler = self.data.crawler
        prev_item_scraped_count = self.stats.get("prev_item_scraped_count", 0) # <-- on second run this is then None
        item_scraped_count = self.stats.get(self.stat_name) # <--- if this doesn't exist yet
        crawler.stats.set_value("prev_item_scraped_count", item_scraped_count) # <--- then this gets set to None 
        threshold_increase = crawler.settings.get(self.threshold_setting)
        if isinstance(threshold_increase, int):
            return prev_item_scraped_count + threshold_increase

I think this will probably fix it, but I haven't had a chance to test it yet:

diff --git a/spidermon/contrib/scrapy/monitors/monitors.py b/spidermon/contrib/scrapy/monitors/monitors.py
index b677803..e530633 100644
--- a/spidermon/contrib/scrapy/monitors/monitors.py
+++ b/spidermon/contrib/scrapy/monitors/monitors.py
@@ -632,7 +632,7 @@ class PeriodicItemCountMonitor(BaseStatMonitor):
     def get_threshold(self):
         crawler = self.data.crawler
         prev_item_scraped_count = self.stats.get("prev_item_scraped_count", 0)
-        item_scraped_count = self.stats.get(self.stat_name)
+        item_scraped_count = self.stats.get(self.stat_name, 0)

Alternatively, you could force prev to be zero, e.g. `prev_item_scraped_count = self.stats.get("prev_item_scraped_count", 0) or 0`, but I think setting the stat to `None` instead of 0 in the first place is the real bug.
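With the one-line default from the diff applied, the same simplified sketch runs cleanly across successive checks (again using a plain dict as a stand-in for the crawler stats, not Spidermon's actual objects):

```python
stats = {}

def get_threshold_fixed(stats, threshold_increase=100, stat_name="item_scraped_count"):
    prev_item_scraped_count = stats.get("prev_item_scraped_count", 0)
    # Defaulting to 0 here means the stored value can never be None.
    item_scraped_count = stats.get(stat_name, 0)
    stats["prev_item_scraped_count"] = item_scraped_count
    return prev_item_scraped_count + threshold_increase

assert get_threshold_fixed(stats) == 100          # first run: no items yet
assert stats["prev_item_scraped_count"] == 0      # 0 stored, not None

stats["item_scraped_count"] = 50                  # spider scrapes some items
assert get_threshold_fixed(stats) == 100          # prev was still 0
assert get_threshold_fixed(stats) == 150          # prev is now 50
```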