scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

MessageBusBackend returning null states, breaking SchedulerSpiderMiddleware #338

Closed by anjackson 6 years ago

anjackson commented 6 years ago

I'm attempting to use frontera 0.8.0 with SQLAlchemy and Kafka. I think I've got the strategy and database workers working, but when I try to run the spider I see this:

2018-07-28 09:46:21 [scrapy.core.scraper] ERROR: Spider error processing <GET http://data.webarchive.org.uk/crawl-test-site/> (referer: None)
Traceback (most recent call last):
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/contrib/scrapy/schedulers/frontier.py", line 113, in process_spider_output
    self.frontier.page_crawled(response)  # removed frontier part from .meta
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/utils/managers.py", line 33, in page_crawled
    self.manager.page_crawled(self.response_converter.to_frontier(response))
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/core/manager.py", line 550, in page_crawled
    self.states_context.fetch()
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/core/manager.py", line 815, in fetch
    self.states.fetch(self._fingerprints)
AttributeError: 'NoneType' object has no attribute 'fetch'

I attempted patching this part of the MessageBusBackend to return a no-op states object, but got a different error:

2018-07-28 09:39:26 [scrapy.core.scraper] ERROR: Spider error processing <GET http://data.webarchive.org.uk/crawl-test-site/> (referer: None)
Traceback (most recent call last):
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/contrib/scrapy/schedulers/frontier.py", line 116, in process_spider_output
    self.frontier.links_extracted(response.request, links)
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/utils/managers.py", line 38, in links_extracted
    links=frontier_links)
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/core/manager.py", line 569, in links_extracted
    super(LocalFrontierManager, self).links_extracted_after(request, filtered)
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/core/manager.py", line 292, in links_extracted_after
    links=filtered)
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/core/manager.py", line 128, in _process_components
    return_classes=return_classes, **kwargs)
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/core/manager.py", line 140, in _process_component
    return_obj = getattr(component, method_name)(*([obj] if obj else []), **kwargs)
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/strategy/basic.py", line 17, in links_extracted
    if link.meta[b'state'] == States.NOT_CRAWLED:
KeyError: b'state'

So, I'm clearly not understanding how this should work!
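
One possible reading of the second traceback, sketched with stand-in classes: if the patched-in states object does nothing, nothing ever writes `b'state'` into `link.meta`, so the lookup in the strategy fails. `NoOpStates`, `Link`, `NOT_CRAWLED` and this `links_extracted` are hypothetical illustrations, not Frontera's real classes or the actual patch.

```python
# Toy reproduction of the KeyError; every name here is a hypothetical stand-in.

class NoOpStates:
    """Stand-in for a states object whose methods do nothing."""

    def fetch(self, fingerprints):
        pass

    def set_states(self, requests):
        # A working states component would fill in meta[b'state'] here.
        pass


class Link:
    """Minimal link object carrying a meta dict."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


NOT_CRAWLED = 0  # placeholder value, assumed for illustration


def links_extracted(links):
    # Mirrors the shape of the check at the bottom of the traceback.
    return [link for link in links if link.meta[b'state'] == NOT_CRAWLED]


states = NoOpStates()
links = [Link("http://data.webarchive.org.uk/crawl-test-site/")]
states.set_states(links)  # no-op: meta stays empty

try:
    links_extracted(links)
    error = None
except KeyError as exc:
    error = exc  # KeyError for the missing b'state' key
```

If that reading is right, stubbing out the states object only moves the failure further down the pipeline.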

sibiryakov commented 6 years ago

Hi @anjackson, it's quite likely your code is missing the LOCAL_MODE=False setting (http://frontera.readthedocs.io/en/latest/topics/frontera-settings.html#local-mode). Unfortunately, we forgot to include it in the documentation. This will be fixed in the next few days.
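
A minimal sketch of what the spider's Frontera settings module might look like with that setting. Only `LOCAL_MODE = False` comes from this thread. The module paths and the Kafka address are assumptions based on the Frontera 0.8 package layout, and should be checked against the settings reference.

```python
# Frontera settings for the spider process (sketch, not a verified config).

# From this thread: turn off local mode so the spider talks to the
# strategy and DB workers over the message bus.
LOCAL_MODE = False

# Assumed module path for the backend named in the issue title.
BACKEND = 'frontera.contrib.backends.remote.messagebus.MessageBusBackend'

# Assumed module path for the Kafka message bus.
MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'

# Assumed broker address; replace with your own.
KAFKA_LOCATION = 'localhost:9092'
```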

anjackson commented 6 years ago

Thanks @sibiryakov that did the trick!