scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

`KeyError: b'fingerprint'` raised in `to_fetch` of the StatesContext class #404

Open yujiaao opened 4 years ago

yujiaao commented 4 years ago

https://github.com/scrapinghub/frontera/blob/master/frontera/core/manager.py I am using the 0.8.1 code base in LOCAL_MODE. A KeyError is raised when execution reaches to_fetch in the StatesContext class:

from line 801:

class StatesContext(object):
    ...
    def to_fetch(self, requests):
        requests = requests if isinstance(requests, Iterable) else [requests]
        for request in requests:
            fingerprint = request.meta[b'fingerprint']  # KeyError is raised here

I think the reason is that the meta key b'fingerprint' is read before it has been set:

from line 302:

class LocalFrontierManager(BaseContext, StrategyComponentsPipelineMixin, BaseManager):
    def page_crawled(self, response):
...
        self.states_context.to_fetch(response)  # b'fingerprint' is read here
        self.states_context.fetch()
        self.states_context.states.set_states(response)
        super(LocalFrontierManager, self).page_crawled(response)  # but it is only set here
        self.states_context.states.update_cache(response)
        self.states_context.states.update_cache(response)

from line 233:

class BaseManager(object):          
    def page_crawled(self, response):
...
        self._process_components(method_name='page_crawled',
                                 obj=response,
                                 return_classes=self.response_model)  # b'fingerprint' is set when the pipeline runs here
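
The ordering described above can be reproduced without Frontera. The following is a minimal sketch only: FakeResponse, FakeStatesContext, and fake_pipeline are hypothetical stand-ins (not Frontera classes), and the fingerprint value is a dummy. It shows that reading meta[b'fingerprint'] before the pipeline step raises KeyError, while reading it afterwards does not. It does not show whether reordering the calls is safe in the real manager.

```python
class FakeResponse(object):
    """Hypothetical stand-in for a response; only url and meta are modelled."""
    def __init__(self, url):
        self.url = url
        self.meta = {}


class FakeStatesContext(object):
    """Hypothetical stand-in that keeps only the to_fetch lookup."""
    def __init__(self):
        self._fingerprints = {}

    def to_fetch(self, request):
        # Raises KeyError if the fingerprint has not been set yet.
        fingerprint = request.meta[b'fingerprint']
        self._fingerprints[fingerprint] = request


def fake_pipeline(response):
    """Stand-in for the pipeline step that sets the fingerprint (dummy value)."""
    response.meta[b'fingerprint'] = b'dummy-fingerprint'


def crawl_in_reported_order(response):
    """Read the fingerprint first, as in the reported call order."""
    ctx = FakeStatesContext()
    try:
        ctx.to_fetch(response)
    except KeyError as exc:
        return exc.args[0]  # the missing key
    fake_pipeline(response)
    return None


def crawl_in_swapped_order(response):
    """Run the pipeline stand-in first, then read the fingerprint."""
    ctx = FakeStatesContext()
    fake_pipeline(response)
    ctx.to_fetch(response)
    return list(ctx._fingerprints)
```

Calling crawl_in_reported_order(FakeResponse('http://example.com')) returns the missing key instead of crashing, which makes the failure easy to assert on in a test.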

My current workaround is to add a check to the to_fetch method of the StatesContext class:

    def to_fetch(self, requests):
        requests = requests if isinstance(requests, Iterable) else [requests]
        for request in requests:
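            if b'fingerprint' not in request.meta:
                # NOTE: sha1 must be a helper that returns a digest;
                # hashlib.sha1 itself returns a hash object and needs bytes.
                request.meta[b'fingerprint'] = sha1(request.url)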
            fingerprint = request.meta[b'fingerprint']
            self._fingerprints[fingerprint] = request
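
A self-contained version of this workaround can be sketched as follows. The names url_fingerprint, FakeRequest, and GuardedStatesContext are hypothetical and exist only for illustration. The helper assumes a SHA-1 hex digest of the raw URL encoded as bytes, which may not match what the fingerprint component in the pipeline would compute for the same URL.

```python
import hashlib
from collections.abc import Iterable


def url_fingerprint(url):
    """Hypothetical helper: SHA-1 hex digest of the raw URL, as bytes."""
    return hashlib.sha1(url.encode('utf-8')).hexdigest().encode('ascii')


class FakeRequest(object):
    """Hypothetical stand-in for a request/response; only url and meta."""
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = {} if meta is None else meta


class GuardedStatesContext(object):
    """Stand-in for StatesContext with the fallback check in to_fetch."""
    def __init__(self):
        self._fingerprints = {}

    def to_fetch(self, requests):
        requests = requests if isinstance(requests, Iterable) else [requests]
        for request in requests:
            # Fall back to a locally computed fingerprint only when missing.
            if b'fingerprint' not in request.meta:
                request.meta[b'fingerprint'] = url_fingerprint(request.url)
            fingerprint = request.meta[b'fingerprint']
            self._fingerprints[fingerprint] = request
```

An existing fingerprint is left untouched, so the fallback only applies to objects that reach to_fetch before the pipeline has processed them. If the pipeline later computes a different fingerprint for the same URL, the two values would disagree, which is one reason this is a workaround rather than a fix.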

What is the correct way to fix this?