scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

Error processing robots.txt - no body #395

Closed: Prometheus3375 closed 4 years ago

Prometheus3375 commented 4 years ago

I am using the discovery strategy. This is the DB worker's database after adding seeds; the other tables are empty.

[image: the DB worker's database after adding seeds]

Once the strategy worker, spider, and DB worker are started, crawling begins. Since the robots.txt files are scheduled, they are processed first. Unfortunately, the robots.txt response has a body equal to None for some reason.

2020-06-10 00:09:18,717 DEBUG    strategy-worker Page crawled https://www.afp.com/robots.txt
2020-06-10 00:09:18,717 DEBUG    discovery       PC https://www.afp.com/robots.txt [200] (seed: https://www.afp.com/fr)
2020-06-10 00:09:18,718 ERROR    strategy-worker Exception during processing
Traceback (most recent call last):
  File "E:\Workspace\Projects\Parsing\env3.7\lib\site-packages\frontera\worker\strategy.py", line 55, in process
    self._on_page_crawled(response)
  File "E:\Workspace\Projects\Parsing\env3.7\lib\site-packages\frontera\worker\strategy.py", line 120, in _on_page_crawled
    self.strategy.page_crawled(response)
  File "E:\Workspace\Projects\Parsing\env3.7\lib\site-packages\frontera\strategy\discovery\__init__.py", line 244, in page_crawled
    self._process_robots_txt(response, domain)
  File "E:\Workspace\Projects\Parsing\env3.7\lib\site-packages\frontera\strategy\discovery\__init__.py", line 312, in _process_robots_txt
    body = response.body.decode('utf-8')  # TODO: use encoding from response.meta.get(b'encoding', 'utf-8')
AttributeError: 'NoneType' object has no attribute 'decode'

This error is raised for all 3 seeds. What is the reason for this behaviour?

Prometheus3375 commented 4 years ago

Here is the project archive. The DB worker and the strategy worker now share the same database. The issue is still present even when they use separate databases.

Prometheus3375 commented 4 years ago

The body is None because STORE_CONTENT is false by default.
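For anyone hitting the same traceback, a minimal sketch of the change follows. It assumes the Frontera components load a Python settings module; which of your settings files need the flag depends on your setup, so treat the file name in the comment as a placeholder.

```python
# Frontera settings module (file name is an assumption, e.g. settings.py).
# STORE_CONTENT is False by default, so response bodies are not stored or
# passed on, and the strategy receives response.body as None.
STORE_CONTENT = True
```

Note that the Frontera documentation warns that enabling STORE_CONTENT has a performance cost, because page content is then sent over the message bus.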

I think this should be mentioned in the description of the discovery strategy.
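Independently of the setting, the failing line in the traceback could be guarded against a missing body. The sketch below is only an illustration: `decode_robots_body` is a hypothetical helper, not part of Frontera, and how the strategy should react to a missing body is left open.

```python
def decode_robots_body(body, encoding="utf-8"):
    """Hypothetical helper: decode a robots.txt body, tolerating a missing one.

    Returns None when no body was delivered (for example, when content
    storage is disabled), instead of raising AttributeError on None.
    """
    if body is None:
        return None
    # Replace undecodable bytes rather than failing on a bad encoding.
    return body.decode(encoding, errors="replace")
```

A caller could then check for `None` and skip robots.txt processing, or log a hint that STORE_CONTENT is disabled.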