scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

How to configure with Scrapy CrawlSpider #344

Closed villeristi closed 1 year ago

villeristi commented 6 years ago

The instructions in the official documentation on using Frontera with Scrapy throw an exception when used with a CrawlSpider.

spider-code:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TestSpider(CrawlSpider):
    name = 'testspider'
    start_urls = ['https://example.com']
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # some code here...
        pass

Exception thrown:

File "/usr/local/lib/python3.6/site-packages/frontera/contrib/scrapy/schedulers/frontier.py", line 112, in process_spider_output
    frontier_request = response.meta[b'frontier_request']
KeyError: b'frontier_request'

So, how would one use Frontera properly with an existing Scrapy project?

Cheers, this definitely looks awesome!

villeristi commented 6 years ago

Closing, read through the docs (which could be organized better)

mautini commented 6 years ago

Hi,

I've got the same issue. I looked through the docs but couldn't find an answer. Can you help me, @villeristi? How did you solve the problem, and can you provide a link to the documentation that explains it?

Thanks in advance

pdeboer commented 5 years ago

Hi, same issue here. @mautini, did you happen to figure it out already?

sibiryakov commented 5 years ago

The idea is that Scrapy shouldn't be scheduling any links, only parsing and extracting. All the scheduling logic should be implemented in the crawling strategy.

Example: https://github.com/scrapinghub/frontera/blob/master/examples/cluster/bc/spiders/bc.py
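
For reference, the linked example boils down to roughly the following pattern: a plain Spider with no start_urls and no CrawlSpider rules, whose parse() only extracts links and yields them, leaving seeding and scheduling to the crawling strategy. This is a condensed sketch, not a verbatim copy of bc.py:

from scrapy import Request, Spider
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(Spider):
    name = 'example'
    # No start_urls and no rules: seeds are injected through the frontier,
    # and the Frontera scheduler decides what actually gets fetched.

    def parse(self, response):
        if not isinstance(response, HtmlResponse):
            return
        for link in LinkExtractor().extract_links(response):
            # Just emit the link; the Frontera scheduler intercepts it and
            # routes it through the crawling strategy instead of scheduling
            # it in Scrapy directly.
            yield Request(url=link.url, meta={'link_text': link.text})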

mautini commented 5 years ago

Hi @pdeboer,

I finally found a solution. As @sibiryakov mentioned, you must not provide links to Scrapy directly. So start by removing start_urls from your spiders.

Next, you need to configure a backend so that Frontera can send URLs to Scrapy for fetching. For this purpose, in your Frontera settings, change the backend to frontera.contrib.backends.sqlalchemy.Distributed (apparently, the tutorial does not work with MemoryDistributedBackend) and set SQLALCHEMYBACKEND_ENGINE = 'sqlite:///<<fileName>>.db' to persist the backend state (queues, etc.) in a file. Otherwise it lives only in memory and is lost after the seed generation step.
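
For reference, the relevant part of the Frontera settings module would then look roughly like this (the database file name is a placeholder, not from the original post):

# Frontera settings module (sketch)
BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'
# Persist queue and metadata state in a SQLite file instead of the default
# in-memory database, so it survives between the seeding step and the crawl.
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontera.db'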

Now, generate the database using the add_seeds script (step 6 here: https://frontera.readthedocs.io/en/latest/topics/quick-start-single.html?highlight=add%20seeds).

You can then start the crawler; it should work!
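
For example, assuming the spider from the original post and a Scrapy project wired up with the Frontera scheduler as described in the docs:

scrapy crawl testspider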

dkipping commented 4 years ago

Hi all!

I am getting the same error as @villeristi did initially: KeyError: b'frontier_request' as a spider response processing error.

Quick setup explanation: I mostly followed the distributed quick-start setup and config (Scrapy with Frontera), and I am trying to use scrapy-selenium with it.

In line with @sibiryakov's example, the spider also just yields requests in the parse function; however, we use SeleniumRequest from scrapy-selenium.

Requests are yielded in the parse() function and in the start_requests() function.

Are we also meant to avoid yielding requests in the start_requests function? Or could the SeleniumRequest be causing it? Or is there something else in the configuration/settings that is crucial here?
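
A condensed sketch of the setup described above (class and URL names are illustrative, not taken from the actual project): requests are yielded both in start_requests() and parse(), using SeleniumRequest from scrapy-selenium instead of a plain scrapy.Request.

from scrapy import Spider
from scrapy.linkextractors import LinkExtractor
from scrapy_selenium import SeleniumRequest


class SeleniumFronteraSpider(Spider):
    name = 'selenium_frontera'

    def start_requests(self):
        # Yielding a seed here is exactly what the question above is about --
        # with Frontera, seeds are normally injected via the frontier instead.
        yield SeleniumRequest(url='https://example.com', callback=self.parse)

    def parse(self, response):
        for link in LinkExtractor().extract_links(response):
            yield SeleniumRequest(url=link.url, callback=self.parse)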

More details in https://github.com/scrapinghub/frontera/issues/401 (opened as I did not find this issue before, happy to move or close)

Thanks for all reactions and input! :)