Closed villeristi closed 1 year ago
Closing, read through the docs (which could be organized better)
Hi,
I've got the same issue. I looked into the docs but couldn't find an answer. Can you help me, @villeristi? How did you solve the problem, and can you provide a link to the documentation explaining the issue?
Thanks in advance
Hi, same issue here. @mautini, did you happen to figure it out already?
The idea is that Scrapy shouldn't be scheduling any links, only parsing and extracting. All the scheduling logic should be implemented in the crawling strategy.
Example: https://github.com/scrapinghub/frontera/blob/master/examples/cluster/bc/spiders/bc.py
Hi @pdeboer,
I finally found a solution. As @sibiryakov mentioned, you must not provide links to Scrapy directly, so start by removing start_urls from your spiders.
Next, you must configure a backend that allows Frontera to send the URLs to fetch to Scrapy. For this purpose, in your Frontera settings, change the backend to frontera.contrib.backends.sqlalchemy.Distributed
(apparently, the tutorial does not work with MemoryDistributedBackend) and set SQLALCHEMYBACKEND_ENGINE = 'sqlite:///<<fileName>>.db' to persist your backend state (and queues...) in a file. Otherwise it stays in memory and is lost after the run.
Now, generate the database using the add_seeds script (step 6 here: https://frontera.readthedocs.io/en/latest/topics/quick-start-single.html?highlight=add%20seeds).
You can now start the crawler; it should work!
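Putting the steps above together, a Frontera settings module along these lines is what's being described. This is a sketch, not the full configuration from the tutorial, and the sqlite filename is a placeholder you choose yourself:

```python
# frontier_settings.py -- sketch of the Frontera settings described above.
# Only the two settings mentioned in this thread are shown; the tutorial's
# other settings still apply.

# Use the distributed SQLAlchemy backend instead of MemoryDistributedBackend.
BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'

# Persist the backend state (queues, metadata, ...) in a file.
# Without this it lives in memory and is lost when the process exits.
# 'frontier.db' is a placeholder filename.
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontier.db'
```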
Hi all!
I am getting the same error as @villeristi initially: KeyError: b'frontier_request' as a spider response processing error.
Quick setup explanation: I mostly followed the distributed quick-start setup and config (Scrapy with Frontera) and am trying to use scrapy-selenium with it.
In line with @sibiryakov's example, the spider is also just yielding requests in the parse function; however, we use the SeleniumRequest from scrapy-selenium.
Requests are yielded in the parse() function and in the start_requests() function.
Are we also meant to avoid yielding requests in the start_requests function? Or could the SeleniumRequest be causing it? Or is there something else crucial in the configuration/settings?
More details in https://github.com/scrapinghub/frontera/issues/401 (opened as I did not find this issue before, happy to move or close)
Thanks for all reactions and input! :)
The instructions in the official documentation about using Frontera with Scrapy throw an exception with CrawlSpider.
Spider code:
Exception thrown:
So, how would one use Frontera properly with existing Scrapy-project?
Cheers, this definitely looks awesome!