scrapinghub / scrapyrt

HTTP API for Scrapy spiders
BSD 3-Clause "New" or "Revised" License

CrawlSpider in scrapyrt #154

Closed NanZhang715 closed 9 months ago

NanZhang715 commented 9 months ago

Hi,

I built a spider with the CrawlSpider class.

When I call

curl "localhost:9080/crawl.json?spider_name=web_spider&url=https://www.example.com"

the URL is not passed to the spider; the error log shows start_urls is [[]]. However, the command below works fine:

scrapy runspider web_spider.py -a url=https://www.example.com

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class webSpider(CrawlSpider):
    name = "web_spider"
    start_urls = []

    rules = (
        Rule(LinkExtractor(), callback='parse', follow=True, errback='error_back'),
        # Rule(LinkExtractor(), callback='parse', follow=False, process_request=set_playwright_true),
    )

    def __init__(self, *args, **kwargs):
        # super().__init__ is required here: CrawlSpider compiles its rules in it.
        super().__init__(*args, **kwargs)
        # When 'url' is absent from kwargs, the [] default gets wrapped in
        # another list, which is what produces the start_urls == [[]] in the log.
        self.start_urls = [kwargs.get('url', [])]
        print('start_urls', self.start_urls)

    def parse(self, response):
        pass

    def error_back(self, failure):
        pass

Does scrapyrt support CrawlSpider?

Thanks

pawelmhm commented 9 months ago

It should support any spider. In your case you rely on executing start_requests, which is disabled by default, but you can enable it; see the docs:

https://scrapyrt.readthedocs.io/en/latest/api.html#scrapyrt-http-api

Whether spider should execute Scrapy.Spider.start_requests method. start_requests are executed by default when you run Scrapy Spider normally without ScrapyRT, but this method is NOT executed in API by default. By default we assume that spider is expected to crawl ONLY url provided in parameters without making any requests to start_urls defined in Spider class. start_requests argument overrides this behavior. If this argument is present API will execute start_requests Spider method.
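For example, assuming the spider above is registered as web_spider, a request along these lines should enable start_requests (a sketch based on the docs quoted above, not tested):

curl "localhost:9080/crawl.json?spider_name=web_spider&url=https://www.example.com&start_requests=true"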

NanZhang715 commented 9 months ago

Thank you, the issue is solved by passing the URL with crawl_args.
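For anyone landing here with the same problem, a minimal sketch of that fix, assuming crawl_args accepts URL-encoded JSON that scrapyrt forwards to the spider's __init__ (curl -G with --data-urlencode handles the escaping):

curl -G "localhost:9080/crawl.json" \
    --data-urlencode 'spider_name=web_spider' \
    --data-urlencode 'url=https://www.example.com' \
    --data-urlencode 'crawl_args={"url": "https://www.example.com"}'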