scrapinghub / scrapyrt

HTTP API for Scrapy spiders
BSD 3-Clause "New" or "Revised" License

Deciding which spider to run based on arguments #126

Closed avlm closed 3 years ago

avlm commented 3 years ago

Hey! I would like to know if there's a way to write a custom resource (or something similar) to select which spider to run based on the argument received in the request.

I saw this issue #29, and I'm planning to use the branch gdelfresno:pass-arguments-to-spider from PR #72 for that.

I think the concern @pawelmhm raised in that issue will no longer be a problem, since my endpoint will only be accessible inside a Kubernetes cluster and the requests will be made by another project.

pawelmhm commented 3 years ago

Hello @avlm, yes, you can do it. spider_name is a required argument because we must choose a spider to run; if you don't pass a spider name you get a 400 Bad Request HTTP status.

To bypass this validation you need to override the resource: create your own resource and point to it in your settings module

# quoteconfig.py
RESOURCES = {
    'crawl.json': 'quotesbot.resources.PickSpider'
}

Then add the resource. This simply copies the code from the default crawl resource and adds a random.choice of the spider name; we could improve this on our side to make things easier.

# quotesbot/resources.py
import random

from scrapyrt.resources import CrawlResource
from scrapyrt.utils import extract_scrapy_request_args

class PickSpider(CrawlResource):
    def render_GET(self, request, **kwargs):
        api_params = dict(
            (name.decode(), value[0].decode())
            for name, value in request.args.items()
        )
        scrapy_request_args = extract_scrapy_request_args(api_params,
                                                          raise_error=False)
        self.validate_options(scrapy_request_args, api_params)

        # pick spider name and add it to api params
        api_params['spider_name'] = random.choice(['toscrape-css', 'toscrape-xpath'])

        return self.prepare_crawl(api_params, scrapy_request_args, **kwargs)
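Since the original question was about choosing a spider based on a request argument (rather than at random), a minimal sketch of a selection helper could replace the random.choice line. Note this is an assumption, not part of scrapyrt: the `site` query parameter and the mapping below are hypothetical names for illustration.

```python
def pick_spider_name(api_params, default='toscrape-css'):
    """Map a hypothetical 'site' request argument to a spider name.

    Falls back to `default` when the argument is missing or unknown.
    """
    mapping = {
        'css': 'toscrape-css',
        'xpath': 'toscrape-xpath',
    }
    return mapping.get(api_params.get('site'), default)


# Inside render_GET, instead of random.choice:
# api_params['spider_name'] = pick_spider_name(api_params)
```

With this, a request like `/crawl.json?site=xpath&url=...` would run `toscrape-xpath`, while an unrecognized or missing `site` falls back to the default spider.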

Now when you run scrapyrt in your project directory (my project is called quotesbot), pass the configuration with the -S flag and it will pick a random spider name as in my example.

> scrapyrt -S quoteconfig

https://scrapyrt.readthedocs.io/en/latest/api.html#configuration

avlm commented 3 years ago

Thanks @pawelmhm!