scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Use Scrapyd with arguments #133

Closed: ricoxor closed this issue 8 years ago

ricoxor commented 8 years ago

I'm using Scrapyd to run Scrapy as a web service.

I would like to use the curl command with parameters like this:

curl http://myip:6800/schedule.json -d project=default -d spider=myspider -d domain=www.google.fr

But I don't know how to access the domain parameter inside the spider.

import scrapy
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MyItem(Item):
    url = Field()

class HttpbinSpider(CrawlSpider):
    name = "expired"
    start_urls = [domain]  # 'domain' is undefined here; this is the value I want to receive from curl

Sometimes I need to pass a single domain as an argument, sometimes several.

Thanks!

Digenis commented 8 years ago

It's not possible due to a missing feature in Scrapy. Users typically work around this by serializing the arguments before passing them to curl and then deserializing them in the spider's __init__().

E.g. curl http://myip:6800/schedule.json -d project=default -d spider=myspider -d domains='["www1.example.com", "www2.example.com"]'

import json
from scrapy import Spider

class MySpider(Spider):
    def __init__(self, domains=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # 'domains' arrives as the raw JSON string from schedule.json
        self.domains = json.loads(domains) if domains else []
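
To connect this back to the original question, here is a minimal self-contained sketch (my own illustration, not from the thread; the spider name and the http:// scheme are placeholder assumptions) that turns the passed domains into start requests instead of a hardcoded start_urls:

import json

from scrapy import Request, Spider

class DomainsSpider(Spider):
    name = "myspider"  # must match the spider= value in the curl call

    def __init__(self, domains=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # deserialize the JSON list passed as a spider argument
        self.domains = json.loads(domains) if domains else []

    def start_requests(self):
        # issue one initial request per domain passed on the command line
        for domain in self.domains:
            yield Request("http://%s/" % domain, callback=self.parse)

    def parse(self, response):
        # placeholder: real extraction logic would go here
        self.logger.info("Visited %s", response.url)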

Another option is the pickle module, which lets you pass arbitrary Python objects to your spider, but it makes building the curl arguments more complicated.
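
For illustration only (none of this is from the thread; the URL-safe base64 wrapping is my assumption, added so the pickled bytes survive form-encoded POST data), the round trip could look like this:

import base64
import pickle

# client side, before calling curl: pickle the object, then base64-encode it
payload = base64.urlsafe_b64encode(
    pickle.dumps(["www1.example.com", "www2.example.com"])
).decode("ascii")
# then: curl http://myip:6800/schedule.json -d project=default \
#       -d spider=myspider -d domains=<payload>

# spider side, in __init__, reverse both steps:
# self.domains = pickle.loads(base64.urlsafe_b64decode(domains))

Keep in mind that unpickling data is only safe when you control both ends, which is why the JSON approach above is usually preferable.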

Closing because it duplicates #61 and the issue tracker is not for support. Ask for help in the community channels if my answer didn't cover your case.