scrapinghub / scrapyrt

HTTP API for Scrapy spiders
BSD 3-Clause "New" or "Revised" License

Saving scraped items in a feed #147

Open runa opened 1 year ago

runa commented 1 year ago

Hi! Thanks for your work on Scrapyrt!

I've discovered that spiders served by Scrapyrt don't save their output to the feeds configured under FEEDS in the spider's custom_settings. Is it possible to change this behavior so that a spider served by Scrapyrt respects this setting?

Thanks!

pawelmhm commented 4 months ago

@runa can you add some sample code to reproduce this, along with more details? I tested with this simple spider:


import scrapy

class ToScrapeCSSSpider(scrapy.Spider):
    name = "toscrape-css"
    start_urls = [
        'http://quotes.toscrape.com/',
    ]
    custom_settings = {
        'FEEDS': {
            'items.json': {
                'format': 'json'
            }
        }
    }

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css("span.text::text").extract_first(),
                'author': quote.css("small.author::text").extract_first(),
                'tags': quote.css("div.tags > a.tag::text").extract()
            }

        next_page_url = response.css("li.next > a::attr(href)").extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

and when it is scheduled with ScrapyRT:

curl --location 'http://localhost:9080/crawl.json' \
--header 'Content-Type: application/json' \
--data '{
    "request": {
        "url": "https://quotes.toscrape.com/"
    },
    "spider_name": "toscrape-css"
}'
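For anyone who wants to script this check, the same request could be sent from Python roughly as follows. This is a minimal sketch using only the standard library: the helper names build_crawl_request and send_crawl_request are made up for illustration, the endpoint and payload are copied from the curl command above, and the "status" and "items" keys mentioned in the usage comment are an assumption about the response body.

```python
import json
import urllib.request


def build_crawl_request(spider_name, url,
                        endpoint="http://localhost:9080/crawl.json"):
    """Build the POST request that mirrors the curl command above."""
    payload = {
        "request": {"url": url},
        "spider_name": spider_name,
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_crawl_request(req, timeout=60):
    """Send the request and decode the JSON body (hypothetical helper)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Usage, assuming ScrapyRT is running locally on port 9080:
#   req = build_crawl_request("toscrape-css", "https://quotes.toscrape.com/")
#   result = send_crawl_request(req)
#   print(result.get("status"), len(result.get("items", [])))
```

Building the request separately from sending it makes it easy to inspect the payload without a running server.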

I see that an items.json file is generated in the spider project's directory. Is there a specific feed that is failing for you?
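To confirm that the feed was actually written, something like the following could be run from the project directory after the crawl finishes. It is only a sketch: load_feed_items is a hypothetical helper, and it assumes the feed is the single JSON array that 'format': 'json' is expected to produce at the relative path items.json.

```python
import json
from pathlib import Path


def load_feed_items(feed_path="items.json"):
    """Return the items stored in a JSON feed file.

    Raises FileNotFoundError if the feed was never written, and
    json.JSONDecodeError if the file is not one valid JSON document.
    """
    path = Path(feed_path)
    if not path.is_file():
        raise FileNotFoundError(f"feed file not found: {path.resolve()}")
    text = path.read_text(encoding="utf-8")
    if not text.strip():
        # An empty file means the feed was opened but no items were exported.
        return []
    return json.loads(text)


# Usage, run from the directory where the feed is expected:
#   items = load_feed_items("items.json")
#   print(len(items), "items in feed")
```

If this raises FileNotFoundError, the path in the error message shows where the file was expected, which helps when the working directory of the ScrapyRT process differs from the project root.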