scrapinghub / scrapyrt

HTTP API for Scrapy spiders
BSD 3-Clause "New" or "Revised" License
832 stars, 162 forks

Search Page returns empty through scrapyrt only #116

Open keyiyek opened 3 years ago

keyiyek commented 3 years ago

(Sorry, I can't find how to label this.) I hope this is the right place to ask.

I created a spider that can scrape a page in an e-commerce site and gather the data on the different items. The spider works fine with specific pages of the site (www.sitedomain/123-item-category), as well as with the search page (www.sitedomain/searchpage?controller?search=keywords+item+to+be+found).

But when I run it through scrapyrt, the specific page works fine while the search page returns 0 items. No errors, just 0 items. This occurs on 2 different sites with 2 different spiders.

Is there something specific to search pages that has to be taken into account when using scrapyrt?

pawelmhm commented 3 years ago

Can you post your spider code? I don't see a way to reproduce it without the spider code. Try to pinpoint the problem so that there is a small code sample of a spider running in raw ScrapyRT (without any middlewares, pipelines, or other stuff from your project interfering). That way we can see whether this is a problem on the ScrapyRT side.

keyiyek commented 3 years ago

yes, sure.

So, my spider, stripped of all other stuff, looks like this:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "minimal"

    def start_requests(self):
        urls = [
            "https://www.dungeondice.it/ricerca?controller=search&s=ticket+to+ride",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Each search result is expected to be an <article> element.
        print("Found ", len(response.css("article")), " items")
        for article in response.css("article"):
            print("Item: ", article.css("img::attr(title)").get())
```

and I set `ROBOTSTXT_OBEY = False`

when I do

scrapy crawl minimal

I get 20 items in the response, but if I go

curl "http://localhost:9081/crawl.json?spider_name=minimal&url=https://www.dungeondice.it/ricerca?controller=search&s=ticket+to+ride"

I get 0 items, no error, just 0 items. I wonder if it somehow returns the results before the page has completely loaded?

(sorry couldn't get the markup to work correctly)

Yansuko commented 2 years ago

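It seems that when there is an '&' in the url, scrapyrt splits it right before the '&'. In the curl command above, that would leave the spider with only `https://www.dungeondice.it/ricerca?controller=search`, and `s=ticket+to+ride` would be read as a separate parameter of the scrapyrt request.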
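
The split can be reproduced with Python's standard query-string parser. This is only a minimal sketch: it assumes scrapyrt's endpoint follows ordinary query-string rules (the thread does not show its internals), and the variable names are made up for illustration. Percent-encoding the target URL before putting it in the `url` parameter keeps the '&' inside that parameter.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Target search page from the report above.
target = "https://www.dungeondice.it/ricerca?controller=search&s=ticket+to+ride"

# Unencoded, as in the curl command: the '&' inside the target
# ends the `url` parameter early.
raw = "http://localhost:9081/crawl.json?spider_name=minimal&url=" + target
raw_params = parse_qs(urlsplit(raw).query)
print(raw_params["url"][0])  # target truncated right before '&s=...'
print(raw_params["s"][0])    # search terms end up in a stray `s` parameter

# Encoded: urlencode percent-encodes '&', '?', '=' and '+' in the value,
# so the whole target survives as a single `url` parameter.
encoded = "http://localhost:9081/crawl.json?" + urlencode(
    {"spider_name": "minimal", "url": target}
)
enc_params = parse_qs(urlsplit(encoded).query)
print(encoded)
print(enc_params["url"][0] == target)
```

The printed `encoded` URL can then be passed to curl in place of the original one.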