scrapinghub / scrapinghub-entrypoint-scrapy

Scrapy entrypoint for Scrapinghub job runner
BSD 3-Clause "New" or "Revised" License

sh_scrapy.extension - Wrong item type: None #61

Closed thekage91 closed 3 years ago

thekage91 commented 3 years ago

I'm trying to run my CrawlSpider on Zyte, but I keep hitting a very annoying error:

[sh_scrapy.extension] Wrong item type: None

I followed the documentation to create a crawler that extracts all links from a specific web page, but when I start the job on Zyte, the scraper sends the request correctly and then immediately returns this error.

The code that creates this CrawlSpider is very simple and minimal; this is the part responsible for creating the Rule and the LinkExtractor instance:

self.restrict_css = [self.selector_item]
if self.selector_next_page:
    self.restrict_css.append(self.selector_next_page)

self.rules = (
    Rule(
        LinkExtractor(
            deny_extensions=["css", "js"],
            unique=True,
            restrict_css=self.restrict_css,
            process_value=lambda value: check_noindex_nofollow(value),
        ),
        process_links="ignore_nofollow_noindex",
        callback="parse",
        follow=True,
    ),
)

Basically

self.restrict_css = [self.selector_item]
if self.selector_next_page:
    self.restrict_css.append(self.selector_next_page)

creates a list with one element if the site doesn't have a next page, or two elements if it does. This list limits the crawling to a specific part of the site, which is exactly what

restrict_css=self.restrict_css

does.

The parse callback is:

def parse(
    self,
    response,
):
    item = ItemLoader(item=PageLink(), response=response)
    item.add_css("name", "title::text")
    item.add_value("url", response.url)
    item.add_css("image", "img::attr(src)")
    item.add_value("depth", response.meta["depth"])
    item.add_value("timestamp", self.timestamp)
    yield item.load_item()

PageLink is a scrapy.Item declared in a separate file and imported into the CrawlSpider, so the class knows about it.

If I start the scraper with a specific link and specific CSS rules, it immediately returns [sh_scrapy.extension] Wrong item type: None after sending the requests, and I don't know why. The only thing I found is the line of code that fires this error.
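For context, the check behind that message behaves roughly like this. This is a simplified stand-in, not the actual sh_scrapy source (the real extension also accepts scrapy.Item instances, not just dicts): it fires whenever something that is neither a dict nor an Item reaches it, and yielding None produces exactly the message above.

```python
def check_item_type(item):
    """Simplified mimic of the sh_scrapy.extension type check.

    Returns the error message for unsupported item types, or None
    when the item is acceptable. (Hypothetical helper name; the real
    extension also treats scrapy.Item subclasses as valid.)
    """
    if not isinstance(item, dict):
        # None fails this check too, yielding "Wrong item type: None"
        return "Wrong item type: %s" % item
    return None
```

So the error means the extension received a None where it expected a scraped item.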

Has anyone experienced this before? How can I resolve this very annoying problem?

Thank you very much

thekage91 commented 3 years ago

The problem was caused by a MongoPipeline that didn't return the item from its process_item method. Sorry for the trouble.
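For anyone landing here with the same error: every item pipeline's process_item must return the item (or raise DropItem); if it falls through and implicitly returns None, that None is what downstream components, including the sh_scrapy extension, receive. A minimal sketch of the fix (the class and the insert call are illustrative, not the original pipeline):

```python
class MongoPipeline:
    """Sketch of a pipeline that correctly forwards the item.

    Omitting the final return statement is what caused the
    "Wrong item type: None" error in this issue.
    """

    def process_item(self, item, spider):
        # Persist the item here; in a real pipeline this would be
        # something like: self.collection.insert_one(dict(item))
        return item  # the missing line -- without it, None propagates
```

With the return in place, the item flows on to the next pipeline and to Scrapy Cloud's item storage as expected.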