scrapinghub / scrapinghub-entrypoint-scrapy

Scrapy entrypoint for Scrapinghub job runner
BSD 3-Clause "New" or "Revised" License

Missing `Parent Request #`, `Duration`, and `Response Size` fields #78

Open BurnzZ opened 7 months ago

BurnzZ commented 7 months ago

Currently, requests coming from scrapy_zyte_api.providers.ZyteApiProvider don't get the Parent Request # field in Scrapy Cloud.

image

In the example above, Request 1 should have a Parent Request # field which is missing.

Note that reverting the changes from PR https://github.com/scrapinghub/scrapinghub-entrypoint-scrapy/pull/73/ brings the Parent Request # field back. The field comes from the other request, which is the one that gets filtered out in the new scrapinghub-entrypoint-scrapy version.

It seems that after one of the duplicate requests is filtered out, the value set by request.meta.setdefault(HS_PARENT_ID_KEY) should somehow be copied to the other request (code ref).
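
A rough sketch of that idea, using plain dicts in place of `request.meta`. This is not the actual entrypoint code: `inherit_parent_id` is a hypothetical helper, and the `"_hsparent"` value used for `HS_PARENT_ID_KEY` here is an assumption made only so the snippet runs on its own.

```python
# Assumed stand-in for the entrypoint's HS_PARENT_ID_KEY constant;
# the real value may differ.
HS_PARENT_ID_KEY = "_hsparent"


def inherit_parent_id(kept_meta: dict, filtered_meta: dict,
                      key: str = HS_PARENT_ID_KEY) -> dict:
    """Copy the parent request ID from the filtered request's meta
    into the meta of the request that is kept (hypothetical helper)."""
    parent_id = filtered_meta.get(key)
    if parent_id is not None:
        # setdefault leaves an existing parent ID on the kept request untouched.
        kept_meta.setdefault(key, parent_id)
    return kept_meta


# The filtered duplicate carries the parent ID; the kept request does not.
filtered = {HS_PARENT_ID_KEY: 0}
kept = {}
inherit_parent_id(kept, filtered)
print(kept)
```

Where such a copy would hook into the middleware, and whether the kept request is reachable at the point the duplicate is dropped, is something I haven't verified.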

Reproducible example:

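import scrapy
from scrapy_poet import DummyResponse
from zyte_common_items import Product, ProductNavigation


class ParentSpider(scrapy.Spider):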
    name = "parent"

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com",
            callback=self.parse_nav,
        )

    def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
        for request in navigation.items:
            yield request.to_scrapy(
                callback=self.parse_item,
            )

    def parse_item(self, response: DummyResponse, product: Product):
        yield product

BurnzZ commented 6 months ago

Note that Duration and Response Size are also missing.