scrapy-plugins / scrapy-zyte-api

Zyte API integration for Scrapy
BSD 3-Clause "New" or "Revised" License

Some discrepancies in request/response stats #112

Closed BurnzZ closed 8 months ago

BurnzZ commented 1 year ago

Starting with scrapy-zyte-api >= 0.9.0, we can now request item extraction from Zyte API directly in the spider:

import scrapy
from scrapy_poet import DummyResponse
from zyte_common_items import Product

class SampleSpider(scrapy.Spider):
    name = "sample"

    def start_requests(self):
        yield scrapy.Request(
            "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
            self.parse_product,
        )

    def parse_product(
        self,
        response: DummyResponse,
        product: Product,
    ):
        yield product

In the Scrapy crawl stats, we have these:

{
 ...
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'response_received_count': 2,
 'scrapy-zyte-api/attempts': 1,
 'scrapy-zyte-api/status_codes/200': 1,
 'scrapy-zyte-api/success': 1,
 'scrapy_poet/dummy_response_count': 1,
 ...
}

These are kind of correct, since DummyResponse is technically a response, so our spider would have received 2 responses (i.e. DummyResponse and Product). However, I'd probably prefer a new downloader/dummy_response_count stat in lieu of the existing scrapy_poet/dummy_response_count, just so that the related stats sit closer to one another (which helps when visually inspecting them):

{
 ...
 'downloader/dummy_response_count': 1,
 'downloader/response_count': 2,
 ...
}

Another discrepancy shows up in Scrapy Cloud, where it appears that we made 2 requests:

[Screenshot: Scrapy Cloud Requests tab listing 2 request entries]

It's because scrapinghub-entrypoint-scrapy's sh_scrapy.HubstorageDownloaderMiddleware (ref) records requests in the Requests tab based on the responses received:

def process_response(self, request, response, spider):
    self.pipe_writer.write_request(
        url=response.url,
        status=response.status,
        method=request.method,
        rs=len(response.body),
        duration=request.meta.get('download_latency', 0) * 1000,
        parent=request.meta.setdefault(HS_PARENT_ID_KEY),
        fp=request_fingerprint(request),
    )
    ...

An option would be to update https://github.com/scrapinghub/scrapinghub-entrypoint-scrapy to record requests by processing the actual requests instead of the responses. This seems quite straightforward, though I'm not sure what the reasoning behind the original implementation was.
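
For illustration, a rough sketch of what that could look like (hypothetical code, not from the actual repo, reusing the names from the snippet above):

def process_request(self, request, spider):
    # Hypothetical variant that records the request up front. Response
    # size, status and download latency are not known yet at this point.
    self.pipe_writer.write_request(
        url=request.url,
        status=None,  # unknown before the response arrives
        method=request.method,
        rs=0,  # response size unknown here
        duration=0,  # download latency unknown here
        parent=request.meta.setdefault(HS_PARENT_ID_KEY),
        fp=request_fingerprint(request),
    )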

kmike commented 1 year ago

based on processing the actual requests instead of responses

The recorded information includes the response size, so it seems to require the response.

PyExplorer commented 1 year ago

These issues were reproduced with the spider from the first comment.

The first thing I noticed is that these issues happen only when we request an item (like Product) in the callback:

    def parse_product(
        self,
        response: DummyResponse,
        product: Product,
    ):

With ProductPage (for example) there are no issues and everything is calculated correctly, because no DummyResponse is created in that case. Test job: https://app.zyte.com/p/683642/4/7/requests.
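
For reference, a rough sketch of that no-issue variant (assumptions: zyte_common_items exposes ProductPage, the Scrapy version supports async callbacks, and to_item() is awaited per web-poet):

import scrapy
from zyte_common_items import ProductPage

class SamplePageSpider(scrapy.Spider):
    name = "sample_page"

    def start_requests(self):
        yield scrapy.Request(
            "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
            self.parse_product,
        )

    # The real response is downloaded and injected alongside the page
    # object, so no DummyResponse is created and the stats stay consistent.
    async def parse_product(self, response, product_page: ProductPage):
        yield await product_page.to_item()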

At the moment, I'm considering the following approaches to fix the issues.

  1. Duplication of requests in the "Requests" tab. The main issue, as stated before, is that requests are recorded in process_response. The straightforward fix is to skip that recording for DummyResponse.
    It can be done in several ways:
    • the easiest way: check the name of the class, like
      def process_response(self, request, response, spider):
          if type(response).__name__ == "DummyResponse":
              return response
          self.pipe_writer.write_request(
              url=response.url,
              status=response.status,
              ...
    • by importing DummyResponse and checking with isinstance:
      def process_response(self, request, response, spider):
          if isinstance(response, DummyResponse):
              return response
          self.pipe_writer.write_request(
              url=response.url,
              status=response.status,
              ...

      Cons: this approach requires adding scrapy_poet as a dependency of the scrapinghub-entrypoint-scrapy repo.

I've tested both of these approaches, and they work well.

  2. The duplicated counters in the Scrapy crawl stats. The approaches are the same: check for DummyResponse, or for a special parameter (in meta or cb_kwargs), in https://github.com/scrapy/scrapy/blob/master/scrapy/downloadermiddlewares/stats.py#L41. Instead of changing Scrapy itself, we can create a custom stats middleware based on DownloaderStats, or use the stats collector; see the sketch below.
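
A minimal sketch of that custom middleware idea (the class name and the downloader/dummy_response_count key are illustrative, and scrapy_poet is assumed to be installed):

from scrapy.downloadermiddlewares.stats import DownloaderStats
from scrapy_poet import DummyResponse

class PoetAwareDownloaderStats(DownloaderStats):
    def process_response(self, request, response, spider):
        if isinstance(response, DummyResponse):
            # Count dummy responses under their own key instead of
            # inflating downloader/response_count.
            self.stats.inc_value("downloader/dummy_response_count", spider=spider)
            return response
        return super().process_response(request, response, spider)

It would then replace the stock entry in the project settings (myproject.middlewares is a placeholder path; 850 is the stock middleware's default priority):

DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.stats.DownloaderStats": None,
    "myproject.middlewares.PoetAwareDownloaderStats": 850,
}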

@kmike @BurnzZ could you take a look and let me know whether it's OK to go with any of these approaches, or whether I need to find something else?

Gallaecio commented 1 year ago

We don’t need to make scrapy-poet a dependency; we only need to try to import from it, and skip the check if the import fails.
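
Something along these lines, for example (a minimal sketch; the helper name is made up):

# scrapy_poet stays optional: when it is not installed, no DummyResponse
# can exist anyway, so the check safely degrades to False.
try:
    from scrapy_poet import DummyResponse
except ImportError:
    DummyResponse = None

def is_dummy_response(response):
    return DummyResponse is not None and isinstance(response, DummyResponse)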

In the long run, the plan is not to have DummyResponse at all. Until that happens, I think modifying scrapinghub-entrypoint-scrapy may not be a bad idea.

BurnzZ commented 1 year ago

Having the following in https://github.com/scrapinghub/scrapinghub-entrypoint-scrapy/blob/master/sh_scrapy/middlewares.py#L62-L71 seems to be the easiest way indeed:

    def process_response(self, request, response, spider):
        if type(response).__name__ == "DummyResponse":
            return response
        self.pipe_writer.write_request(...)

As for the other approach, I'm not keen on the idea of passing a "skip_for_scrapy_cloud_stats" flag on the request-response pairs, since it may get lost along the way (e.g. some other middleware could mess with it).

PyExplorer commented 1 year ago

Thanks @Gallaecio @BurnzZ! I'll test this approach then:

    def process_response(self, request, response, spider):
        if type(response).__name__ == "DummyResponse":
            return response
        self.pipe_writer.write_request(...)

PyExplorer commented 1 year ago

PR https://github.com/scrapinghub/scrapinghub-entrypoint-scrapy/pull/73

Gallaecio commented 10 months ago

Did https://github.com/scrapinghub/scrapinghub-entrypoint-scrapy/pull/73 fix this issue, or are there missing issues that need to be addressed here?

BurnzZ commented 10 months ago

I'm still seeing the same issue in my SC jobs this week. Perhaps the new https://github.com/scrapinghub/scrapinghub-entrypoint-scrapy version with the fix has not been deployed to all SC projects?

Gallaecio commented 10 months ago

Are you using stack 2.11?

BurnzZ commented 10 months ago

Yes. What's interesting is that it happens sporadically: sometimes I get the issue, sometimes I don't. I tried in both staging and production environments. I'll report back when it happens again and try to reproduce it.

EDIT: The SC Requests tab issue still happens only sporadically, but the stats problem still occurs.

Gallaecio commented 10 months ago

If you add scrapinghub-entrypoint-scrapy with a frozen version to your requirements, do things work, or does it also fail sporadically?
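
For example, in requirements.txt (the version below is a placeholder for whichever release includes the fix from https://github.com/scrapinghub/scrapinghub-entrypoint-scrapy/pull/73):

scrapinghub-entrypoint-scrapy==X.Y.Z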