scrapy-plugins / scrapy-zyte-api

Zyte API integration for Scrapy
BSD 3-Clause "New" or "Revised" License
35 stars 19 forks source link

ZyteApiProvider could make an unneeded API request #91

Open kmike opened 1 year ago

kmike commented 1 year ago

In the example below ZyteApiProvide makes 2 API requests instead of 1:

@handle_urls("example.com")
@attrs.define
class MyPage(ItemPage[MyItem]):
    html: BrowserHtml
    # ...

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response: DummyResponse, product: Product, my_item: MyItem):
        # ...
Gallaecio commented 1 year ago

Findings so far:

wRAR commented 1 year ago

Yeah, the problem AFAIK is that ItemProvider calls build_instances itself. https://github.com/scrapinghub/scrapy-poet/pull/151 is actually about a third request done in this or similar use case.

wRAR commented 1 year ago

We also thought the solution may involve the caching feature in ItemProvider but didn't investigate further.

Gallaecio commented 1 year ago

Indeed.

Gallaecio commented 1 year ago

New finding: Switching MyItem to MyPage works, even if there is still some level of indirection. Could explain why https://github.com/scrapinghub/scrapy-poet/pull/153 works.

BurnzZ commented 1 year ago

I looked into this further and it still occurs without any Page Objects involved.

The sent Zyte API requests were determined by setting ZYTE_API_LOG_REQUESTS=True.

Given the following spider:

class BooksSpider(scrapy.Spider):
    name = "books"

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com",
            callback=self.parse_nav,
            meta={"zyte_api": {"browserHtml": True}},
        )

Case 1

βœ… The following callback set up is correct since it has only 1 request:

# {"productNavigation": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
    ...

Case 2

❌ However, the following has 2 separate requests:

# {"browserHtml": true, "url": "https://books.toscrape.com"}
# {"productNavigation": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response, navigation: ProductNavigation):
    ...

This case should not happen since browserHtml and productNavigation can both be present in the same Zyte API Request.

Case 3

However, if we introduce a Page Object to the same spider:

@handle_urls("")
@attrs.define
class ProductNavigationPage(ItemPage[ProductNavigation]):
    response: BrowserResponse
    nav_item: ProductNavigation

    @field
    def url(self):
        return self.nav_item.url

    @field
    def categoryName(self) -> str:
        return f"(modified) {self.nav_item.categoryName}"

❌ Then, the following callback set up would have 3 separate Zyte API Requests:

# {"browserHtml": true, "url": "https://books.toscrape.com"}
# {"productNavigation": true, "url": "https://books.toscrape.com"}
# {"browserHtml": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
    ...

Note that the same series of 3 separate requests still occurs on:

def parse_nav(self, response, navigation: ProductNavigation):
    ...
Gallaecio commented 1 year ago

I wonder if some of the unexpected requests are related to https://github.com/scrapy-plugins/scrapy-zyte-api/issues/135.

BurnzZ commented 9 months ago

Re-opening this since Case 2 is still occurring. Case 3 has been fixed though.

wRAR commented 9 months ago

@BurnzZ so do you think after your latest analysis that case 2 still happens or not?

BurnzZ commented 9 months ago

@wRAR I can still reproduce Case 2. πŸ‘

wRAR commented 9 months ago

OK, so the difference between this use case and ones that we already test is having "browserHtml": True in meta. Currently the provider doesn't check this at all. It looks like it should? cc: @kmike

wRAR commented 9 months ago

OTOH I'm not sure if even we handle this in the provider the request itself won't be sent?

kmike commented 9 months ago

@wRAR Let's try to focus on how Case 2 (or any of these cases) affect https://github.com/zytedata/zyte-spider-templates, not on the case itself. The priority of supporting meta is not clear to me now; it may not be necessary in the end, or it could be.