Open kmike opened 1 year ago
Findings so far:
Yeah, the problem AFAIK is that ItemProvider calls build_instances itself. https://github.com/scrapinghub/scrapy-poet/pull/151 is actually about a third request done in this or similar use case.
We also thought the solution may involve the caching feature in ItemProvider but didn't investigate further.
New finding: Switching MyItem to MyPage works, even if there is still some level of indirection. Could explain why https://github.com/scrapinghub/scrapy-poet/pull/153 works.
I looked into this further and it still occurs without any Page Objects involved.
The Zyte API requests that were sent were identified by setting ZYTE_API_LOG_REQUESTS=True.
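For anyone reproducing this, a minimal settings sketch follows. It assumes a project where scrapy-zyte-api is already configured; only the logging setting named above is shown.

```python
# settings.py (sketch): log the parameters of every Zyte API request
# so that the payloads quoted in this thread show up in the crawl log.
ZYTE_API_LOG_REQUESTS = True
```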
Given the following spider:
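import scrapy
from scrapy_poet import DummyResponse
from zyte_common_items import ProductNavigation


class BooksSpider(scrapy.Spider):
    name = "books"

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com",
            callback=self.parse_nav,
            meta={"zyte_api": {"browserHtml": True}},
        )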
The following callback setup is correct since it sends only 1 request:
    # {"productNavigation": true, "url": "https://books.toscrape.com"}
    def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
        ...
However, the following has 2 separate requests:
    # {"browserHtml": true, "url": "https://books.toscrape.com"}
    # {"productNavigation": true, "url": "https://books.toscrape.com"}
    def parse_nav(self, response, navigation: ProductNavigation):
        ...
This case should not happen since browserHtml and productNavigation can both be present in the same Zyte API request.
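As a rough illustration of the expected behavior, the two payloads could be folded into one. This is only a sketch: merge_zyte_api_params is a hypothetical helper, not part of scrapy-poet or scrapy-zyte-api, and it ignores any parameter conflicts beyond a URL mismatch.

```python
from typing import Any, Dict


def merge_zyte_api_params(*param_sets: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: fold several Zyte API parameter dicts that
    target the same URL into a single request payload."""
    merged: Dict[str, Any] = {}
    for params in param_sets:
        url = params.get("url")
        # Refuse to merge payloads that point at different URLs.
        if "url" in merged and url is not None and url != merged["url"]:
            raise ValueError("cannot merge requests for different URLs")
        merged.update(params)
    return merged


# The two payloads observed in the log for the 2-request case.
from_meta = {"browserHtml": True, "url": "https://books.toscrape.com"}
from_provider = {"productNavigation": True, "url": "https://books.toscrape.com"}

combined = merge_zyte_api_params(from_meta, from_provider)
print(combined)
```

With the two payloads above, the result is a single dict carrying both browserHtml and productNavigation for the same URL, which is what a single combined request would need to send.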
However, if we introduce a Page Object to the same spider:
import attrs
from web_poet import BrowserResponse, ItemPage, field, handle_urls
from zyte_common_items import ProductNavigation


@handle_urls("")
@attrs.define
class ProductNavigationPage(ItemPage[ProductNavigation]):
    response: BrowserResponse
    nav_item: ProductNavigation

    @field
    def url(self):
        return self.nav_item.url

    @field
    def categoryName(self) -> str:
        return f"(modified) {self.nav_item.categoryName}"
Then, the following callback setup would have 3 separate Zyte API requests:
    # {"browserHtml": true, "url": "https://books.toscrape.com"}
    # {"productNavigation": true, "url": "https://books.toscrape.com"}
    # {"browserHtml": true, "url": "https://books.toscrape.com"}
    def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
        ...
Note that the same series of 3 separate requests still occurs on:
    def parse_nav(self, response, navigation: ProductNavigation):
        ...
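To compare the cases quickly, the logged payloads can be tallied to spot exact repeats. The sketch below is a standalone diagnostic, not part of any of the libraries discussed; redundant_payloads is a hypothetical name, and the payload list is copied by hand from the 3-request case above.

```python
import json
from collections import Counter
from typing import Any, Dict, List


def redundant_payloads(payloads: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: return each payload that was sent more than once."""
    # Serialize with sorted keys so that key order does not affect equality.
    counts = Counter(json.dumps(p, sort_keys=True) for p in payloads)
    return [json.loads(key) for key, n in counts.items() if n > 1]


# Payloads from the 3-request case, in the order they were logged.
logged = [
    {"browserHtml": True, "url": "https://books.toscrape.com"},
    {"productNavigation": True, "url": "https://books.toscrape.com"},
    {"browserHtml": True, "url": "https://books.toscrape.com"},
]

repeats = redundant_payloads(logged)
print(repeats)
```

Here the browserHtml payload is reported as sent twice, which matches the extra request seen once the Page Object is introduced.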
I wonder if some of the unexpected requests are related to https://github.com/scrapy-plugins/scrapy-zyte-api/issues/135.
Re-opening this since Case 2 is still occurring. Case 3 has been fixed though.
@BurnzZ so do you think after your latest analysis that case 2 still happens or not?
@wRAR I can still reproduce Case 2.
OK, so the difference between this use case and ones that we already test is having "browserHtml": True in meta. Currently the provider doesn't check this at all. It looks like it should? cc: @kmike
OTOH, even if we handle this in the provider, I'm not sure that would stop the request itself from being sent.
@wRAR Let's try to focus on how Case 2 (or any of these cases) affects https://github.com/zytedata/zyte-spider-templates, not on the case itself. The priority of supporting meta is not clear to me now; it may not be necessary in the end, or it could be.
In the example below ZyteApiProvider makes 2 API requests instead of 1: