Closed: blacksteel1288 closed this issue 4 months ago.
That is correct: network operations resulting from manipulating the page once you're in the callback are processed directly by Playwright, bypassing the Scrapy workflow[1]. This is briefly explained in the docs for PageMethod objects; the same applies to this case. Thanks for noticing it, I will add it to the README.
Regarding modifying headers, see the docs for the PLAYWRIGHT_PROCESS_REQUEST_HEADERS
setting.
[1] they can still be intercepted with the PLAYWRIGHT_PROCESS_REQUEST_HEADERS and PLAYWRIGHT_ABORT_REQUEST settings
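To make footnote [1] concrete: a minimal sketch of an abort predicate for the PLAYWRIGHT_ABORT_REQUEST setting. The function name and the filtered resource types here are my own illustrative choices, not from this thread; the setting takes a predicate over the Playwright request, and returning True aborts it, so even Playwright-initiated sub-requests can be filtered.

```python
# settings.py sketch (assumption: a standard scrapy-playwright setup).
def should_abort_request(request):
    # Drop images and media to avoid fetching assets the spider doesn't need.
    return request.resource_type in ("image", "media")

PLAYWRIGHT_ABORT_REQUEST = should_abort_request
```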
Ok, that explains it.
The documentation for PLAYWRIGHT_PROCESS_REQUEST_HEADERS seems a little confusing to me. It appears that I can use that to set the same headers I'm using with Scrapy, which I assume would persist for any page methods? e.g. PLAYWRIGHT_PROCESS_REQUEST_HEADERS = custom_headers, where I copy all the header values I'm using.
But the part about setting PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None says only headers set by Playwright will be sent.
Is that not what is happening in the first part of the documentation where I'm setting my own custom headers? What is the difference?
Thank you!
scrapy-playwright uses Route.continue_ internally to modify requests, and PLAYWRIGHT_PROCESS_REQUEST_HEADERS controls whether and how the optional headers argument is passed.

There are 3 possibilities for PLAYWRIGHT_PROCESS_REQUEST_HEADERS:

1. scrapy_playwright.headers.use_scrapy_headers, the default value, code available here. As explained in the docstring, this populates the headers with whatever is set by Scrapy (e.g. middlewares, both built-in and user-defined), but only for navigation requests.
2. None. Do not override headers; let Playwright handle them.
3. A user-defined coroutine function, which receives the Playwright request and the Scrapy headers and returns the headers to use.

Got it, this is clearer now.
For 3, is the user-defined function called once or multiple times inside each scrapy.Request? i.e. are the headers set for the entire Request?
For example, if I'm using a random function to select a header from a list of headers in the user-defined function, could that be changing values multiple times inside of a given scrapy.Request?
For each Scrapy request there is at least one Playwright request (for the main URL), but it's possible that more Playwright requests are generated to retrieve additional assets (images, style sheets, etc). The function will be called for all of those generated Playwright requests. I've tried to clarify the behavior on a816f86046e967679bf9ae3faeece448dcefb53e.
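To make that per-request behavior concrete, here is a sketch of option 3 with a random pick, using the same three-argument signature shown in this thread. The function name and UA_POOL values are illustrative placeholders, not from the thread: because the function runs once per Playwright request, a page with N sub-resources triggers N+1 calls, each with its own random pick.

```python
import random

# Illustrative pool of user-agent strings (placeholder values).
UA_POOL = ["agent-a", "agent-b", "agent-c"]

async def randomizing_headers(browser_type, playwright_request, scrapy_headers):
    # scrapy-playwright calls this once per Playwright request, so every
    # sub-resource (image, style sheet, ...) gets a fresh random value.
    headers = await playwright_request.all_headers()
    headers["user-agent"] = random.choice(UA_POOL)
    return headers
```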
For 3, is the user-defined function called once or multiple times inside each scrapy.Request? i.e. are the headers set for the entire Request?
The idea of this feature is to allow modifying Playwright requests, not Scrapy ones.
I think that is what I'm seeing in my logging: the custom_headers function is being called many times for a single scrapy.Request, so the random function is randomizing every time.

Ideally, what I would want is to set the custom_headers once for the entire scrapy.Request, including the main URL and any following requests made in Playwright. Although, it isn't obvious to me how to do that, other than with constant values as in the example shown. Or, is there a way to do this?
One idea that comes to mind is setting a header in the Scrapy request (e.g. when producing the request in the spider or in a middleware) and then picking up that header in the header processing function. That way the header will have the same value for all Playwright requests generated for each Scrapy request.
import scrapy
from playwright.async_api import Request as PlaywrightRequest
from scrapy.http.headers import Headers


class MySpider(scrapy.Spider):
    name = "my_spider"

    def start_requests(self):
        # Seed the value once per Scrapy request.
        yield scrapy.Request("https://example.org", headers={"asdf": "qwerty"})


async def custom_headers(
    browser_type: str,
    playwright_request: PlaywrightRequest,
    scrapy_headers: Headers,
) -> dict:
    # Pick up the seeded header so every Playwright request generated
    # for this Scrapy request gets the same value.
    scrapy_headers_str = scrapy_headers.to_unicode_dict()
    playwright_headers = await playwright_request.all_headers()
    playwright_headers["asdf"] = scrapy_headers_str["asdf"]
    return playwright_headers
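To wire this up, the processing function is assigned to the setting, as mentioned earlier in the thread. A minimal sketch, assuming custom_headers is importable from wherever you defined it:

```python
# settings.py -- sketch; the import path is an assumption, adjust to your project.
PLAYWRIGHT_PROCESS_REQUEST_HEADERS = custom_headers
```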
See also #303.
Thanks, that works!
I'm using a scrapy middleware 'RotateAgentMiddleware' to set request headers, which seems to work fine in most cases.
However, I'm noticing that when I use the .click() method on a locator (e.g. page.locator('div.something').click()) inside of my parse method, it appears that these custom middleware headers are not being used. I determined this by debugging with "headless": False, setting a breakpoint immediately before the click(), then using Chrome dev tools in the visible browser to see the request headers.

Is that correct (that the Scrapy headers are not used in the .click())? If so, is there a way I can send my custom headers with any .click() method?

Thank you!