scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

Does locator.click() use scrapy custom headers? #300

Closed blacksteel1288 closed 4 months ago

blacksteel1288 commented 4 months ago

I'm using a scrapy middleware 'RotateAgentMiddleware' to set request headers, which seems to work fine in most cases.

However, I'm noticing that when I use the .click() method on a locator (e.g. page.locator('div.something').click()) inside my parse method, these custom middleware headers are not being used. I determined this by debugging with headless: False, setting a breakpoint immediately before the click(), then using Chrome dev tools in the visible browser to inspect the request headers.

Is that correct (that the scrapy headers are not used in the .click())? If so, is there a way I can send my custom headers with any .click() method?

Thank you!

elacuesta commented 4 months ago

That is correct, network operations resulting from manipulating the page once you're in the callback are processed directly by Playwright, bypassing the Scrapy workflow[1]. This is briefly explained here for PageMethod objects, the same applies to this case. Thanks for noticing it, I will add it to the README. Regarding modifying headers, see the docs for the PLAYWRIGHT_PROCESS_REQUEST_HEADERS setting.

[1] they can still be intercepted with the PLAYWRIGHT_PROCESS_REQUEST_HEADERS and PLAYWRIGHT_ABORT_REQUEST settings
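For reference, a minimal sketch of the three ways the setting can be configured in settings.py (the dotted path in the last option, myproject.headers.custom_headers, is a hypothetical name for a user-defined function):

```python
# settings.py (sketch)
from scrapy_playwright.headers import use_scrapy_headers

# Default: copy headers set by Scrapy, but only for navigation requests
PLAYWRIGHT_PROCESS_REQUEST_HEADERS = use_scrapy_headers

# Or: do not override headers at all, let Playwright handle them
# PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None

# Or: a user-defined coroutine function (hypothetical path)
# PLAYWRIGHT_PROCESS_REQUEST_HEADERS = "myproject.headers.custom_headers"
```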

blacksteel1288 commented 4 months ago

Ok, that explains it.

The documentation for PLAYWRIGHT_PROCESS_REQUEST_HEADERS seems a little confusing to me. It appears that I can use that to set the same headers I'm using with Scrapy, which I assume would persist for any page methods? e.g. PLAYWRIGHT_PROCESS_REQUEST_HEADERS = custom_headers where I copy all the header values I'm using.

But the part of the documentation about setting PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None says that only headers set by Playwright will be sent.

Is that not what is happening in the first part of the documentation where I'm setting my own custom headers? What is the difference?

Thank you!

elacuesta commented 4 months ago

scrapy-playwright uses Route.continue_ internally to modify requests, and PLAYWRIGHT_PROCESS_REQUEST_HEADERS controls whether or not and how the optional headers argument is passed. There are 3 possibilities for PLAYWRIGHT_PROCESS_REQUEST_HEADERS:

  1. scrapy_playwright.headers.use_scrapy_headers, the default value, code available here. As explained in the docstring, this populates the headers with whatever is set by Scrapy (e.g. middlewares, both built-in and user-defined) but only for navigation requests.
  2. None. Do not override headers, let Playwright handle them.
  3. A user-defined function. For instance, if you wanted to set your own headers for all requests, both navigation and background.
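As an illustration of option 3, here is a minimal sketch of a user-defined function (UA_POOL and its values are made up for the example). Note that it runs for every Playwright request, navigation and background alike, not once per Scrapy request:

```python
import random

# Hypothetical pool of User-Agent strings for the example
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleUA/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleUA/2.0",
]

async def custom_headers(browser_type, playwright_request, scrapy_headers):
    # Start from the headers Playwright already computed for this request
    headers = await playwright_request.all_headers()
    # This function is invoked once per Playwright request (main page,
    # images, style sheets, etc.), so a new value is picked each time.
    headers["User-Agent"] = random.choice(UA_POOL)
    return headers
```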

blacksteel1288 commented 4 months ago

Got it, this is clearer now.

For 3, is the user-defined function called once or multiple times inside each scrapy.Request? i.e. are the headers set for the entire Request?

For example, if I'm using a random function to select a header from a list of headers in the user-defined function, could that be changing values multiple times inside of a given scrapy.Request?

elacuesta commented 4 months ago

For each Scrapy request there is at least one Playwright request (for the main URL), but it's possible that more Playwright requests are generated to retrieve additional assets (images, style sheets, etc). The function will be called for all of those generated Playwright requests. I've tried to clarify the behavior on a816f86046e967679bf9ae3faeece448dcefb53e.

For 3, is the user-defined function called once or multiple times inside each scrapy.Request? i.e. are the headers set for the entire Request?

The idea of this feature is to allow modifying Playwright requests, not Scrapy ones.

blacksteel1288 commented 4 months ago

I think that is what I'm seeing in my logging, that the custom_headers function is being called many times for a single scrapy.Request. So, the random function is randomizing every time.

Ideally, what I would want is to set the custom_headers once for the entire scrapy.Request, including the main URL and any subsequent requests made by Playwright. It isn't obvious to me how to do that, though, other than with constant values as in the example shown. Is there a way to do this?

elacuesta commented 4 months ago

One idea that comes to mind is setting a header in the Scrapy request (e.g. when producing the request in the spider or in a middleware) and then picking up that header in the header processing function. That way the header will have the same value for all Playwright requests generated for each Scrapy request.

# Requires scrapy-playwright configured with
# PLAYWRIGHT_PROCESS_REQUEST_HEADERS = custom_headers
import scrapy
from playwright.async_api import Request as PlaywrightRequest
from scrapy.http.headers import Headers


class MySpider(scrapy.Spider):
    name = "my_spider"

    def start_requests(self):
        yield scrapy.Request("https://example.org", headers={"asdf": "qwerty"})


async def custom_headers(
    browser_type: str,
    playwright_request: PlaywrightRequest,
    scrapy_headers: Headers,
) -> dict:
    # The "asdf" header is set once on the Scrapy request, so it keeps the
    # same value across all Playwright requests derived from it.
    scrapy_headers_str = scrapy_headers.to_unicode_dict()
    playwright_headers = await playwright_request.all_headers()
    playwright_headers["asdf"] = scrapy_headers_str["asdf"]
    return playwright_headers

See also #303.

blacksteel1288 commented 4 months ago

Thanks, that works!