scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License
911 stars 101 forks source link

Contracts and testing best practices with Scrapy-Playwright #262

Closed riordan closed 3 months ago

riordan commented 4 months ago

I'm working on a project with a large number of spiders and a fairly sizable community of open contributors. To ensure that things are working correctly over the long-term, I like the idea of using the Contract conventions on our spiders.

That said, with scrapy-playwright, most calls to the browser benefit from being async, and contracts and so I'm encountering quite a few TypeError: Contracts don't support async callbacks.

Any thoughts on good testing practices for playwright spiders?

elacuesta commented 3 months ago

I'd suggest you to request the ability to use contracts with async callbacks to upstream Scrapy. In the past I've used scrapy-autounit, although I haven't tested it with async callbacks so it might also not work for your case.

Finally, one (maybe hacky) way could be to emulate the scrapy-playwright tests themselves. See for instance make_handler, which creates a ScrapyPlaywrightDownloadHandler object that can be used to directly retrieve a response to be processed with your callback to check if the result is what you expect (similar to this test). This is not ideal because you'd be using a private handler method, I don't have any plans to change this but just be advised. I think this could be optimized by recording responses in HAR the first time (see the record_har_path & record_har_content arguments for Browser.new_context, which you can handle via the PLAYWRIGHT_CONTEXTS setting), and then routing either the context or the page to use the HAR to serve the responses (should be possible with a page init callback).

For the record, I haven't tried any of this myself, I'm just thinking out loud. Perhaps this is something that could be more easily integrated, I'm open to suggestions.

riordan commented 3 months ago

Thank you, @elacuesta! This is excellent context (and fantastic documentation).

Based on a cursory readthrough of scrapy-autounit, the way it implements callbacks is promising :crossed_fingers:.

Sadly, I wound up abandoning contracts for the above scrapy project; it turns out that when you're using attrs or dataclasses based items, the contract behaves as if all fields are being returned :thinking:. Autounit might wind up working out better overall.

Closing for now, but will update once I've tried autounit on the playwright spiders.