Open Yomguithereal opened 1 year ago
Hi Yomguithereal. I have a similar question. Is it possible to open pages with different contexts all in one playwright instance (one browser window)? I need to post a lot of different queries to the same website, using different proxies to avoid being detected. It would consume a lot of system resources if the different contexts were opened in different browser windows. Thanks
Always consider that playwright is primarily a testing framework, not a web scraping framework.
Your question
Hello playwright team,
I know the consensus about multithreading & playwright is that you should create one playwright instance per thread because playwright is not threadsafe, which is right. But one instance per thread seems quite costly and it is sad to lose the ability of the asyncio implementation to do multiple things at once on multiple tabs (it can do that, no?)
So I dug into the problem and found a way to use an asyncio playwright safely from a multithreaded context. This means you can still do multiple things concurrently using a single playwright instance, all while interacting with the browser from multiple threads, which is a legitimate use case for legacy and other usability reasons. In my personal case I have a webmining project named minet in which I need to combine multithreaded urllib3 or pycurl calls, sometimes interwoven with playwright tasks, for complex web crawling, and I orchestrate the threaded work using the quenouille library. In this context, mixing threads and asyncio for orchestration is a fully-fledged nightmare, so I wanted to find a way to pilot an asynchronous playwright instance from multiple threads.
The solution therefore relies on asyncio.run_coroutine_threadsafe.
There is some threading glue code involved of course for synchronization but the rest is pretty straightforward.
Here is an example of such a class: https://github.com/medialab/minet/blob/master/minet/browser/threadsafe_browser.py
Here is an example of it being used: https://github.com/medialab/minet/blob/master/ftest/playwright-threading.py
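Independently of the linked implementation, the general pattern can be sketched with the standard library alone: one background thread owns the event loop, and worker threads hand coroutines to it through asyncio.run_coroutine_threadsafe. This is only a minimal sketch. The names LoopThread, fake_page_task and crawl_with_threads are hypothetical, and fake_page_task is a stand-in for real playwright calls so that the snippet runs without a browser.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class LoopThread:
    """Owns an asyncio event loop running in a dedicated background thread."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self._run_loop, daemon=True)
        self.thread.start()

    def _run_loop(self):
        # The loop only ever runs in this thread.
        asyncio.set_event_loop(self.loop)
        self.loop.run_forever()

    def run(self, coro):
        """Submit a coroutine from any thread and block until it finishes."""
        future = asyncio.run_coroutine_threadsafe(coro, self.loop)
        return future.result()

    def stop(self):
        # Stopping must also be requested in a thread-safe way.
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.thread.join()
        self.loop.close()


async def fake_page_task(url):
    # Hypothetical stand-in for browser work (opening a page, reading content).
    await asyncio.sleep(0.01)
    return f"visited {url}"


def crawl_with_threads(urls, workers=4):
    """Worker threads all drive the same event loop concurrently."""
    runner = LoopThread()
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(lambda u: runner.run(fake_page_task(u)), urls))
    finally:
        runner.stop()
```

With playwright, the assumption is that the async playwright instance and the browser would be created by a coroutine submitted to that same loop, so that every playwright object is only ever touched from the loop's thread.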
But now I have some questions:
Some other related notes:
On running the playwright command line programmatically from Python: I do it like so, by copying/repurposing some internal code: https://github.com/medialab/minet/blob/master/minet/browser/plawright_shim.py
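For comparison, a simpler alternative to the shim linked above (not the approach it takes) is to spawn the command line as a subprocess through the package's module entry point. This is a rough sketch that assumes the playwright package is installed in the current interpreter. The helper names playwright_command and run_playwright_cli are hypothetical.

```python
import subprocess
import sys


def playwright_command(*args):
    """Build the argv that runs the playwright CLI with this interpreter."""
    return [sys.executable, "-m", "playwright", *args]


def run_playwright_cli(*args):
    """Run the playwright CLI in a subprocess and return its exit code."""
    completed = subprocess.run(playwright_command(*args), check=False)
    return completed.returncode
```

For instance, run_playwright_cli("install", "chromium") would be expected to download the Chromium build, at the cost of starting a separate process rather than calling into the package directly.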