microsoft / playwright-python

Python version of the Playwright testing and automation library.
https://playwright.dev/python/
Apache License 2.0
11.52k stars 881 forks

[Question]: I found a way to use a single playwright instance in a multithreaded context #2001

Open Yomguithereal opened 1 year ago

Yomguithereal commented 1 year ago

Your question

Hello playwright team,

I know the consensus about multithreading and playwright is that you should create one playwright instance per thread because playwright is not thread-safe, which is right. But one instance per thread seems quite costly, and it is sad to lose the ability of the asyncio implementation to do multiple things at once on multiple tabs (it can do that, no?).

So I dug into the problem and found a way to use an asyncio playwright safely from a multithreaded context. This means you can still do multiple things concurrently using a single playwright instance, all while interacting with the browser from multiple threads, which is a legitimate use case for legacy and usability reasons. In my case, I have a webmining project named minet in which I sometimes need to interleave multithreaded urllib3 or pycurl calls with playwright tasks for complex web crawling, and I orchestrate the threaded work using the quenouille library. In this context, mixing threads and asyncio for orchestration is a full-fledged nightmare, so I wanted to find a way to pilot an asynchronous playwright instance from multiple threads.

The solution is therefore the following:

  1. You need to have some class that will spawn a thread in which a new asyncio loop will run
  2. Then you need to start the playwright instance in said thread and make the loop run forever
  3. Then you can send "jobs" using coroutine functions called through asyncio.run_coroutine_threadsafe

There is some threading glue code involved of course for synchronization but the rest is pretty straightforward.
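
A minimal sketch of those three steps follows. It is not the minet implementation, only an illustration of the pattern: the names `LoopThread`, `ThreadsafeBrowserSketch`, and `title` are hypothetical, and the Playwright part assumes Chromium is installed. `LoopThread` covers steps 1 and 3 (a background thread running its own loop, fed through `asyncio.run_coroutine_threadsafe`), and the subclass covers step 2 by starting Playwright inside that loop.

```python
import asyncio
import threading


class LoopThread:
    """Hypothetical helper: one asyncio loop running forever in its own thread."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self._ready = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        asyncio.set_event_loop(self.loop)
        # Signal readiness from inside the loop, once it is actually running.
        self.loop.call_soon(self._ready.set)
        self.loop.run_forever()

    def start(self):
        self._thread.start()
        self._ready.wait()

    def run(self, coro, timeout=None):
        """Submit a coroutine from any thread and block until it returns."""
        future = asyncio.run_coroutine_threadsafe(coro, self.loop)
        return future.result(timeout)

    def stop(self):
        self.loop.call_soon_threadsafe(self.loop.stop)
        self._thread.join()
        self.loop.close()


class ThreadsafeBrowserSketch(LoopThread):
    """Hypothetical wrapper: a single async Playwright instance living in the loop thread."""

    def start(self):
        super().start()
        self.run(self._launch())

    async def _launch(self):
        # Imported lazily so the generic helper above works without Playwright.
        from playwright.async_api import async_playwright

        self._playwright = await async_playwright().start()
        self._browser = await self._playwright.chromium.launch()

    async def _title(self, url):
        page = await self._browser.new_page()
        try:
            await page.goto(url)
            return await page.title()
        finally:
            await page.close()

    def title(self, url):
        """Blocking call, safe to use from any worker thread."""
        return self.run(self._title(url))

    async def _shutdown(self):
        await self._browser.close()
        await self._playwright.stop()

    def stop(self):
        self.run(self._shutdown())
        super().stop()
```

Worker threads would then call `browser.title(url)` on a shared `ThreadsafeBrowserSketch` after `browser.start()`. Every Playwright object is only ever touched from the loop thread, while the callers just wait on a `concurrent.futures.Future`, so several pages can be in flight at once on the single instance.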

Here is an example of such a class: https://github.com/medialab/minet/blob/master/minet/browser/threadsafe_browser.py

Here is an example of it being used: https://github.com/medialab/minet/blob/master/ftest/playwright-threading.py

But now I have some questions:

Some other related notes:

ukenmisneru commented 1 year ago

Hi Yomguithereal. I have a similar question. Is it possible to open pages with different contexts all in one playwright instance (one browser window)? I need to post a lot of different queries to the same website using different proxies to avoid being detected. It will consume a lot of system resources if the different contexts are opened in different browser windows. Thanks.

dgtlmoon commented 10 months ago

Always consider that playwright is primarily a testing framework, not a web scraping framework.