microsoft / playwright-java

Java version of the Playwright testing and automation library
https://playwright.dev/java/
Apache License 2.0

[Feature]: Playwright Cluster to parallelize processing request on playwright #593

Closed nikitha12 closed 3 years ago

nikitha12 commented 3 years ago

Feature request

I want to process a large number of requests, on the order of 100,000 pages, quickly. To achieve that, I might have to process 500-600 requests in parallel.

Currently, for each URL I launch an instance of playwright and close it when I am done. But the number of Playwright instances that can be launched is limited by the resources available on a machine or pod. To scale, I have to increase either the resources or the number of pods. I am looking for a better way to parallelize request processing with Playwright.

Can we have a Playwright cluster similar to the puppeteer cluster? It would maintain a pool of Playwright instances. Each request would carry a proxy and a config describing what to do after page navigation. The cluster would handle errors and retries.

yury-s commented 3 years ago

> Currently, for each URL I launch an instance of playwright and close it when I am done.

Why close the Playwright instance? You can just close the context, then create a new context and a new page in it, which is much faster. This is the recommended way to isolate pages from each other while maintaining good performance.

> Can we have a Playwright cluster similar to the puppeteer cluster?

It should be fairly easy to replace Puppeteer with Playwright in the puppeteer cluster implementation and end up with something similar. I don't know much about puppeteer cluster, but from a brief look it seems that for this particular task it might be easier to use the Node.js version of Playwright, which can control multiple pages in the same context, or multiple contexts in the same browser, concurrently. The synchronous nature of the Java API makes this harder.

Scraping use cases are not a priority at the moment, and we have no plans to work in this direction in the Java client, so I'm closing this request.
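
For readers who still want something cluster-like in Java, the advice above can be sketched roughly as follows. This is a minimal illustration, not part of Playwright: `WorkerPool`, its constructor parameters, and the retry policy are all hypothetical. It assumes each worker thread creates one long-lived resource (for Playwright, a wrapper holding the `Playwright` instance and a launched `Browser`) and uses it only on that thread, since the Playwright Java documentation describes the API as not thread safe. The per-URL task is where a fresh context and page would be opened and closed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.BiFunction;
import java.util.function.Supplier;

/** Hypothetical helper: N worker threads, each owning one long-lived resource. */
final class WorkerPool<R extends AutoCloseable, T> {
    private final int workers;
    private final int maxAttempts;
    private final Supplier<R> resourceFactory;   // called once per worker thread
    private final BiFunction<R, String, T> task; // called once per attempt on a URL

    WorkerPool(int workers, int maxAttempts,
               Supplier<R> resourceFactory, BiFunction<R, String, T> task) {
        this.workers = workers;
        this.maxAttempts = maxAttempts;
        this.resourceFactory = resourceFactory;
        this.task = task;
    }

    /** Returns url -> result. URLs that fail on every attempt (or yield null) are omitted. */
    Map<String, T> run(List<String> urls) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(urls);
        Map<String, T> results = new ConcurrentHashMap<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread thread = new Thread(() -> {
                // The resource is created, used, and closed on this thread only.
                try (R resource = resourceFactory.get()) {
                    String url;
                    while ((url = queue.poll()) != null) {
                        T value = attempt(resource, url);
                        if (value != null) {
                            results.put(url, value);
                        }
                    }
                } catch (Exception e) {
                    // Creating or closing the resource failed: this worker stops,
                    // the remaining workers keep draining the queue.
                }
            });
            threads.add(thread);
            thread.start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return results;
    }

    /** Runs the task up to maxAttempts times; returns null if every attempt throws. */
    private T attempt(R resource, String url) {
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return task.apply(resource, url);
            } catch (RuntimeException e) {
                // Swallowed so the next attempt can run; a real pool would log this.
            }
        }
        return null;
    }
}
```

With Playwright, the task would typically call `browser.newContext()`, `context.newPage()`, and `page.navigate(url)`, then close the context before returning, so the expensive browser process is reused while each URL still gets an isolated context. The number of workers that fit on one machine still depends on available memory and CPU, so this does not remove the need to scale out for hundreds of parallel requests.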