thomasdondorf / puppeteer-cluster

Puppeteer Pool, run a cluster of instances in parallel
MIT License

Extra <3 Cluster #228

Open berstend opened 4 years ago

berstend commented 4 years ago

Hey there, great job with puppeteer-cluster 👍

I'm the maintainer of puppeteer-extra, and while updating the readme with more usage examples (after rewriting the core in TS) I noticed how well extra + cluster play together. 😄

import { Cluster } from "puppeteer-cluster"
import vanillaPuppeteer from "puppeteer"

import { addExtra } from "puppeteer-extra"
import Stealth from "puppeteer-extra-plugin-stealth"
import Recaptcha from "puppeteer-extra-plugin-recaptcha"

async function main() {
  // Create a custom puppeteer-extra instance using `addExtra`,
  // so we could create additional ones with different plugin config.
  const puppeteer = addExtra(vanillaPuppeteer)
  puppeteer.use(Stealth())
  puppeteer.use(Recaptcha())

  // Launch cluster with puppeteer-extra
  const cluster = await Cluster.launch({
    puppeteer,
    maxConcurrency: 2,
    concurrency: Cluster.CONCURRENCY_CONTEXT
  })

  // Define task handler
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url)

    const { hostname } = new URL(url)
    const { captchas } = await page.findRecaptchas()
    console.log(`Found ${captchas.length} captcha on ${hostname}`)

    await page.screenshot({ path: `${hostname}.png`, fullPage: true })
  })

  // Queue any number of tasks
  cluster.queue("https://bot.sannysoft.com")
  cluster.queue("https://www.google.com/recaptcha/api2/demo")
  cluster.queue("http://www.wikipedia.org/")

  await cluster.idle()
  await cluster.close()
  console.log(`All done, check the screenshots. ✨`)
}

// Let's go
main().catch(console.warn)

Some thoughts and observations:

Firefox works really well out of the box with cluster (const vanillaPuppeteer = require("puppeteer-firefox")), which I found neat. 💯

The out-of-the-box cluster experience is a bit impaired when using TypeScript:

node_modules/puppeteer-cluster/dist/Cluster.d.ts:1:23 - error TS2688: Cannot find type definition file for 'node'.
1 /// <reference types="node" />
                        ~~~~
node_modules/puppeteer-cluster/dist/Cluster.d.ts:2:37 - error TS2307: Cannot find module 'puppeteer'.
2 import { LaunchOptions, Page } from 'puppeteer';
                                      ~~~~~~~~~~~
node_modules/puppeteer-cluster/dist/Cluster.d.ts:3:30 - error TS2307: Cannot find module 'events'.
3 import { EventEmitter } from 'events';
                               ~~~~~~~~
node_modules/puppeteer-cluster/dist/concurrency/ConcurrencyImplementation.d.ts:1:37 - error TS2307: Cannot find module 'puppeteer'.
1 import { Page, LaunchOptions } from 'puppeteer';
                                      ~~~~~~~~~~~

There seems to be no strong consensus in the TS community on how best to handle this, but the majority seems to lean towards moving the type packages that a library's published .d.ts files reference from devDependencies to regular dependencies, so that TS users get them installed automatically.
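A minimal sketch of that move, written as a small TypeScript helper so it can be run as-is. The `PackageManifest` type, the `promoteToDependencies` helper, and the example manifest are hypothetical; the package names `@types/node` and `@types/puppeteer` are an assumption based on the errors above, and the `"*"` version ranges are placeholders, not the ones puppeteer-cluster uses.

```typescript
// Only the package.json fields involved here (everything else omitted).
interface PackageManifest {
  dependencies?: Record<string, string>
  devDependencies?: Record<string, string>
}

// Returns a copy of the manifest in which the named packages are moved
// from devDependencies to dependencies. The input is not mutated.
function promoteToDependencies(
  manifest: PackageManifest,
  names: string[]
): Required<PackageManifest> {
  const dependencies = { ...manifest.dependencies }
  const devDependencies = { ...manifest.devDependencies }
  for (const name of names) {
    const range = devDependencies[name]
    if (range === undefined) continue // not a dev dependency, nothing to move
    dependencies[name] = range
    delete devDependencies[name]
  }
  return { ...manifest, dependencies, devDependencies }
}

// Hypothetical manifest with placeholder version ranges.
const before = {
  dependencies: { debug: "*" },
  devDependencies: { "@types/node": "*", "@types/puppeteer": "*", typescript: "*" },
}

// Only the type packages referenced by the published .d.ts files are moved.
const after = promoteToDependencies(before, ["@types/node", "@types/puppeteer"])
console.log(JSON.stringify(after, null, 2))
```

Build-only tooling such as `typescript` stays in devDependencies; only the type packages that consumers need in order to compile against the library are promoted.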

If you ever need to extend the Puppeteer interfaces (e.g. page.findRecaptchas()):

I eventually found a type-safe way by shipping an ambient d.ts file. This is pretty rough to do currently and requires some additional tooling, but it works reliably (even when piping puppeteer through cluster). :)
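The mechanism behind such an ambient d.ts is TypeScript's declaration merging: a second `interface` declaration with the same name adds members to the first. The sketch below shows that merging in a single file so it runs without puppeteer installed. The `Page` interface here is a simplified stand-in, and `fakePage` and `countCaptchas` are hypothetical names used only for illustration; in a real plugin the second declaration would sit inside a `declare module "puppeteer" { ... }` block in the shipped d.ts file.

```typescript
// Stand-in for the Page interface from puppeteer's typings (simplified, hypothetical).
interface Page {
  url(): string
}

// A second declaration with the same name merges into the first one.
// An ambient d.ts would place this inside `declare module "puppeteer" { ... }`.
interface Page {
  findRecaptchas(): Promise<{ captchas: string[] }>
}

// Fake page object for illustration; a real page comes from the browser.
const fakePage: Page = {
  url: () => "https://example.com",
  findRecaptchas: () => Promise.resolve({ captchas: ["demo-captcha"] }),
}

// Type-checks because the merged Page interface includes findRecaptchas().
async function countCaptchas(page: Page): Promise<number> {
  const { captchas } = await page.findRecaptchas()
  return captchas.length
}

countCaptchas(fakePage).then((count) => {
  console.log(`Found ${count} captcha on ${fakePage.url()}`)
})
```

Because the merge happens at the type level only, nothing changes at runtime; the plugin still has to attach the actual method to the page object.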

Anyway, just wanted to give you a quick thumbs up for your work on cluster - this ticket can therefore be closed. :)

berstend commented 4 years ago

PS: I made a documentation.js fork that can generate docs which look very similar to the pptr ones (based on JSDoc-annotated TypeScript): https://github.com/berstend/documentation-markdown-themes/wiki#documentationjs-with-markdown-theme-support

e.g. this API documentation is auto-generated: https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra#api

thomasdondorf commented 4 years ago

Thanks for your kind words, much appreciated. :)

That documentation generator looks super nice. I hope I have some time over the holidays to give it a try. So far, I wrote the markdown by hand...

Regarding TypeScript: Interesting, so I guess I can resolve the problem by moving the type dependencies from devDependencies to dependencies? I guess this is only a small burden for non-TypeScript users anyway.