internetarchive / brozzler

brozzler - distributed browser-based web crawler
Apache License 2.0

feat: implementing browserless #252

Open andyMrtnzP opened 1 year ago

andyMrtnzP commented 1 year ago

Allows running Brozzler with browserless.

How to run

  1. Have browserless up and running; either the Docker image or the Node project will work with this implementation
  2. Create and activate a venv
    • python3 -m venv venv
    • source venv/bin/activate
    • pip install -e .
  3. Run brozzle-page with -e browserless:

    brozzle-page https://www.example.com -e browserless

    1. By default, browserless runs at port 3000. If it's running at another port, use the --browserless-port flag: brozzle-page https://www.example.com -e browserless --browserless-port 3030
    2. To debug Chrome's launch args, check the browserless log.
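
Before running step 3, it can help to confirm that something is actually listening on the browserless port. The following is a minimal sketch using only the Python standard library; the helper name browserless_is_reachable is hypothetical and is not part of Brozzler or browserless. It only checks that a TCP connection can be opened, not that the service behind the port is really browserless.

```python
import socket


def browserless_is_reachable(host="localhost", port=3000, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration only. Port 3000 is the
    browserless default mentioned in the steps above.
    """
    try:
        # create_connection raises OSError (e.g. connection refused,
        # timeout) when nothing accepts connections on host:port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Use the same port you would pass to --browserless-port.
    if browserless_is_reachable(port=3000):
        print("port 3000 is accepting connections")
    else:
        print("nothing is listening on port 3000; is browserless running?")
```

If the check fails while the container is running, the port mapping (for example -p 3000:3000) is a reasonable first thing to look at.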

The most significant change in this PR is that the Chrome launch args are now generated dynamically by a method rather than stored in a variable, so they can be included in the WebSocket connection made to browserless at startup to create a new browser.
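
To illustrate the idea (this is a sketch, not the code in this PR): when the args come from a method, they are computed at connection time from the current settings, so the WebSocket URL always reflects them. The class name, its attributes, and the example flags below are all hypothetical. The sketch also assumes the launch args travel as query parameters on the WebSocket URL; the exact encoding used by the PR is not shown here.

```python
from urllib.parse import quote


class BrowserlessLauncher:
    """Hypothetical stand-in for the object that starts a browser."""

    def __init__(self, host="localhost", port=3000, proxy=None):
        self.host = host
        self.port = port
        self.proxy = proxy

    def launch_args(self):
        # Built on every call, so later changes to self.proxy are
        # picked up. A module-level variable would be fixed at import.
        args = ["--disable-background-networking"]  # illustrative flag
        if self.proxy:
            args.append(f"--proxy-server={self.proxy}")
        return args

    def websocket_url(self):
        # Assumption: each launch arg is sent as a query parameter.
        # Percent-encode everything except "=" so URLs inside an arg
        # do not break the query string.
        query = "&".join(quote(arg, safe="=") for arg in self.launch_args())
        return f"ws://{self.host}:{self.port}/?{query}"


if __name__ == "__main__":
    launcher = BrowserlessLauncher(port=3030, proxy="http://localhost:8000")
    print(launcher.websocket_url())
```

Percent-encoding each arg is a design choice of this sketch; it keeps characters such as ":" and "/" in a proxy URL from being misread as part of the WebSocket URL itself.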

avdempsey commented 1 year ago

On Ubuntu, I was able to get things running by following the directions at https://linuxhandbook.com/docker-permission-denied/ and then running: docker run -p 3000:3000 browserless/chrome

On my (Intel) Mac, Joel's suggestion worked for me: docker run --user root -p 3000:3000 browserless/chrome

Andy's directions worked for me on both platforms, once I ran the container with the respective strategies above.

galgeek commented 1 year ago

I've gotten this running with Andy's instructions on my (ARM) Mac.
I did need Joel's suggestion to run the Docker image with --user root.

I've also verified that brozzle-page runs as expected without the -e browserless command-line parameter.