internetarchive / brozzler

brozzler - distributed browser-based web crawler
Apache License 2.0

Performance Suggestions? #185

Open rovo79 opened 4 years ago

rovo79 commented 4 years ago

Hello, I've been using brozzler-easy for testing, and brozzler looks to be working wonderfully. I have a very large website I am trying to archive, and I'm unsure of a few things that I can't figure out from job-conf.rst.

I'm running a copy of the website on my local machine, so the site is not being served from its public domain. Is there a way to get brozzler to replace my localhost domain with the actual public domain?

Another question: is there any way to boost performance, possibly by configuring it to use more threads? Currently, when I set up a brozzler job and monitor it in the Brozzler Dashboard, it shows two sites being actively crawled. Is that an example of brozzler running two threads to crawl the site?

Maybe there's a writeup somewhere explaining optimal ways to use brozzler on a local machine?

I'd greatly appreciate any insights. Sorry to post this here; I'm not sure how else to get in touch with people on this project.

Thank you.

nlevitt commented 4 years ago

I'm running a copy of the website on my local machine, so the site is not being served from its public domain. Is there a way to get brozzler to replace my localhost domain with the actual public domain?

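Neither brozzler nor warcprox has that functionality built in. But it sounds doable with /etc/hosts: map the public domain to your local machine's address, then crawl the site under its public hostname.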
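As a rough illustration of the /etc/hosts idea, here is a minimal sketch that builds a hosts-file line and checks whether a hostname is already mapped. The hostname `www.example.org` and the helper names `hosts_entry` and `is_mapped` are hypothetical, not part of brozzler or warcprox. It only produces text; adding the line to /etc/hosts is a manual step that needs root.

```python
import ipaddress


def hosts_entry(ip: str, hostnames: list[str]) -> str:
    """Build one hosts-file line: an address followed by its hostnames."""
    ipaddress.ip_address(ip)  # raises ValueError if the address is malformed
    if not hostnames:
        raise ValueError("at least one hostname is required")
    return f"{ip}\t{' '.join(hostnames)}"


def is_mapped(hosts_text: str, hostname: str) -> bool:
    """Return True if hostname appears on a non-comment line of hosts_text."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if len(fields) >= 2 and hostname in fields[1:]:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical public domain pointed at the local machine.
    print(hosts_entry("127.0.0.1", ["www.example.org", "example.org"]))
```

The mapping presumably has to be in place on the machine that actually fetches the pages (with brozzler-easy, that should be the same machine), and the local web server has to answer requests for the public hostname. HTTPS URLs would likely still fail certificate checks unless the local server has a matching certificate.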

Another question: is there any way to boost performance, possibly by configuring it to use more threads?

You can configure the number of browsers running simultaneously with the -n, --max-browsers option. But only one browser at a time will work on a single site. You might need to reorganize your crawl if you want more parallelization (depending on what you're doing).
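One way to read "reorganize your crawl" is to split a large site into several seeds so that more than one browser has a site to work on. The sketch below is only an illustration under assumptions: it assumes each seed in the job configuration is treated as its own site and scoped to its own path (worth checking against job-conf.rst), that the option is passed to `brozzler-easy`, and that the section URLs shown exist. The helper names `build_job_conf` and `crawl_command` are hypothetical.

```python
import json
import shlex


def build_job_conf(job_id: str, seed_urls: list[str]) -> str:
    """Render a minimal job configuration with one seed per URL.

    Values are written as JSON strings, which are also valid YAML scalars.
    """
    if not seed_urls:
        raise ValueError("at least one seed URL is required")
    lines = [f"id: {json.dumps(job_id)}", "seeds:"]
    for url in seed_urls:
        lines.append(f"  - url: {json.dumps(url)}")
    return "\n".join(lines) + "\n"


def crawl_command(max_browsers: int, program: str = "brozzler-easy") -> list[str]:
    """Build an argv list that sets the number of simultaneous browsers."""
    if max_browsers < 1:
        raise ValueError("max_browsers must be at least 1")
    return [program, "--max-browsers", str(max_browsers)]


if __name__ == "__main__":
    # Hypothetical sections of one large site, one seed each.
    seeds = [
        "http://www.example.org/blog/",
        "http://www.example.org/docs/",
        "http://www.example.org/forum/",
    ]
    print(build_job_conf("local-archive", seeds))
    print(shlex.join(crawl_command(len(seeds))))
```

Setting the browser count higher than the number of seeds would probably not help, since only one browser works on a given site at a time.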

Maybe there's a writeup somewhere explaining optimal ways to use brozzler on a local machine?

I'm afraid not, as far as I'm aware :-\