internetarchive / brozzler

brozzler - distributed browser-based web crawler
Apache License 2.0

how does worker pick a site after crash? #231

Open mishranitin2003 opened 2 years ago

mishranitin2003 commented 2 years ago

Scenario: I have warcprox and a brozzler worker running on my local machine. Suppose the brozzler worker process is killed in the middle of archiving a website, for example with 'kill -9 ' or by closing the console session. After both warcprox and the brozzler worker are restarted (on the same ports as before), the site is not picked up for crawling again. The reason is that the site's claimed property in db('Brozzler').table('sites') is still true.

Query:

nlevitt commented 2 years ago

If you wait an hour, it should start crawling again. See https://github.com/internetarchive/brozzler/blob/e23fa68d6/brozzler/frontier.py#L117. If you can't wait, you could set claimed=false in rethinkdb.
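A minimal sketch of the rule described here, written as plain Python rather than brozzler's actual RethinkDB query: a site can be claimed again if it is not marked claimed, or if the existing claim is more than an hour old. The `last_claimed` field name and the `is_claimable` helper are assumptions made for illustration; only `claimed` and the one-hour window come from this thread.

```python
from datetime import datetime, timedelta, timezone

# One-hour window mentioned in this thread; after this a claim counts as stale.
STALE_CLAIM = timedelta(hours=1)


def is_claimable(site, now):
    """Return True if a worker may claim this site record.

    `site` is a plain dict standing in for a row of the sites table.
    `last_claimed` is an assumed field name, not confirmed by this thread.
    """
    if not site.get("claimed"):
        return True  # nobody holds the site
    # Claimed, but the claim is old enough that the holder is presumed dead.
    return site["last_claimed"] < now - STALE_CLAIM


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Worker was killed 10 minutes ago: the claim is still honored.
killed_recently = {"claimed": True, "last_claimed": now - timedelta(minutes=10)}

# Worker was killed 61 minutes ago: the claim has gone stale.
killed_long_ago = {"claimed": True, "last_claimed": now - timedelta(minutes=61)}

# Same recent claim, but claimed was reset to false by hand.
manually_reset = {"claimed": False, "last_claimed": now - timedelta(minutes=10)}

for name, site in [
    ("killed_recently", killed_recently),
    ("killed_long_ago", killed_long_ago),
    ("manually_reset", manually_reset),
]:
    print(name, is_claimable(site, now))
```

Under these assumptions, the first record corresponds to the situation in the report (the restarted worker skips the site), while the other two correspond to the two remedies: waiting out the hour, or setting claimed=false manually.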

mishranitin2003 commented 2 years ago

Thanks @nlevitt for the quick reply. The problem is deciding when to set claimed=false. Is there a specific reason for choosing 60 minutes, or is it arbitrary? Would it be acceptable to make the 60 minutes configurable? If so, please let me know and I can raise a PR. Also let me know whether the branch should be named after issue #231 or something else.

nlevitt commented 2 years ago

@mishranitin2003 It's not random. It has to be high enough that you will never have one worker claim a site when another is legitimately working on it. The value should not be configurable.