Open mishranitin2003 opened 3 years ago
If you wait an hour, it should start crawling again. See https://github.com/internetarchive/brozzler/blob/e23fa68d6/brozzler/frontier.py#L117. If you can't wait, you could set claimed=false in rethinkdb.
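For anyone landing here later, a minimal sketch of that manual reset using the RethinkDB Python driver (`rethinkdb` 2.4+). The helper names `release_claim_patch` and `release_site`, the host/port defaults, and the idea of looking the site up by its primary key are my assumptions, not part of brozzler. Pass whatever database name your deployment uses. Only run this when you are sure no worker is still crawling the site.

```python
def release_claim_patch():
    """Fields to write on a site document so a worker can claim it again."""
    return {"claimed": False}


def release_site(site_id, db, host="localhost", port=28015):
    """Set claimed=false on one document in the 'sites' table.

    Hypothetical helper: site_id is assumed to be the document's primary key,
    and db is the name of the database that holds the 'sites' table.
    """
    # Imported lazily so the module still loads where the driver is not installed.
    from rethinkdb import RethinkDB

    r = RethinkDB()
    conn = r.connect(host=host, port=port)
    try:
        # Returns RethinkDB's usual write summary for the update.
        return (
            r.db(db)
            .table("sites")
            .get(site_id)
            .update(release_claim_patch())
            .run(conn)
        )
    finally:
        conn.close()
```

Usage would look something like `release_site("<site id>", db="<your db name>")`, with both values taken from your own RethinkDB instance.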
Thanks @nlevitt for your quick reply. The problem is deciding when to set claimed=false. Is there a specific reason for choosing 60 minutes, or is it arbitrary? Would it be acceptable to make this 60-minute timeout configurable? If so, please let me know and I can raise a PR. Should the branch be named against issue #231 or something else?
@mishranitin2003 It's not random. It has to be high enough that you will never have one worker claim a site when another is legitimately working on it. The value should not be configurable.
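To make the reasoning concrete, here is a plain-Python paraphrase of the rule as described in this thread: a site is up for grabs if nobody has claimed it, or if the claim is more than an hour old. This is only an illustration, not the actual query in frontier.py. The `is_claimable` function and the `last_claimed` field name are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# The one-hour window discussed above. It must be longer than any period a
# healthy worker could hold a site without renewing its claim.
CLAIM_TIMEOUT = timedelta(hours=1)


def is_claimable(site, now):
    """Return True if a worker may claim this site document.

    `site` is a dict with a boolean 'claimed' and, when claimed, a
    timezone-aware datetime 'last_claimed' (assumed field name).
    """
    if not site.get("claimed"):
        return True
    # A claim older than the timeout is treated as abandoned, e.g. because
    # the worker holding it was killed without releasing the site.
    return site["last_claimed"] < now - CLAIM_TIMEOUT


now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = {"claimed": True, "last_claimed": now - timedelta(minutes=30)}
stale = {"claimed": True, "last_claimed": now - timedelta(minutes=61)}
print(is_claimable(fresh, now), is_claimable(stale, now))
```

With a shorter timeout, the `fresh` case could flip to claimable while its original worker is still busy, which is the double-claim situation the one-hour value is meant to rule out.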
Scenario: I have warcprox and a brozzler worker running on my local machine. In the middle of archiving a website, the brozzler worker process is killed, for example with 'kill -9' or by closing the console session.
After both the warcprox and brozzler worker instances are restarted (on the same ports as before), the site is not picked up for crawling. This is because the site's claimed property in db('Brozzler').table('sites') is still true.
Query: