rivermont / spidy

The simple, easy to use command line web crawler.
GNU General Public License v3.0

Autosave triggered by single thread and not global. #56

Open rivermont opened 6 years ago

rivermont commented 6 years ago

Expected Behavior

All threads should pause while the crawler prints info and saves files.

Actual Behavior

Once a single thread has crawled SAVE_COUNT links, it triggers a save while the other threads keep crawling. As a result, [CRAWL] log lines appear in between the [INFO] log lines.
It seems like this is inefficient and could result in some saving errors.
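
To make the expected behavior concrete, here is a minimal sketch of a global autosave trigger. It is not spidy's actual code: `CrawlState`, `record`, `worker`, and `run` are hypothetical names, and the value of `SAVE_COUNT` is assumed for illustration. The counter is shared by all threads and guarded by one lock, so the save fires on the total number of links crawled, and any thread that tries to record a link during the save waits until it finishes.

```python
import threading

SAVE_COUNT = 5  # assumed value, for illustration only


class CrawlState:
    """Hypothetical shared state; not part of spidy."""

    def __init__(self, save_count):
        self.save_count = save_count
        self.lock = threading.Lock()
        self.since_save = 0   # links crawled by ALL threads since the last save
        self.done = []        # every crawled link
        self.saves = []       # size of `done` at each save

    def record(self, url):
        # The lock serializes counting and saving, so other threads
        # block here while a save is in progress.
        with self.lock:
            self.done.append(url)
            self.since_save += 1
            if self.since_save >= self.save_count:
                self._save()
                self.since_save = 0

    def _save(self):
        # Stand-in for writing files to disk; called with the lock held.
        self.saves.append(len(self.done))


def worker(state, urls):
    for url in urls:
        state.record(url)


def run(num_threads=4, links_per_thread=10):
    state = CrawlState(SAVE_COUNT)
    threads = [
        threading.Thread(
            target=worker,
            args=(state, [f"t{i}-{n}" for n in range(links_per_thread)]),
        )
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state
```

With this layout a save happens after every SAVE_COUNT links overall, regardless of which thread crawled them. The trade-off is that crawling stalls for the duration of each save.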

Steps to Reproduce the Problem

  1. Run the crawler.
  2. Wait for the autosave cap to be hit.

Hrily commented 4 years ago

It seems like this is inefficient and could result in some saving errors.

@rivermont Can you please elaborate on this?

I would like to understand all the cases where this will result in errors.

Hrily commented 4 years ago

@rivermont

I found a few errors that occur while autosaving and opened a PR to fix them.

Also, I couldn't find a way to fix the logging with a minimal change. The logging logic may need to be revamped so that crawl logging is paused while saving.
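
One possible shape for such a revamp, as a rough sketch only: route every log line through a single re-entrant lock and have the save routine hold that lock for its whole duration. The names `write_log`, `save_all`, `crawl`, and `demo` are hypothetical and do not correspond to spidy's logging functions; the log is kept in a list here instead of a file.

```python
import threading

log_lock = threading.RLock()  # re-entrant so the saver can log while holding it
log_lines = []                # stand-in for the real log output


def write_log(tag, message):
    with log_lock:
        log_lines.append(f"[{tag}] {message}")


def save_all():
    # Holding the lock for the whole save keeps [CRAWL] lines from
    # landing between the [INFO] lines.
    with log_lock:
        write_log("INFO", "Saving files...")
        # ... the actual file writes would go here ...
        write_log("INFO", "Saved.")


def crawl(n):
    for i in range(n):
        write_log("CRAWL", f"link {i}")


def demo(num_crawlers=3, links_each=20, saves=3):
    log_lines.clear()
    threads = [
        threading.Thread(target=crawl, args=(links_each,))
        for _ in range(num_crawlers)
    ]
    threads += [threading.Thread(target=save_all) for _ in range(saves)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(log_lines)
```

Crawling threads only block when they try to log during a save, so this pauses crawl logging without stopping the crawl itself. Whether that is enough depends on whether the saved data structures are also protected.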