mozilla / spade

Automated scraping of markup and CSS from a list of relevant URLs, using a variety of user-agent strings. Provides reporting on usage of CSS properties and apparent user-agent sniffing.

Make the scan process fail-safe #51

Open · maurodoglio opened this issue 11 years ago

maurodoglio commented 11 years ago

If an error happens in either the scraping or the aggregation, the database should be rolled back to the last known good state, an error logged, and the system should keep executing.
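A minimal sketch of that "roll back, log, keep going" behaviour, assuming a Django setup where `transaction.atomic` is available and a hypothetical per-URL `scrape_url()` helper; the real scan code may be structured quite differently:

```python
import logging

from django.db import transaction

logger = logging.getLogger(__name__)


def scan(urls):
    for url in urls:
        try:
            # Anything written inside this block is rolled back if an
            # exception escapes, leaving the DB in its last known good state.
            with transaction.atomic():
                scrape_url(url)  # hypothetical per-URL scrape/aggregate step
        except Exception:
            # Log the failure and continue with the next URL instead of
            # aborting the whole scan.
            logger.exception("scan failed for %s, rolled back", url)
```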

mihneadb commented 11 years ago

@maurodoglio Would setting autocommit to off and manually committing the changes to the db after a successful scan make sense?
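A rough sketch of that idea using Django's low-level transaction API (`set_autocommit`/`commit`/`rollback`); `run_scan()` here is just a stand-in for the actual scrape/aggregate code:

```python
from django.db import transaction


def scan_with_manual_commit(urls):
    transaction.set_autocommit(False)
    try:
        run_scan(urls)          # hypothetical: performs all the DB writes
        transaction.commit()    # persist only once the whole scan succeeded
    except Exception:
        transaction.rollback()  # discard partial results from a failed scan
        raise
    finally:
        transaction.set_autocommit(True)
```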

maurodoglio commented 11 years ago

Yes, that would be a first step, but I would also add a few fields to the db models to keep track of the status of the process. I'd like to split the scrape->parse->aggregate flow so that it can scale better and we can re-run a single task in case of failure. This is closely related to issue #49.
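To make the status-tracking idea concrete, one possible shape for such a model is sketched below; the model name, field names, and choices are illustrative only and not the actual spade models. Each stage (scrape, parse, aggregate) would update the status field, so a single failed stage can be identified and re-run on its own.

```python
from django.db import models


class ScanRun(models.Model):
    """Illustrative record of one scan run and how far it progressed."""

    STATUS_CHOICES = (
        ("pending", "Pending"),
        ("scraped", "Scraped"),
        ("parsed", "Parsed"),
        ("aggregated", "Aggregated"),
        ("failed", "Failed"),
    )

    started = models.DateTimeField(auto_now_add=True)
    status = models.CharField(max_length=16, choices=STATUS_CHOICES,
                              default="pending")
    error = models.TextField(blank=True)  # traceback or message on failure
```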