mozilla / spade

Automated scraping of markup+CSS from a list of relevant URLs, using a variety of user-agent strings. Provides reporting on usage of CSS properties and apparent user-agent sniffing.

Improve system scalability and performance #49

Open maurodoglio opened 11 years ago

maurodoglio commented 11 years ago

The system needs to be able to run through 1,000 sites in 4-8 hours, covering both scraping and analysis.
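To make the target concrete, here is a rough back-of-the-envelope sketch of the per-site time budget it implies. The function name `seconds_per_site` and the `workers` parameter are illustrative assumptions, not part of spade; the only figures taken from this issue are the 1,000 sites and the 4-8 hour window.

```python
def seconds_per_site(total_sites: int, window_hours: float, workers: int = 1) -> float:
    """Average wall-clock seconds available per site (scrape + analysis).

    `workers` is a hypothetical count of scanners running in parallel;
    with workers=1 this is the strictly sequential budget.
    """
    if total_sites <= 0 or workers <= 0:
        raise ValueError("total_sites and workers must be positive")
    return window_hours * 3600 * workers / total_sites


if __name__ == "__main__":
    # Budget at both ends of the 4-8 hour window for 1,000 sites.
    for hours in (4, 8):
        budget = seconds_per_site(1000, hours)
        print(f"{hours} h window: {budget:.1f} s per site")
```

With a single sequential worker this comes to roughly 14 to 29 seconds per site, and that budget has to cover every user-agent string used for the site as well as the analysis step.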

maurodoglio commented 11 years ago

If we can't scan fast enough, we should make it possible to scale scanning and aggregation out across multiple machines running in parallel, all writing into (or replicating their databases into) a central database that drives the UI.
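A minimal sketch of that shape, under loud assumptions: threads stand in for separate machines, an in-memory SQLite database stands in for the central database, and `scan_site`, `run_batch`, and the `results` table are hypothetical names that do not exist in spade.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def scan_site(url: str) -> tuple[str, int]:
    """Hypothetical stand-in for scrape + analysis of one site.

    Returns (url, placeholder metric); a real worker would fetch and parse.
    """
    return url, len(url)


def run_batch(urls, db: sqlite3.Connection, max_workers: int = 4) -> int:
    """Scan URLs in parallel and store every result in one central table."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS results (url TEXT PRIMARY KEY, metric INTEGER)"
    )
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Workers only scan; all writes happen here, on the calling thread.
        for url, metric in pool.map(scan_site, urls):
            db.execute(
                "INSERT OR REPLACE INTO results (url, metric) VALUES (?, ?)",
                (url, metric),
            )
    db.commit()
    return db.execute("SELECT COUNT(*) FROM results").fetchone()[0]


if __name__ == "__main__":
    central_db = sqlite3.connect(":memory:")
    batch = [f"http://example.com/{i}" for i in range(10)]
    print(run_batch(batch, central_db), "rows stored")
```

Keeping the writes in one place avoids concurrent-write contention in this toy version; with real machines the same role would fall to the central database server or to replication.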

maurodoglio commented 11 years ago

A starting point could be the introduction of Celery. We could split the scan -> parse -> aggregate flow into asynchronous tasks, in order to