mozilla / spade

Automated scraping of markup and CSS from a list of relevant URLs, using a variety of user-agent strings. Provides reporting on CSS property usage and apparent user-agent sniffing.

Various tweaks #4

Closed by carljm 12 years ago

carljm commented 12 years ago

This pull request makes the following changes:

I didn't end up removing the CrawlList stuff at this point, because I realized that would break the way the scraper currently works; but once we implement taking URLs from the command line, I think we should just do that. In the long run I also think the scraper management command should get smarter and expose only the options relevant to our needs (i.e. basically just take a single argument, the file of URLs to scrape), and then call scrapy with the appropriate arguments, rather than exposing the full scrapy command-line UI.
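A minimal sketch of the kind of management command described above, written against current Django and Scrapy conventions rather than the 2012-era code; the command name, the `all` spider name, and the `url_file` spider argument are all hypothetical, not spade's actual code:

```python
import subprocess

from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Scrape every URL listed (one per line) in the given file."

    def add_arguments(self, parser):
        parser.add_argument("url_file", help="path to a file of URLs to scrape")

    def handle(self, *args, **options):
        # Build the scrapy invocation ourselves so callers never have to
        # learn scrapy's own command-line interface.
        subprocess.check_call([
            "scrapy", "crawl", "all",                   # hypothetical spider name
            "-a", "url_file=%s" % options["url_file"],  # hand the file to the spider
        ])
```

The point of the wrapper is that the project exposes exactly one option (the URL file) and owns the translation into scrapy arguments, so the full scrapy CLI never leaks into the project's interface.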

samliu commented 12 years ago

Looks great! I'll work on getting it to take the command-line arg and remove the database aspect myself :D
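As a hedged illustration of that plan (not the eventual implementation), a spider could take its start URLs from a file passed as a spider argument instead of loading them from the CrawlList table. This sketch targets a recent Scrapy API, and the spider name and argument are assumptions:

```python
import scrapy


class AllUrlsSpider(scrapy.Spider):
    """Crawl every URL listed, one per line, in the file passed via -a url_file=..."""

    name = "all"

    def __init__(self, url_file=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if url_file is None:
            raise ValueError("pass the URL file with: scrapy crawl all -a url_file=<path>")
        # Read start URLs from the file instead of from the database.
        with open(url_file) as f:
            self.start_urls = [line.strip() for line in f if line.strip()]

    def parse(self, response):
        # The real spider would collect markup and CSS here; this stub just
        # records that the page was reached.
        self.logger.info("fetched %s", response.url)
```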

Ahhh, and now I remember being told about docstrings at some point. Totally forgot; makes lots of sense. \o/