Automated scraping of markup+CSS from a list of relevant URLs, using a variety of user-agent strings. Provides reporting on usage of CSS properties and on apparent user-agent sniffing.

This pull request makes the following changes:
Removed `scrapy.cfg`. I verified that running scrapy via the management command works just fine without it, so it doesn't seem like keeping it around does us any good.
Added docstrings to a number of modules and classes. This might be confusing, since I know we'd talked earlier about using regular comment syntax (`#`) for comments rather than multi-line string literals. But docstrings are different: every module, class, and function should have one describing what it does, and they should follow the pattern described in PEP 257 (http://www.python.org/dev/peps/pep-0257/). These aren't just comments; they are actually part of Python. They can be used to autogenerate API documentation, and they are available via Python's built-in `help()` system.
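As a quick illustration of the PEP 257 shape (the names here are made up for the example, not taken from this codebase):

```python
"""Utilities for recording scan results.

The one-line summary comes first, followed by a blank line and
any further detail, per PEP 257.
"""


class ScanRecorder:
    """Persist the results of a single scan."""

    def record(self, url, markup):
        """Store the given markup for url and return the saved record."""
```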
Moved the `timestamp` field from `SiteScan` to `URLScan`, as I think it'll be useful to have a timestamp for each individual URL scanned. Since the main entry-point URL itself will have a `URLScan` object, I don't think we need to duplicate the timestamp in both tables (though we certainly could).
Removed `db_index=True` from all `ForeignKey`s and all fields with `unique=True`; those fields are automatically indexed, so `db_index=True` is redundant for them.
Added `max_length=500` to all the `FileField`s, as I was getting errors due to filenames being too long and getting truncated.
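Put together, a minimal sketch of what the affected models end up looking like (the `site_url` and `page_source` fields are illustrative, and `on_delete` is written for current Django, which requires it):

```python
from django.db import models


class SiteScan(models.Model):
    """One scan of a site, covering its entry-point URL and linked URLs."""
    # No timestamp here anymore; each URLScan carries its own, and the
    # entry-point URL gets a URLScan of its own.
    site_url = models.URLField(unique=True)  # unique=True already creates an index


class URLScan(models.Model):
    """One individual URL scanned as part of a SiteScan."""
    site_scan = models.ForeignKey(SiteScan, on_delete=models.CASCADE)  # FKs are auto-indexed
    timestamp = models.DateTimeField(auto_now_add=True)
    page_source = models.FileField(upload_to="scans/", max_length=500)
```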
Used the model class itself, rather than its string name, as the argument to `ForeignKey`s (i.e. `ForeignKey(Batch)` instead of `ForeignKey("Batch")`). This makes Django do less lookup magic behind the scenes. Really the only time the string version is needed is when you have circular `ForeignKey` references, so the class itself can't be available both ways.
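For the record, here is the circular case where the string form is still required, using hypothetical models (not ones in this project):

```python
from django.db import models


class Author(models.Model):
    # Editor isn't defined yet, so only the string form works here.
    editor = models.ForeignKey("Editor", on_delete=models.SET_NULL, null=True)


class Editor(models.Model):
    # Author is already defined above, so the class itself can be used.
    favorite_author = models.ForeignKey(Author, on_delete=models.SET_NULL, null=True)
```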
Removed a couple unused imports.
I didn't end up removing the `CrawlList` stuff at this point, because I realized that would break the way the scraper currently works; but once we implement taking URLs from the command line, I think we should just do that. In the long run I also think the `scraper` management command should get smarter and expose only the options that are relevant to our needs (i.e. basically just take a single argument, the file of URLs to scrape), and then call scrapy with the appropriate arguments, rather than exposing the full scrapy command-line UI.
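To make that concrete, one way the slimmed-down command could look. This is a sketch against current Django and Scrapy APIs, not current code; the module path, spider name, and settings setup are all assumptions:

```python
# scanner/management/commands/scrape.py -- hypothetical future shape
from django.core.management.base import BaseCommand
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class Command(BaseCommand):
    help = "Scrape every URL listed (one per line) in the given file."

    def add_arguments(self, parser):
        parser.add_argument("urls_file")

    def handle(self, *args, **options):
        with open(options["urls_file"]) as f:
            urls = [line.strip() for line in f if line.strip()]
        # Assumes SCRAPY_SETTINGS_MODULE is set in the environment,
        # since scrapy.cfg is gone.
        process = CrawlerProcess(get_project_settings())
        # "markup_spider" stands in for whatever our spider is named.
        process.crawl("markup_spider", start_urls=urls)
        process.start()
```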