mozilla / spade

Automated scraping of markup and CSS from a list of relevant URLs, using a variety of user-agent strings. Provides reporting on usage of CSS properties and apparent user-agent sniffing.

Debugging spider #7

Closed samliu closed 12 years ago

samliu commented 12 years ago

So now it crawls from a text file, and bad parsing of CSS/HTML/JS doesn't trip it up. I ran it for about 20 minutes and it was doing well. I also made it store flat files using at most 100 characters from the URL, rather than the full URL, because I was getting disk errors about filenames being too long.
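The filename truncation could look something like the following sketch. This is not the actual spade code; the helper name, the 100-character cap constant, and the hash suffix for keeping truncated names unique are all assumptions for illustration:

```python
import hashlib
import re

MAX_NAME_LEN = 100  # hypothetical cap matching the 100-character limit above


def url_to_filename(url):
    """Turn a URL into a safe flat filename, truncated so it stays
    under OS filename-length limits (commonly 255 bytes)."""
    # Replace characters that are unsafe or meaningful in paths.
    safe = re.sub(r'[^A-Za-z0-9._-]', '_', url)
    if len(safe) <= MAX_NAME_LEN:
        return safe
    # Append a short hash of the full URL so two long URLs that share
    # a prefix don't truncate to the same filename.
    digest = hashlib.md5(url.encode('utf-8')).hexdigest()[:8]
    return safe[:MAX_NAME_LEN - 9] + '-' + digest
```

Truncating alone would collide for long URLs sharing a 100-character prefix, hence the hash suffix in this sketch.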

I left the code for parsing JS and saving it because it works -- we can remove it anytime pretty easily. I'm only wondering, since we have this functionality, whether it's considered valuable data; if in the future we want the JS, we have it now.

Also added docstrings to the spider.