Open temoto opened 12 years ago
I think step #1, as you identified, is to do a crawl and gather some data on the frequency of these rules. My guess is, your intuition is right and we'll find that most rules are, in fact, very similar. What those rules are will also dictate the solution.
If in 90% of the cases the rules are a simple allow/disallow all, then just taking care of that would give you a major win. Beyond that, there are all kinds of compression tricks you could apply towards building a compact representation of the rules. But once again, gather data before making any of these design decisions. :)
Preliminary research based on 36,000 robots.txt files from random websites in the Alexa top 1M produced the following unsatisfying results: 11.2% of robots.txt files allow all, 1.5% disallow all. A few observations (e.g. 20% of hosts are served by WordPress) lead to the conclusion that the analyzed data is biased and not a representative subset of the whole Web. I will conduct further research involving more hosts later.
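For reference, the bucketing used in the crawl above could look roughly like this. `classify` is a hypothetical helper, not part of robotstxt.go, and it is a rough heuristic rather than a spec-complete parser:

```go
package main

import (
	"fmt"
	"strings"
)

// classify buckets a robots.txt body into the three categories counted
// above: "allow-all" (no effective restrictions), "disallow-all"
// (Disallow: /), or "other" (a non-trivial rule set).
func classify(body string) string {
	nonTrivial := false
	for _, line := range strings.Split(body, "\n") {
		// Strip comments and surrounding whitespace.
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i]
		}
		key, val, ok := strings.Cut(strings.TrimSpace(line), ":")
		if !ok {
			continue
		}
		val = strings.TrimSpace(val)
		switch strings.ToLower(strings.TrimSpace(key)) {
		case "disallow":
			if val == "/" {
				return "disallow-all"
			}
			if val != "" {
				nonTrivial = true
			}
		case "allow":
			if val != "" {
				nonTrivial = true
			}
		}
	}
	if nonTrivial {
		return "other"
	}
	return "allow-all"
}

func main() {
	fmt.Println(classify("User-agent: *\nDisallow:"))        // allow-all
	fmt.Println(classify("User-agent: *\nDisallow: /"))      // disallow-all
	fmt.Println(classify("User-agent: *\nDisallow: /admin")) // other
}
```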
Currently FromResponseBytes returns a singleton RobotsData of allow-all for a 404 status code and disallow-all for 401/403 status codes. For any other input, a unique RobotsData is created even when instances could share some or all of their rules. Sharing all rules is equivalent to having another singleton RobotsData.
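A minimal sketch of that status-code shortcut, assuming simplified stand-in types (the names RobotsData, allowAll, disallowAll, and fromStatus are all hypothetical, not the library's actual API):

```go
package main

import "fmt"

// RobotsData is a simplified stand-in for the library's parsed-rules type.
type RobotsData struct {
	allowAll    bool
	disallowAll bool
}

// Package-level singletons: every 404 response maps to the same allow-all
// value and every 401/403 response to the same disallow-all value, so these
// very common cases cost no per-host allocation.
var (
	allowAll    = &RobotsData{allowAll: true}
	disallowAll = &RobotsData{disallowAll: true}
)

// fromStatus returns (data, true) when the status code alone decides the
// result; (nil, false) means the response body still has to be parsed.
func fromStatus(code int) (*RobotsData, bool) {
	switch {
	case code == 404:
		return allowAll, true
	case code == 401 || code == 403:
		return disallowAll, true
	default:
		return nil, false
	}
}

func main() {
	a, _ := fromStatus(404)
	b, _ := fromStatus(404)
	fmt.Println(a == b) // true: both are the same singleton
}
```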
Plan:
If the hypothesis is confirmed, a normalization technique could be applied to reduce the memory footprint and improve the cache locality of real-world web crawlers using the robotstxt.go library.
Possible normalizations:
These two techniques do not even conflict: post-processing parsed rules seems a worthwhile optimisation anyway, and exporting a unique value could additionally allow caching non-trivial but still popular rule sets.
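Caching non-trivial but popular rule sets could be done by interning: normalize the parsed rules into a canonical key and return a shared value for any rule set seen before. A sketch under simplified stand-in types (Rule, RobotsData, normalize, and intern are all hypothetical names, not the library's API):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Rule and RobotsData are simplified stand-ins for the library's types.
type Rule struct {
	Agent, Path string
	Allow       bool
}
type RobotsData struct{ Rules []Rule }

// normalize serializes rules in a canonical sorted order, so textually
// different robots.txt files with identical semantics get the same key.
func normalize(rules []Rule) string {
	keys := make([]string, len(rules))
	for i, r := range rules {
		keys[i] = fmt.Sprintf("%s|%s|%t", r.Agent, r.Path, r.Allow)
	}
	sort.Strings(keys)
	return strings.Join(keys, "\n")
}

var (
	mu    sync.Mutex
	cache = map[string]*RobotsData{}
)

// intern returns a shared *RobotsData for any rule set seen before, so
// thousands of hosts carrying identical rules share one allocation.
func intern(rules []Rule) *RobotsData {
	key := normalize(rules)
	mu.Lock()
	defer mu.Unlock()
	if d, ok := cache[key]; ok {
		return d
	}
	d := &RobotsData{Rules: rules}
	cache[key] = d
	return d
}

func main() {
	a := intern([]Rule{{"*", "/", false}})
	b := intern([]Rule{{"*", "/", false}})
	fmt.Println(a == b, len(cache)) // true 1
}
```

An unbounded map is fine for a sketch; a real crawler would likely want an eviction policy or a size cap on the cache.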
Even in the unlikely event that the distribution of rule sets is closer to uniform, the distribution of individual rules must exhibit large spikes around agent=* and url=/. For that case, the library can return singleton popular rules. Now that I think of it, maintaining a few extremely popular individual Rule singletons could be a worthwhile optimisation on its own. TODO: benchmark.
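The per-Rule singleton idea could be sketched as below, again with hypothetical names (Rule, internRule, ruleAnyAgentRoot) standing in for whatever the parser would actually use. Whether the extra branch beats the saved allocation is exactly what the TODO benchmark should measure:

```go
package main

import "fmt"

// Rule is a simplified stand-in for the library's per-rule type.
type Rule struct{ Agent, Path string }

// The "agent=*, url=/" spike gets a single shared value.
var ruleAnyAgentRoot = &Rule{Agent: "*", Path: "/"}

// internRule is a hypothetical parser hook: match the hottest rules first
// and fall back to a fresh allocation for everything else.
func internRule(agent, path string) *Rule {
	if agent == "*" && path == "/" {
		return ruleAnyAgentRoot
	}
	return &Rule{Agent: agent, Path: path}
}

func main() {
	a := internRule("*", "/")
	b := internRule("*", "/")
	fmt.Println(a == b) // true: both point at the shared singleton
}
```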