Open temoto opened 12 years ago
I think step #1, as you identified, is to do a crawl and gather some data on the frequency of these rules. My guess is, your intuition is right and we'll find that most rules are, in fact, very similar. What those rules are will also dictate the solution.
If in 90% of the cases the rules are a simple allow/disallow all, then just taking care of that would give you a major win. Beyond that, there are all kinds of compression tricks you could apply towards building a compact representation of the rules. But once again, gather data before making any of these design decisions. :)
Preliminary research based on 36,000 robots.txt files from random websites in the Alexa top 1M produced the following unsatisfying results: 11.2% of robots.txt files allow all, 1.5% disallow all. A few observations (e.g. 20% of hosts are served by WordPress) lead to the conclusion that the analyzed data is biased and not a representative subset of the whole Web. I will conduct further research involving more hosts later.
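For reference, the bucketing used in the crawl above could look roughly like this. `classify` is a hypothetical helper, not part of robotstxt.go, and it is a rough heuristic rather than a spec-complete parser:

```go
package main

import (
	"fmt"
	"strings"
)

// classify buckets a robots.txt body into the three categories counted
// above: "allow-all" (no effective restrictions), "disallow-all"
// (Disallow: /), or "other" (a non-trivial rule set).
func classify(body string) string {
	nonTrivial := false
	for _, line := range strings.Split(body, "\n") {
		// Strip comments and surrounding whitespace.
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i]
		}
		key, val, ok := strings.Cut(strings.TrimSpace(line), ":")
		if !ok {
			continue
		}
		val = strings.TrimSpace(val)
		switch strings.ToLower(strings.TrimSpace(key)) {
		case "disallow":
			if val == "/" {
				return "disallow-all"
			}
			if val != "" {
				nonTrivial = true
			}
		case "allow":
			if val != "" {
				nonTrivial = true
			}
		}
	}
	if nonTrivial {
		return "other"
	}
	return "allow-all"
}

func main() {
	fmt.Println(classify("User-agent: *\nDisallow:"))        // allow-all
	fmt.Println(classify("User-agent: *\nDisallow: /"))      // disallow-all
	fmt.Println(classify("User-agent: *\nDisallow: /admin")) // other
}
```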
Currently FromResponseBytes returns a singleton RobotsData of allow-all for a 404 status code and disallow-all for 401/403 status codes. For any other input, a unique RobotsData is created even when instances could share some or all of their rules. Sharing all rules is equivalent to having another singleton RobotsData.
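A minimal sketch of that status-code shortcut, assuming simplified stand-in types (the names RobotsData, allowAll, disallowAll, and fromStatus are all hypothetical, not the library's actual API):

```go
package main

import "fmt"

// RobotsData is a simplified stand-in for the library's parsed-rules type.
type RobotsData struct {
	allowAll    bool
	disallowAll bool
}

// Package-level singletons: every 404 response maps to the same allow-all
// value and every 401/403 response to the same disallow-all value, so these
// very common cases cost no per-host allocation.
var (
	allowAll    = &RobotsData{allowAll: true}
	disallowAll = &RobotsData{disallowAll: true}
)

// fromStatus returns (data, true) when the status code alone decides the
// result; (nil, false) means the response body still has to be parsed.
func fromStatus(code int) (*RobotsData, bool) {
	switch {
	case code == 404:
		return allowAll, true
	case code == 401 || code == 403:
		return disallowAll, true
	default:
		return nil, false
	}
}

func main() {
	a, _ := fromStatus(404)
	b, _ := fromStatus(404)
	fmt.Println(a == b) // true: both are the same singleton
}
```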
Plan:
If the hypothesis is confirmed, a normalization technique could be applied to reduce the memory footprint and improve the cache locality of real-world web crawlers using the robotstxt.go library.
Possible normalizations:
These two techniques do not even conflict: post-processing parsed rules seems a worthwhile optimisation anyway, and exporting a unique value could additionally allow caching non-trivial but still popular rule sets.
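Caching non-trivial but popular rule sets could be done by interning: normalize the parsed rules into a canonical key and return a shared value for any rule set seen before. A sketch under simplified stand-in types (Rule, RobotsData, normalize, and intern are all hypothetical names, not the library's API):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Rule and RobotsData are simplified stand-ins for the library's types.
type Rule struct {
	Agent, Path string
	Allow       bool
}
type RobotsData struct{ Rules []Rule }

// normalize serializes rules in a canonical sorted order, so textually
// different robots.txt files with identical semantics get the same key.
func normalize(rules []Rule) string {
	keys := make([]string, len(rules))
	for i, r := range rules {
		keys[i] = fmt.Sprintf("%s|%s|%t", r.Agent, r.Path, r.Allow)
	}
	sort.Strings(keys)
	return strings.Join(keys, "\n")
}

var (
	mu    sync.Mutex
	cache = map[string]*RobotsData{}
)

// intern returns a shared *RobotsData for any rule set seen before, so
// thousands of hosts carrying identical rules share one allocation.
func intern(rules []Rule) *RobotsData {
	key := normalize(rules)
	mu.Lock()
	defer mu.Unlock()
	if d, ok := cache[key]; ok {
		return d
	}
	d := &RobotsData{Rules: rules}
	cache[key] = d
	return d
}

func main() {
	a := intern([]Rule{{"*", "/", false}})
	b := intern([]Rule{{"*", "/", false}})
	fmt.Println(a == b, len(cache)) // true 1
}
```

An unbounded map is fine for a sketch; a real crawler would likely want an eviction policy or a size cap on the cache.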
Even in the unlikely event that the distribution of rule sets is closer to uniform, the distribution of individual rules must exhibit large spikes around agent=* and url=/. For that case, the library can return singleton popular rules. Now that I think of it, maintaining a few extremely popular individual Rule singletons could be a worthwhile optimisation on its own. TODO: benchmark.
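The per-Rule singleton idea could be sketched as below, again with hypothetical names (Rule, internRule, ruleAnyAgentRoot) standing in for whatever the parser would actually use. Whether the extra branch beats the saved allocation is exactly what the TODO benchmark should measure:

```go
package main

import "fmt"

// Rule is a simplified stand-in for the library's per-rule type.
type Rule struct{ Agent, Path string }

// The "agent=*, url=/" spike gets a single shared value.
var ruleAnyAgentRoot = &Rule{Agent: "*", Path: "/"}

// internRule is a hypothetical parser hook: match the hottest rules first
// and fall back to a fresh allocation for everything else.
func internRule(agent, path string) *Rule {
	if agent == "*" && path == "/" {
		return ruleAnyAgentRoot
	}
	return &Rule{Agent: agent, Path: path}
}

func main() {
	a := internRule("*", "/")
	b := internRule("*", "/")
	fmt.Println(a == b) // true: both point at the shared singleton
}
```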