Today, we stop the crawl if the % of unique text blocks falls below 10%.
But sometimes, the problem is just that Abot is crawling a repetitive part of the website, and there are a lot of other interesting pages elsewhere on the site.
For example, while we are stuck enumerating all the stock pages in this part of the website:
http://bourse.latribune.fr/webfg/action/AIR-LIQUIDE-Euronext-Paris
http://bourse.latribune.fr/webfg/action/TELEFONICA
…
the percentage of unique text blocks will fall below 10%. But that doesn't mean there isn't a lot of interesting content in other parts of the site.
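For illustration, here is a minimal sketch of how the current global uniqueness metric could be tracked. This is not Abot code (Abot is a C# library); the class and method names (`UniquenessMonitor`, `record_page`, `should_stop`) are hypothetical.

```python
import hashlib

class UniquenessMonitor:
    """Tracks what fraction of observed text blocks are new (hypothetical sketch)."""

    def __init__(self, threshold=0.10, min_blocks=200):
        self.threshold = threshold    # stop when the unique ratio falls below 10%
        self.min_blocks = min_blocks  # don't decide before enough blocks are seen
        self.seen_hashes = set()      # hashes of every text block seen so far
        self.recent_total = 0         # blocks observed since the last reset
        self.recent_unique = 0        # of those, how many were never seen before

    def record_page(self, text_blocks):
        for block in text_blocks:
            digest = hashlib.sha1(block.strip().encode("utf-8")).hexdigest()
            self.recent_total += 1
            if digest not in self.seen_hashes:
                self.seen_hashes.add(digest)
                self.recent_unique += 1

    def should_stop(self):
        if self.recent_total < self.min_blocks:
            return False
        return (self.recent_unique / self.recent_total) < self.threshold
```

With this global rule, once the crawler wanders into a repetitive directory the ratio drops and the whole crawl stops, even if the rest of the site is fine.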
Implement a finer-grained stopping scheme where we simply stop crawling the current directory and all its subdirectories, reset the metric, and start again, until there is nothing left to crawl.
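A possible shape for that finer-grained scheme is sketched below, again in Python rather than Abot's actual C# API. The way the directory prefix is extracted and plugged into a "should crawl" decision (`directory_of`, `excluded_prefixes`, `on_page_crawled`) is an assumption about how the rule could be wired into the crawler, not existing Abot behavior.

```python
from urllib.parse import urlparse
import posixpath

class DirectoryStopper:
    """When the text in the current directory becomes too repetitive, stop
    crawling that directory and its subdirectories, reset the metric, and
    keep crawling the rest of the site (hypothetical sketch)."""

    def __init__(self, monitor_factory):
        self.monitor_factory = monitor_factory  # e.g. lambda: UniquenessMonitor()
        self.monitor = monitor_factory()
        self.excluded_prefixes = set()          # directories we have given up on

    @staticmethod
    def directory_of(url):
        # http://host/webfg/action/TELEFONICA -> http://host/webfg/action/
        parsed = urlparse(url)
        return f"{parsed.scheme}://{parsed.netloc}{posixpath.dirname(parsed.path)}/"

    def should_crawl(self, url):
        # Skip any URL under a directory that was already declared repetitive.
        return not any(url.startswith(prefix) for prefix in self.excluded_prefixes)

    def on_page_crawled(self, url, text_blocks):
        self.monitor.record_page(text_blocks)
        if self.monitor.should_stop():
            # Blacklist the directory we are currently stuck in...
            self.excluded_prefixes.add(self.directory_of(url))
            # ...and reset the metric so the rest of the site gets a fresh chance.
            self.monitor = self.monitor_factory()
```

Usage would be something like `stopper = DirectoryStopper(lambda: UniquenessMonitor())`, with the crawler calling `stopper.should_crawl(url)` before fetching a page and `stopper.on_page_crawled(url, text_blocks)` after parsing it; the crawl then ends naturally once every remaining URL falls under an excluded prefix or the frontier is empty.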