laurentprudhon / nlptextdoc

Suite of tools to extract and annotate language resources for NLP applications

Implement a fine-grained "% of unique text blocks" stopping scheme #25

Closed laurentprudhon closed 5 years ago

laurentprudhon commented 5 years ago

Today, we stop the crawl if the % of unique text blocks falls below 10%.
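For reference, here is a minimal sketch of how such a metric can be maintained. The `TextBlockUniquenessMetric` class and its member names are illustrative assumptions, not the actual nlptextdoc code:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: tracks how many distinct text blocks have been
// seen among all blocks extracted so far during the crawl.
public class TextBlockUniquenessMetric
{
    private readonly HashSet<string> seenBlockHashes = new HashSet<string>();

    public long TotalBlocks { get; private set; }
    public long UniqueBlocks { get; private set; }

    // Register one extracted text block; returns true if it was never seen before.
    public bool RegisterBlock(string textBlock)
    {
        TotalBlocks++;
        bool isNew = seenBlockHashes.Add(ComputeHash(textBlock));
        if (isNew) UniqueBlocks++;
        return isNew;
    }

    // Percentage of unique blocks among all blocks seen so far.
    public double PercentUnique =>
        TotalBlocks == 0 ? 100.0 : 100.0 * UniqueBlocks / TotalBlocks;

    // Hash the blocks instead of storing their full text to keep memory bounded.
    private static string ComputeHash(string text)
    {
        using (var md5 = MD5.Create())
        {
            return Convert.ToBase64String(
                md5.ComputeHash(Encoding.UTF8.GetBytes(text)));
        }
    }
}
```

With such a metric, the whole crawl is aborted as soon as `PercentUnique` drops below 10.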

But sometimes, the problem is just that Abot is crawling a repetitive part of the website, and there are a lot of other interesting pages elsewhere on the site.

For example, while we are stuck enumerating all the stock ("action") pages in this part of the website:

http://bourse.latribune.fr/webfg/action/AIR-LIQUIDE-Euronext-Paris
http://bourse.latribune.fr/webfg/action/TELEFONICA
…

the % of unique text blocks will fall below 10%. But that doesn't mean there isn't a whole lot of interesting content in other parts of the site.

Implement a finer-grained stopping scheme: when the metric falls below the threshold, stop crawling only the current directory and all its subdirectories, reset the metric, and resume crawling the rest of the site, repeating until there is nothing more to crawl. A sketch of one possible implementation follows.
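A minimal sketch of one way this per-directory scheme could work, reusing the hypothetical `TextBlockUniquenessMetric` above. The `DirectoryStoppingScheme` class, the thresholds, and the hook names are all assumptions for illustration, not the project's actual API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-directory stopping scheme: track uniqueness separately
// for each URL directory and exclude a whole subtree once it turns repetitive.
public class DirectoryStoppingScheme
{
    private const double MinPercentUnique = 10.0;
    private const int MinBlocksBeforeDecision = 100; // assumed warm-up threshold

    private readonly Dictionary<string, TextBlockUniquenessMetric> metricsByDirectory
        = new Dictionary<string, TextBlockUniquenessMetric>();
    private readonly HashSet<string> excludedDirectories = new HashSet<string>();

    // Skip any URL under a directory that was already declared repetitive.
    public bool ShouldCrawl(Uri url)
    {
        string dir = GetDirectory(url);
        return !excludedDirectories.Any(excluded => dir.StartsWith(excluded));
    }

    // Called after each crawled page: update the metric of the page's
    // directory and, if uniqueness falls below the threshold, stop that subtree.
    public void OnPageCrawled(Uri url, IEnumerable<string> textBlocks)
    {
        string dir = GetDirectory(url);
        if (!metricsByDirectory.TryGetValue(dir, out var metric))
        {
            metric = new TextBlockUniquenessMetric();
            metricsByDirectory[dir] = metric;
        }
        foreach (string block in textBlocks) metric.RegisterBlock(block);

        if (metric.TotalBlocks >= MinBlocksBeforeDecision
            && metric.PercentUnique < MinPercentUnique)
        {
            excludedDirectories.Add(dir);   // excludes dir and all its subdirectories
            metricsByDirectory.Remove(dir); // reset the metric, keep crawling elsewhere
        }
    }

    // Directory part of the URL path, e.g. "/webfg/action/" for
    // http://bourse.latribune.fr/webfg/action/TELEFONICA
    private static string GetDirectory(Uri url)
    {
        string path = url.AbsolutePath;
        return path.Substring(0, path.LastIndexOf('/') + 1);
    }
}
```

In practice, `ShouldCrawl` would be wired into Abot's page crawl decision and `OnPageCrawled` into its page-crawled event; the exact hookup is omitted here.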

laurentprudhon commented 5 years ago

Couldn't find a generic way to do this. In addition, the #33 and #35 enhancements enable a better way to deal with this problem.

Recommended process: