laurentprudhon / nlptextdoc

Suite of tools to extract and annotate language resources for NLP applications

Ignore pages with 0% unique text blocks #27

Closed. laurentprudhon closed this issue 5 years ago

laurentprudhon commented 5 years ago

Example: https://www.60millions-mag.com/forum/banque-epargne-credit-f76/

1228 out of 1978 pages (about 62%) contain 0% unique text blocks.
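
A minimal sketch of the filtering idea, not the actual nlptextdoc implementation. It assumes that a text block counts as "unique" when it appears on exactly one page of the crawled site, and that each page is available as a list of text block strings. The names `unique_block_ratios` and `filter_pages` are hypothetical.

```python
from collections import Counter


def unique_block_ratios(pages):
    """Map each page id to the fraction of its blocks found on no other page.

    `pages` is a dict {page_id: [text_block, ...]} (assumed layout).
    """
    # Count on how many distinct pages each text block occurs.
    page_counts = Counter()
    for blocks in pages.values():
        page_counts.update(set(blocks))

    ratios = {}
    for page_id, blocks in pages.items():
        if not blocks:
            # A page without any text block has nothing unique to offer.
            ratios[page_id] = 0.0
            continue
        unique = sum(1 for block in blocks if page_counts[block] == 1)
        ratios[page_id] = unique / len(blocks)
    return ratios


def filter_pages(pages):
    """Keep only the pages that contain at least one unique text block."""
    ratios = unique_block_ratios(pages)
    return {pid: blocks for pid, blocks in pages.items() if ratios[pid] > 0.0}


# Hypothetical example: two listing pages share all of their blocks.
pages = {
    "a": ["header", "post 1"],
    "b": ["header", "footer"],
    "c": ["header", "footer"],
}
kept = filter_pages(pages)
```

Here only page `a` survives, because pages `b` and `c` consist entirely of blocks repeated elsewhere. Exact string matching is the simplest choice; a real extractor might normalize whitespace or hash the blocks first.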

laurentprudhon commented 5 years ago

Only implemented point 1