laurentprudhon / nlptextdoc

Suite of tools to extract and annotate language resources for NLP applications

Calculate % of unique text blocks based on the number of chars #29

Closed laurentprudhon closed 5 years ago

laurentprudhon commented 5 years ago

Example: http://labourseauquotidien.fr/

This website contains many very small text blocks that are identical on every page, surrounding one large text block that differs from page to page.

Visually, if we count words or chars, each page contains a good chunk of unique content, but if we only count text blocks, there is just one unique text block per page.

=> include the number of chars when computing the % of unique text blocks
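
To make the difference between the two ways of counting concrete, here is a minimal Python sketch. It is not the project's actual implementation: the function name `unique_block_stats`, the input shape (a list of pages, each a list of text-block strings), and the definition of "unique" (a block that appears on exactly one page) are all assumptions made for illustration.

```python
from collections import Counter


def unique_block_stats(pages):
    """Return (% unique by block count, % unique by char count).

    `pages` is a list of pages, each page a list of text-block strings.
    Assumption: a block is "unique" if it appears on exactly one page.
    """
    # Count on how many distinct pages each block text appears.
    page_counts = Counter()
    for blocks in pages:
        for block in set(blocks):
            page_counts[block] += 1

    total_blocks = unique_blocks = 0
    total_chars = unique_chars = 0
    for blocks in pages:
        for block in blocks:
            total_blocks += 1
            total_chars += len(block)
            if page_counts[block] == 1:
                unique_blocks += 1
                unique_chars += len(block)

    # Guard against empty input to avoid dividing by zero.
    pct_by_block = 100.0 * unique_blocks / total_blocks if total_blocks else 0.0
    pct_by_chars = 100.0 * unique_chars / total_chars if total_chars else 0.0
    return pct_by_block, pct_by_chars


# Two hypothetical pages: the same two small blocks plus one large unique block.
pages = [
    ["Home", "News", "x" * 92],
    ["Home", "News", "y" * 92],
]
by_block, by_chars = unique_block_stats(pages)
```

With this made-up input, only one block in three is unique when counting blocks, while 92% of the characters belong to unique blocks, which matches the visual impression described above.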

laurentprudhon commented 5 years ago

In fact, the % of unique text blocks is already computed this way.