Exclude pages based on the language

laurentprudhon / nlptextdoc

Suite of tools to extract and annotate language resources for NLP applications


Exclude pages based on the language #19

Closed laurentprudhon closed 5 years ago

laurentprudhon commented 5 years ago

Try to detect the language of a page while crawling, based on:

a language pattern in the URL
the lang attribute of the <html> element

Add a parameter to filter the pages which will be extracted based on the language.

Examples:

https://mabanque.bnpparibas/en/freeinternationalwithdrawals | <html lang="en"

https://www.cic.fr/es/banco/informacion-legal/proteccion-de-datos-personales.html | <html lang="es"

https://www.cnp.fr/en/Journalist/All-our-press-releases/2015/Trisomie-21-France-launches-its-online-medical-monitoring-resource-santetresfacile.fr-with-the-support-of-the-CNP-Assurances-Foundation | <html lang="en"

https://www.economie.gouv.fr/igpde-editions-publications/monthly-notices-on-public-management | <html lang="fr"

http://www.fbf.fr/en/press-room/press-releases/banks-committed-to-the-fight-against-the-financing-of-terrorism?utm_source=PT22019&utm_medium=Email&utm_campaign=PT22019 | <html lang="fr"

https://www.home.saxo/nl-nl/products/bonds | <html lang="nl"

https://www.home.saxo/zh-hk/products/listed-options | <html lang="zh"

https://www.impots.gouv.fr/portail/international-particulier/when-and-where-declare | <html lang="fr"
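
The two signals described in this issue could be sketched roughly as follows. This is only an illustration in Python, not code from nlptextdoc: the function names, the short list of language codes, and the choice to trust the URL before the lang attribute are all assumptions.

```python
import re
from urllib.parse import urlparse

# Hypothetical, deliberately short list of codes; a real crawler would need more.
KNOWN_LANGS = {"en", "fr", "es", "nl", "zh", "de", "it", "pt"}

# A path segment such as "en", "es", "nl-nl" or "zh-hk".
_URL_LANG = re.compile(r"^([a-z]{2})(?:[-_][a-z]{2})?$", re.IGNORECASE)
# The lang attribute of the opening <html> tag.
_HTML_LANG = re.compile(r"<html\b[^>]*?\blang\s*=\s*[\"']?([a-zA-Z]{2})", re.IGNORECASE)


def lang_from_url(url):
    """Return the first path segment that looks like a known language code."""
    for segment in urlparse(url).path.split("/"):
        match = _URL_LANG.match(segment)
        if match and match.group(1).lower() in KNOWN_LANGS:
            return match.group(1).lower()
    return None


def lang_from_html(html):
    """Return the two-letter code found in <html lang=...>, if any."""
    match = _HTML_LANG.search(html)
    return match.group(1).lower() if match else None


def detect_language(url, html):
    """Assumption: the URL pattern wins, the lang attribute is the fallback."""
    return lang_from_url(url) or lang_from_html(html)
```

With the examples above, the fbf.fr page is reported as "en" from its URL even though its lang attribute says "fr", while the economie.gouv.fr and impots.gouv.fr pages have no language segment in the URL and fall back to "fr", although their URLs suggest English content. Neither signal is fully reliable on its own.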

laurentprudhon commented 5 years ago

Again, this is a feature that would be difficult to get right with a purely automated solution. A better approach is to document this check and rely on the user to add the URLs to exclude to the new exclusions list in the parameters file.
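
As a rough illustration of the manual approach, a user-maintained exclusions list could be applied as below. The thread does not describe the format of the parameters file or the matching rule, so the list contents, the function name, and the use of simple prefix matching are all assumptions.

```python
# Hypothetical exclusions, as a user might write them after checking a site.
EXCLUSIONS = [
    "https://www.cic.fr/es/",
    "https://www.home.saxo/nl-nl/",
    "https://www.home.saxo/zh-hk/",
]


def is_excluded(url, exclusions=EXCLUSIONS):
    """Assumed rule: a URL is skipped when it starts with any listed prefix."""
    return any(url.startswith(prefix) for prefix in exclusions)


urls = [
    "https://www.cic.fr/es/banco/informacion-legal/proteccion-de-datos-personales.html",
    "https://www.home.saxo/nl-nl/products/bonds",
    "https://www.impots.gouv.fr/portail/international-particulier/when-and-where-declare",
]
# Keep only the pages that no exclusion prefix matches.
to_extract = [u for u in urls if not is_excluded(u)]
```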