Closed laurentprudhon closed 5 years ago
Again, this is a feature that would be difficult to get right with a purely automated solution. A better solution is to document this check and rely on the user to add the URLs to exclude to the new exclusions list in the parameters file.
Try to detect the language of a page while crawling, based on:
Add a parameter to filter which pages are extracted, based on their language.
Examples:
https://mabanque.bnpparibas/en/freeinternationalwithdrawals | <html lang="en"
https://www.cic.fr/es/banco/informacion-legal/proteccion-de-datos-personales.html | <html lang="es"
https://www.cnp.fr/en/Journalist/All-our-press-releases/2015/Trisomie-21-France-launches-its-online-medical-monitoring-resource-santetresfacile.fr-with-the-support-of-the-CNP-Assurances-Foundation | <html lang="en"
https://www.economie.gouv.fr/igpde-editions-publications/monthly-notices-on-public-management | <html lang="fr"
http://www.fbf.fr/en/press-room/press-releases/banks-committed-to-the-fight-against-the-financing-of-terrorism?utm_source=PT22019&utm_medium=Email&utm_campaign=PT22019 | <html lang="fr"
https://www.home.saxo/nl-nl/products/bonds | <html lang="nl"
https://www.home.saxo/zh-hk/products/listed-options | <html lang="zh"
https://www.impots.gouv.fr/portail/international-particulier/when-and-where-declare | <html lang="fr"
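The examples above pair each URL with the `lang` attribute of its `<html>` tag, and the two signals do not always agree (the fbf.fr URL has an `/en/` prefix while the page declares `lang="fr"`). A minimal sketch of how both signals could be read and combined is shown here in Python. All function names, the `allowed` parameter, and the rule "prefer the URL prefix, fall back to the `lang` attribute" are assumptions made for illustration, not the crawler's actual behavior.

```python
import re
from typing import Optional, Set
from urllib.parse import urlparse

# lang attribute of the opening <html> tag, e.g. <html lang="en"
HTML_LANG_RE = re.compile(r'<html\b[^>]*?\slang\s*=\s*["\']?([A-Za-z]{2,3})', re.IGNORECASE)
# A path segment that looks like a language code: "en", "nl-nl", "zh-hk"
URL_LANG_RE = re.compile(r'^([a-z]{2})(?:-[a-z]{2})?$', re.IGNORECASE)


def lang_from_html(html: str) -> Optional[str]:
    """Return the primary language subtag declared on <html>, if any."""
    m = HTML_LANG_RE.search(html)
    return m.group(1).lower() if m else None


def lang_from_url(url: str) -> Optional[str]:
    """Return a language hint taken from the first path segment, if any."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        return None
    m = URL_LANG_RE.match(segments[0])
    return m.group(1).lower() if m else None


def page_language(url: str, html: str) -> Optional[str]:
    """Assumed rule: the URL prefix wins, the lang attribute is the fallback."""
    return lang_from_url(url) or lang_from_html(html)


def keep_page(url: str, html: str, allowed: Set[str]) -> bool:
    """Hypothetical filter parameter: keep pages whose language is allowed."""
    lang = page_language(url, html)
    # Pages with no detectable language are kept rather than silently dropped.
    return lang is None or lang in allowed
```

This heuristic still misses cases such as the economie.gouv.fr and impots.gouv.fr examples, where neither the URL prefix nor the `lang` attribute hints at a language other than French, so a manual exclusion list remains useful.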