laurentprudhon / nlptextdoc

Suite of tools to extract and annotate language resources for NLP applications

Some websites are not extracted at all #9

Closed laurentprudhon closed 5 years ago

laurentprudhon commented 5 years ago

No pages extracted at all:

https://bourse.lefigaro.fr/
https://www.macif.fr/assurance/particuliers
https://www.natixis.com/
https://epargne.ooreka.fr/
https://www.quechoisir.org/thematique-banque-credit-t111/
https://www.societegenerale.fr/

Very few pages extracted:

https://www.culturebanque.com/
http://www.leparisien.fr/actus/banque
https://n26.com/fr-fr/
https://www.orangebank.fr/
https://www.creditmutuel.fr

laurentprudhon commented 5 years ago

Solution 1: follow https-to-http redirections manually (they are not followed automatically by the framework, for security reasons).

https://bourse.lefigaro.fr/ : fixed
https://www.natixis.com/ : fixed
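The logic of solution 1 can be sketched as follows. The crawler itself is built on the .NET Abot framework, so this is only a Python sketch of the idea, and the helper name `follow_downgrade_redirect` is hypothetical:

```python
from typing import Optional


def follow_downgrade_redirect(status: int, url: str,
                              location: Optional[str]) -> Optional[str]:
    """Return the plain-http URL to retry when an https request answers
    with a redirect to http (a scheme downgrade that HTTP clients
    typically refuse to follow automatically for security reasons).
    Return None when there is nothing to follow.

    Hypothetical helper illustrating the fix; the real crawler is C#.
    """
    redirect_codes = (301, 302, 307, 308)
    if (status in redirect_codes
            and url.startswith("https://")
            and location is not None
            and location.startswith("http://")):
        return location
    return None
```

The crawler would call this once on the root URL and, when it returns a non-None value, restart the crawl from the downgraded address.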

laurentprudhon commented 5 years ago

Solution 2: some websites are protected against bot scraping by solutions like https://datadome.co/fr/. We want to remain a polite crawler, so we just output a message to the user and abort the extraction.

https://www.macif.fr/assurance/particuliers : will not be fixed
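Detecting that kind of protection can be sketched like this. This is an illustrative Python sketch, not the project's actual detection code, and the marker list is an assumption (real anti-bot services vary in how their challenge pages look):

```python
def looks_like_bot_protection(status: int, body: str) -> bool:
    """Heuristic guess that a response comes from an anti-bot service
    rather than the real page: an auth/forbidden status, or a body
    containing typical challenge-page markers.

    Hypothetical helper; marker strings are illustrative assumptions.
    """
    markers = ("datadome", "captcha", "access denied")
    lowered = body.lower()
    return status in (401, 403) or any(m in lowered for m in markers)
```

On a positive match, a polite crawler would log a message for the user and abort instead of trying to work around the protection.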

laurentprudhon commented 5 years ago

Solution 3: some websites spread their content across several subdomains of the same base domain; for example the root Uri is www.bank.com and the following Uris are savings.bank.com or credit.bank.com. By default, Abot only crawls pages in exactly the same subdomain: we override this behavior to allow crawling everything in the same base domain (*.bank.com).

https://epargne.ooreka.fr/ : fixed
https://www.societegenerale.fr/ : fixed
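The base-domain check can be sketched as below. Again a Python sketch of the logic rather than the actual Abot override, with a deliberately naive definition of "base domain" (last two labels), which is an assumption that breaks on suffixes like .co.uk; production code should rely on the Public Suffix List:

```python
def same_base_domain(root_host: str, candidate_host: str) -> bool:
    """Return True when candidate_host belongs to the same base domain
    as root_host, so that savings.bank.com is crawled from a root of
    www.bank.com.

    Naive sketch: the base domain is taken as the last two labels,
    which is wrong for multi-label public suffixes such as .co.uk.
    """
    base = ".".join(root_host.split(".")[-2:])
    return candidate_host == base or candidate_host.endswith("." + base)
```

A crawl decision hook would then accept any discovered link whose host passes this check instead of requiring an exact host match.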

laurentprudhon commented 5 years ago

Solution 4: some websites use uncommon syntax for Disallow entries in the robots.txt file that is not interpreted correctly by NRobots. For example, `Disallow: /?` and `Disallow: /&` are both interpreted as `Disallow: /`, which means something different and blocks the whole crawl. We fix this behavior to check the complete Url as specified here: https://developers.google.com/search/reference/robots_txt .

https://www.quechoisir.org/thematique-banque-credit-t111/ : fixed
https://www.culturebanque.com/ : fixed
https://www.orangebank.fr/ : fixed
https://www.creditmutuel.fr : fixed
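The corrected matching can be sketched as follows: per Google's robots.txt specification, a Disallow rule is prefix-matched against the full path plus query string, so `/?` must only block URLs whose path+query actually starts with `/?`. This is a Python sketch of that rule, not the patched NRobots code:

```python
from urllib.parse import urlsplit


def is_disallowed(url: str, disallow: str) -> bool:
    """Prefix-match a robots.txt Disallow rule against the full
    path + query of the URL, instead of truncating rules like
    'Disallow: /?' down to 'Disallow: /'.

    Illustrative sketch of the fixed behavior (simple prefix rules
    only; wildcard and $ syntax are not handled here).
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target.startswith(disallow)
```

With this rule, `Disallow: /?` blocks only query-string URLs on the site root, and ordinary pages remain crawlable.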

laurentprudhon commented 5 years ago

Solution 5: for some websites, the delay between two requests is simply too short and must be increased via a command line parameter. This shows up in the http log file as errors of type "TooManyRequests".

https://n26.com/fr-fr/ : fixed
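The behavior behind solution 5 can be sketched as a retry loop that backs off when the server answers 429 TooManyRequests. This is a hypothetical Python wrapper illustrating the idea, not the tool's actual command line option, and `fetch` stands in for whatever HTTP call the crawler makes:

```python
import time
from typing import Callable, Tuple


def polite_get(fetch: Callable[[str], Tuple[int, str]], url: str,
               delay: float = 1.0, max_retries: int = 3) -> Tuple[int, str]:
    """Call fetch(url) -> (status, body); on HTTP 429 TooManyRequests,
    double the inter-request delay and retry, up to max_retries
    attempts.

    Hypothetical sketch: the real fix is raising the crawl delay via a
    command line parameter, not an in-code retry loop.
    """
    status, body = fetch(url)
    for _ in range(max_retries - 1):
        if status != 429:
            break
        delay *= 2                # back off before the next attempt
        time.sleep(delay)
        status, body = fetch(url)
    return status, body
```

In practice, spotting "TooManyRequests" entries in the http log is the signal to rerun the extraction with a larger delay.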

laurentprudhon commented 5 years ago

Solution 6: some urls are simply out of date and no longer available. Just replace them with a new, up to date url.

http://www.leparisien.fr/actus/banque : fixed