laurentprudhon closed 5 years ago
Solution 1 : try https to http redirection (the framework does not do this automatically, for security reasons).
https://bourse.lefigaro.fr/ : fixed
https://www.natixis.com/ : fixed
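As an illustration of Solution 1, here is a minimal Python sketch of the decision involved. It is not the crawler's actual code: the function name `resolve_downgrade_redirect` is hypothetical, and the same-host restriction is an extra precaution assumed here, not something stated above.

```python
from urllib.parse import urljoin, urlsplit


def resolve_downgrade_redirect(request_url, location_header):
    """Return the http:// URL to retry, or None if this redirect is not
    an https -> http downgrade.

    Assumption (not from the original fix): the downgrade is only
    followed when the host name stays the same.
    """
    # A Location header may be relative, so resolve it against the request URL.
    target = urljoin(request_url, location_header)
    src, dst = urlsplit(request_url), urlsplit(target)
    is_downgrade = src.scheme == "https" and dst.scheme == "http"
    if is_downgrade and src.hostname == dst.hostname:
        return target
    return None
```

With this sketch, a redirect from `https://www.example.com/a` to `http://www.example.com/a` is followed explicitly, while a relative redirect that stays on https returns `None` and is left to the normal redirect handling.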
Solution 2 : some websites are protected against bot scraping by solutions like https://datadome.co/fr/. We want to remain a polite crawler, so we just output a message to the user and abort the extraction.
https://www.macif.fr/assurance/particuliers : will not be fixed
Solution 3 : some websites have content spread across several subdomains of the same base domain, for example the root Uri is www.bank.com and the following Uris are savings.bank.com or credit.bank.com. By default, Abot only crawls pages that are in exactly the same subdomain : we override this behavior to allow crawling everything in the same base domain (*.bank.com).
https://epargne.ooreka.fr/ : fixed
https://www.societegenerale.fr/ : fixed
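The base-domain check from Solution 3 can be sketched in Python as follows. This is only an illustration, not the Abot override itself: the helper names are hypothetical, and the sketch naively assumes the base domain is the last two labels of the host name, which is wrong for suffixes such as `.co.uk` (a real implementation would consult the Public Suffix List).

```python
from urllib.parse import urlsplit


def base_domain(url, labels=2):
    """Return the last `labels` labels of the host name, e.g. 'bank.com'.

    Naive assumption: the registrable domain is always two labels long.
    """
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-labels:])


def in_same_base_domain(root_url, candidate_url):
    """True if both URLs share the same (naively computed) base domain."""
    root = base_domain(root_url)
    return root != "" and base_domain(candidate_url) == root
```

For a root Uri of `https://www.bank.com/`, a link to `https://savings.bank.com/rates` is accepted, while `https://www.otherbank.com/` is rejected.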
Solution 4 : some websites use uncommon syntax for disallow entries in the robots.txt file, which NRobots does not interpret correctly. For example, "Disallow: /?" and "Disallow: /&" are both interpreted as "Disallow: /", which has a different meaning and blocks the whole crawl. We fix this behavior by checking the rule against the complete Url, as specified here : https://developers.google.com/search/reference/robots_txt .
https://www.quechoisir.org/thematique-banque-credit-t111/ : fixed
https://www.culturebanque.com/ : fixed
https://www.orangebank.fr/ : fixed
https://www.creditmutuel.fr : fixed
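To make the difference in Solution 4 concrete, here is a small Python sketch of matching disallow rules against the path plus query string, following the prefix, `*` and `$` semantics described in the Google specification linked above. It is not the NRobots patch: the function names are hypothetical, and the sketch ignores `Allow` rules, rule precedence and user-agent groups.

```python
import re
from urllib.parse import urlsplit


def rule_to_regex(rule_path):
    """Compile one Disallow value: prefix match, '*' as a wildcard,
    and a trailing '$' anchoring the end of the URL."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape each literal part so that '?' and '&' are matched literally.
    pattern = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(url, disallow_rules):
    """True if the path and query of `url` match any Disallow rule."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # An empty Disallow value means "allow everything", so it is skipped.
    return any(
        bool(rule) and rule_to_regex(rule).match(target) is not None
        for rule in disallow_rules
    )
```

With the rules `["/?", "/&"]`, a page such as `https://www.example.com/credit/offers` stays crawlable and only URLs starting with `/?` or `/&` are blocked, whereas the rule `"/"` blocks everything.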
Solution 5 : for some websites, the delay between two requests is too short and must be increased using the command line parameter. The symptom is errors of type "TooManyRequests" in the http log file.
https://n26.com/fr-fr/ : fixed
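Solution 5 is applied by hand through a command line parameter. For comparison only, the same idea can be sketched in Python as an automatic adjustment: when a response has status 429 (Too Many Requests), lengthen the delay before the next request. The function `next_delay` and its default values are hypothetical and are not part of the tool; `retry_after` is assumed to be the `Retry-After` header already parsed as a number of seconds.

```python
def next_delay(current_delay, status_code, retry_after=None,
               factor=2.0, max_delay=60.0):
    """Return the delay (in seconds) to wait before the next request.

    Hypothetical policy: multiply the delay by `factor` after a 429,
    honor a longer Retry-After value if given, and cap at `max_delay`.
    """
    if status_code != 429:
        # No throttling signal: keep the configured delay unchanged.
        return current_delay
    proposed = current_delay * factor
    if retry_after is not None:
        proposed = max(proposed, float(retry_after))
    return min(proposed, max_delay)
```

Starting from a 1 second delay, a 429 response raises the delay to 2 seconds, or to the server's `Retry-After` value if that is longer.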
Solution 6 : some urls are simply out of date and no longer available. Replace them with a new, up to date url.
No pages extracted at all :
https://bourse.lefigaro.fr/
https://www.macif.fr/assurance/particuliers
https://www.natixis.com/
https://epargne.ooreka.fr/
https://www.quechoisir.org/thematique-banque-credit-t111/
https://www.societegenerale.fr/
Very few pages extracted :
https://www.culturebanque.com/
http://www.leparisien.fr/actus/banque
https://n26.com/fr-fr/
https://www.orangebank.fr/
https://www.creditmutuel.fr