laurentprudhon / nlptextdoc

Suite of tools to extract and annotate language resources for NLP applications

Page characters decoding problems #17

Status: closed by laurentprudhon 5 years ago

laurentprudhon commented 5 years ago

Example 1 :

https://fr.finance.yahoo.com/actualites/air-france-klm-resultats-annuels-061600709.html ��i{�6�0���\n�3��7��)�ǖ�q�؉��$g&7GP͘

NB: the following page is decoded properly: https://fr.finance.yahoo.com/quote/%5ESTOXX50E/community?p=%5ESTOXX50E

Example 2 :

https://www.banquepopulaire.fr/portailinternet/Catalogue/Produits/Pages/carte-visa-classique.aspx Soit le d��bit immédiat de vos paiements par carte

Example 3 :

https://www.banquepopulaire.fr/portailinternet/Editorial/Informations/Pages/flash-marches-du-19-au-13-janvier.aspx?EditorialVaryLevel=4&EditorialVaryHashPath=-1713699589 PEA, l���épargne en actions

Example 4 :

https://www.banquepopulaire.fr/portailinternet/Editorial/Informations/Pages/succession-recuperation-prestations-sociales.aspx?EditorialVaryLevel=4&EditorialVaryHashPath=-1628741548 Comment optimiser votre rémun��ration ?

Example 5 :

https://www.yomoni.fr/partenaires?partenaire=CAPITAINE_EPARGNE&utm_source=Capitaine_Epargne&utm_medium=partenaire&utm_campaign=epargne �Խ��F�(��>DG�U!BnX�D�e���l��ѝ ���"�"

Example 6 :

https://www.amazon.fr/Canon-2578A009AA-Objectif-70-200-4-0/dp/B00005QF6T/ref=as_li_ss_tl?ie=UTF8&qid=1529938464&sr=8-4&keywords=Canon+EF+70-200mm+f/4L+Canon&linkCode=sl1&tag=bestcanonlenses-21&linkId=87f7628d43c69759e63d49a362a741a3 ��ϱ\n�0�ݧ�f;�AhJ���g(�&iSbZ����A�

Example 7 :

https://www.mon-epargne.com/groupama-banque Erreur dans l'ex�cution de la requ�te 'select * from offr

laurentprudhon commented 5 years ago

Example 1: failed to reproduce
Example 2: failed to reproduce
Example 3: failed to reproduce
Example 4: failed to reproduce

laurentprudhon commented 5 years ago

Example 5: automatic gzip/deflate decompression wasn't enabled in the Abot config params! The compressed response bytes were being decoded as text. Just needed to switch this config param to true … => fixed
Example 6: fixed
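To illustrate why examples 5 and 6 looked like random bytes, here is a minimal Python sketch (not the project's C# / Abot code) of what happens when a compressed body is decoded as text without being decompressed first. The function name decode_body and its parameters are hypothetical, chosen only for this illustration.

```python
import gzip
import zlib


def decode_body(raw: bytes, content_encoding: str = "", charset: str = "utf-8") -> str:
    """Decompress a response body according to its Content-Encoding, then decode it.

    Hypothetical helper for illustration only; a crawler with automatic
    decompression enabled does the equivalent of this internally.
    """
    encoding = content_encoding.strip().lower()
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        try:
            raw = zlib.decompress(raw)  # zlib-wrapped deflate stream
        except zlib.error:
            raw = zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    # Undecodable bytes become U+FFFD, which is what the crawler output showed
    return raw.decode(charset, errors="replace")


page = "PEA, l'épargne en actions"
compressed = gzip.compress(page.encode("utf-8"))

garbled = decode_body(compressed)        # compression ignored: mostly U+FFFD debris
fixed = decode_body(compressed, "gzip")  # decompressed first: original text
```

With decompression skipped, the gzip header and payload bytes are not valid UTF-8, so most of the output turns into replacement characters, much like the output reported for examples 5 and 6.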

laurentprudhon commented 5 years ago

Example 7: the server doesn't send any encoding hint and returns a response that isn't encoded as UTF-8 => impossible to fix at this level. Note: browsers also display the decoding errors on screen.
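For reference, the failure in example 7 can be reproduced with a short Python sketch. It assumes the server's bytes are Latin-1 / Windows-1252, which is a guess based on the French text, not something the server confirms. The helper decode_with_fallback is hypothetical and only shows one common heuristic (try UTF-8 strictly, then fall back to a legacy code page). It is not what the project does, and such a guess can be wrong for other pages.

```python
from typing import Optional, Tuple


def decode_with_fallback(
    raw: bytes,
    declared_charset: Optional[str] = None,
    fallback: str = "cp1252",  # assumption: legacy Western European code page
) -> Tuple[str, str]:
    """Return (text, charset_used) for a response body.

    Hypothetical heuristic: trust the declared charset if it works,
    otherwise try strict UTF-8, otherwise use the fallback code page.
    """
    if declared_charset:
        try:
            return raw.decode(declared_charset), declared_charset
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong charset label: keep guessing
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace"), fallback


expected = "Erreur dans l'exécution de la requête"
raw = expected.encode("latin-1")  # what a non-UTF-8 server might send

# Decoding as UTF-8 with replacement reproduces the reported symptom
as_utf8 = raw.decode("utf-8", errors="replace")

text, used = decode_with_fallback(raw)
```

Because the fallback is only a guess, it would need to be validated per site before being trusted in a crawl.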