postlight / parser-api

🚀 A drop-in replacement for the Postlight Parser API.
https://reader.postlight.com/
Apache License 2.0

All accented characters are corrupted #13

Open DrLuthor opened 5 years ago

DrLuthor commented 5 years ago

Expected Behavior

When you POST a request with only the url parameter, the response is correctly encoded as UTF-8. When I use the html parameter, the response should be correctly encoded as UTF-8 too.

The API should return a title like this: "Le démantèlement des réacteurs nucléaires, véritable filière industrielle", and content like this: ... <p><strong>Dans les prochaines ann&#xE9;es, avec la transition &#xE9;nerg&#xE9;tique et le d&#xE9;mant&#xE8;lement ...

Current Behavior

Title returned: "Le d�mant�lement des r�acteurs nucl�aires, v�ritable fili�re industrielle". Content returned: ... <p><strong>Dans les prochaines ann&#xFFFD;es, avec la transition &#xFFFD;nerg&#xFFFD;tique et le d&#xFFFD;mant&#xFFFD;lement ...
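For reference, U+FFFD is the Unicode replacement character. One common way to end up with it is to decode bytes that are not valid UTF-8 as if they were UTF-8. The sketch below reproduces the same symptom in Python. It only illustrates that mechanism; it is not a confirmed diagnosis of what the API does internally, and `decode_as_utf8` is a hypothetical helper name.

```python
def decode_as_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for any invalid sequence."""
    return raw.decode("utf-8", errors="replace")


title = "Le démantèlement"

# Bytes that really are UTF-8 survive the round trip unchanged.
good = decode_as_utf8(title.encode("utf-8"))

# Assumption for illustration: the same text encoded as Latin-1 (ISO-8859-1).
# Its accented bytes (0xE9, 0xE8) are not valid UTF-8 on their own,
# so each one is replaced by U+FFFD.
bad = decode_as_utf8(title.encode("latin-1"))

print(good)
print(bad)
```

The second value has the same shape as the corrupted title above: one replacement character per accented letter.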

Steps to Reproduce

I send a POST request to the parse-html endpoint with this body: { "url": "https://www.europeanscientist.com/fr/energie/demantelement-reacteurs-nucleaires-dechets-pngmdr/", "html" : [copy_paste_of_html_code] }

Possible Solution

I tried forcing the request's Content-Type header to UTF-8 with application/json; charset=utf-8, but it doesn't change the result. While running this request locally, I got an iconv-lite deprecation warning related to encoding: "Iconv-lite warning: decode()-ing strings is deprecated. Refer to https://github.com/ashtuchkin/iconv-lite/wiki/Use-Buffers-when-decoding"
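A possible diagnostic, sketched in Python: send the body as pure-ASCII JSON, with every non-ASCII character written as a `\uXXXX` escape, so the request bytes are identical under UTF-8 and Latin-1. If the response is still corrupted with such a body, the request charset is probably not the cause. This is an assumption about how to narrow the problem down, not a verified fix; `build_ascii_json_body` is a hypothetical helper, and the short HTML string stands in for the full page source.

```python
import json


def build_ascii_json_body(url: str, html: str) -> bytes:
    """Serialize the request payload as JSON containing only ASCII bytes."""
    payload = {"url": url, "html": html}
    # ensure_ascii=True escapes every non-ASCII character as \uXXXX.
    text = json.dumps(payload, ensure_ascii=True)
    return text.encode("ascii")


body = build_ascii_json_body(
    "https://www.europeanscientist.com/fr/energie/"
    "demantelement-reacteurs-nucleaires-dechets-pngmdr/",
    "<p>Dans les prochaines années</p>",  # placeholder for the real HTML
)

# Any compliant JSON parser restores the original characters.
decoded = json.loads(body.decode("ascii"))
print(body)
print(decoded["html"])
```

The resulting bytes can be sent as the request body with Content-Type application/json by any HTTP client.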