kurtmckee / feedparser

Parse feeds in Python
https://feedparser.readthedocs.io/en/latest/

encodings: decode utf-8 with errors='replace' when confident #421

Open Rongronggg9 opened 6 months ago

Rongronggg9 commented 6 months ago

"Confident" means "metadata of the document explicitly indicates that the encoding is UTF-8".

Background of the patch

When a UTF-8 feed contains a few invalid byte sequences but is otherwise fine, feedparser gives up on UTF-8 and parses the feed as iso-8859-2 (or as whatever encoding chardet detects, if it is installed). This happens even when both the HTTP and XML headers explicitly declare the encoding as utf-8.

To handle this better, we should decode the feed as UTF-8 with errors='replace', so that only the invalid bytes are replaced and the rest of the text is preserved.
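
To illustrate the difference, here is a minimal sketch using only Python's built-in `bytes.decode`. The helper `decode_confident_utf8` and the sample bytes are hypothetical and written for this comment; they are not feedparser's internal code or API.

```python
def decode_confident_utf8(data: bytes, declared_encoding: str) -> str:
    """Hypothetical helper (not part of feedparser).

    If the declared encoding is UTF-8, decode leniently so that invalid
    bytes become U+FFFD instead of forcing a fallback encoding.
    """
    if declared_encoding.lower() in ("utf-8", "utf8"):
        return data.decode("utf-8", errors="replace")
    # Any other declared encoding is decoded strictly in this sketch.
    return data.decode(declared_encoding)


# A mostly valid UTF-8 document with one stray invalid byte (0xFF).
raw = "<title>caf\u00e9</title>".encode("utf-8") + b"\xff" + b"<p>ok</p>"

# Strict UTF-8 decoding rejects the whole document.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Falling back to iso-8859-2 "succeeds" but garbles every non-ASCII character.
fallback_text = raw.decode("iso-8859-2")

# Lenient UTF-8 decoding keeps the valid text and marks only the bad byte.
replaced_text = decode_confident_utf8(raw, "UTF-8")
```

In this sketch, `replaced_text` still contains `café` with a single U+FFFD replacement character where the invalid byte was, while `fallback_text` no longer contains `café` at all.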

butaford commented 5 months ago

Please accept this pull request. Everything works as it should with it!