nkanaev / yarr

yet another rss reader
MIT License

Fixed encoding #85

Closed f100024 closed 2 years ago

f100024 commented 2 years ago

Fixed encoding handling when parsing a new feed; fixed encoding handling when parsing HTML.

nkanaev commented 2 years ago

hi,

could you elaborate on what issue it's resolving, with examples if possible?

f100024 commented 2 years ago

hi.

sure. I've run into the following behavior: 1 2

Links from the example:

  1. https://www.pravda.com.ua/rss/
  2. https://news.finance.ua/ru/rss/
  3. https://www.opennet.ru/opennews/opennews_all_utf.rss

Links 1 and 2 return Content-Type: ...windows-1251, and their content is also in windows-1251. Let's take a look:

curl -Is https://www.pravda.com.ua/rss/ 
HTTP/2 200 
server: nginx
content-length: 11791
x-frame-options: SAMEORIGIN
x-content-type-options: nosniff
vary: Accept-Encoding
accept-ranges: bytes
via: 1.1 google
date: Tue, 18 Jan 2022 19:04:52 GMT
last-modified: Tue, 18 Jan 2022 19:00:03 GMT
etag: "61e70e33-2e0f"
content-type: text/xml; charset=windows-1251
age: 47
cache-control: must-revalidate,no-transform,public,max-age=300
alt-svc: clear

Same for Read mode:

curl -Is https://www.pravda.com.ua/news/2022/01/18/7320885/
HTTP/2 200 
server: nginx
date: Tue, 18 Jan 2022 19:07:31 GMT
content-type: text/html; charset=windows-1251
x-frame-options: SAMEORIGIN
x-content-type-options: nosniff
vary: Accept-Encoding
via: 1.1 google
cache-control: must-revalidate,no-transform,max-age=60,public
alt-svc: clear
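
Both responses above advertise their encoding in the Content-Type header. As a rough illustration of how that charset could be read on the client side, here is a minimal sketch using only the Go standard library. The helper name charsetFromContentType is hypothetical and is not taken from yarr's code.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType is a hypothetical helper: it returns the lowercased
// charset parameter of a Content-Type header value, or "" if the header has
// no charset parameter or cannot be parsed.
func charsetFromContentType(header string) string {
	_, params, err := mime.ParseMediaType(header)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

func main() {
	// Header values as returned by the two curl calls above.
	fmt.Println(charsetFromContentType("text/xml; charset=windows-1251"))  // windows-1251
	fmt.Println(charsetFromContentType("text/html; charset=windows-1251")) // windows-1251
	// No charset parameter: the helper returns an empty string.
	fmt.Println(charsetFromContentType("text/html") == "") // true
}
```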

The Read mode issue is solved by converting the content to UTF-8, because html.Parse() here expects UTF-8 input, but in my case it received windows-1251. More info here: https://pkg.go.dev/golang.org/x/net/html#Parse
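
To show why a parser that assumes UTF-8 garbles such a page, here is a small standard-library sketch. It only demonstrates the symptom (windows-1251 bytes are not valid UTF-8 and decode to U+FFFD replacement characters); it does not perform the conversion itself. The function names and the sample word are illustrative assumptions, not code from this PR.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// needsConversion is a hypothetical check: it reports whether body is not
// valid UTF-8 and therefore should be converted before UTF-8-only parsing.
func needsConversion(body []byte) bool {
	return !utf8.Valid(body)
}

// replacementCount counts how many U+FFFD runes appear when body is read
// as if it were UTF-8.
func replacementCount(body []byte) int {
	n := 0
	for _, r := range string(body) {
		if r == utf8.RuneError {
			n++
		}
	}
	return n
}

func main() {
	// "Правда" encoded as windows-1251: one byte per letter.
	cp1251 := []byte{0xCF, 0xF0, 0xE0, 0xE2, 0xE4, 0xE0}
	// The same word as UTF-8.
	asUTF8 := []byte("Правда")

	fmt.Println(needsConversion(cp1251)) // true
	fmt.Println(needsConversion(asUTF8)) // false

	// Every windows-1251 byte here turns into a replacement character.
	fmt.Println(replacementCount(cp1251)) // 6
}
```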

The issue with the feed name was in CharsetReader, more precisely in Charset.NewReader.

nkanaev commented 2 years ago

encoding parsing has been fully reworked (in https://github.com/nkanaev/yarr/commit/b78c8bf8bf417f719e7c3ea6a0fc98e014a6f3c1 & https://github.com/nkanaev/yarr/commit/52cc8ecbbd7d6e35f595909ddf07d182276d09f9) following @fserb's suggestion here.

expect the changes in the upcoming release.