f100024 closed 2 years ago
hi,
could you elaborate on what issue it's resolving, with examples if possible?
hi.
Sure. I ran into the following behavior:
Links from example:
Links 1 and 2 return Content-Type: ...windows-1251, and their content is also encoded in windows-1251.
let's take a look:
curl -Is https://www.pravda.com.ua/rss/
HTTP/2 200
server: nginx
content-length: 11791
x-frame-options: SAMEORIGIN
x-content-type-options: nosniff
vary: Accept-Encoding
accept-ranges: bytes
via: 1.1 google
date: Tue, 18 Jan 2022 19:04:52 GMT
last-modified: Tue, 18 Jan 2022 19:00:03 GMT
etag: "61e70e33-2e0f"
content-type: text/xml; charset=windows-1251
age: 47
cache-control: must-revalidate,no-transform,public,max-age=300
alt-svc: clear
Same for Read mode:
curl -Is https://www.pravda.com.ua/news/2022/01/18/7320885/
HTTP/2 200
server: nginx
date: Tue, 18 Jan 2022 19:07:31 GMT
content-type: text/html; charset=windows-1251
x-frame-options: SAMEORIGIN
x-content-type-options: nosniff
vary: Accept-Encoding
via: 1.1 google
cache-control: must-revalidate,no-transform,max-age=60,public
alt-svc: clear
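To make the header check above concrete, here is a minimal sketch of pulling the charset out of a Content-Type value using only the Go standard library. The helper name charsetFromContentType is hypothetical and is not taken from yarr.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType returns the lower-cased charset parameter of a
// Content-Type header value, or "" when the parameter is absent or the
// value cannot be parsed. Hypothetical helper, for illustration only.
func charsetFromContentType(header string) string {
	_, params, err := mime.ParseMediaType(header)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

func main() {
	// Header values copied from the curl output above.
	fmt.Println(charsetFromContentType("text/xml; charset=windows-1251"))
	fmt.Println(charsetFromContentType("text/html; charset=windows-1251"))
}
```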
The issue with Read mode is solved by converting the content to utf8, because html.Parse() here expects utf8 input, but in my case it got windows-1251. More info here: https://pkg.go.dev/golang.org/x/net/html#Parse
The issue with the feed name was in CharsetReader, more precisely in Charset.NewReader.
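For illustration, here is a rough sketch of how a CharsetReader hook on the standard encoding/xml decoder can turn a windows-1251 feed into utf8 before the title is read. It is not the code from this PR: the names decodeWindows1251, charsetReader and feedTitle are made up, and the hand-written table only covers ASCII and the Cyrillic letters so that the example stays inside the standard library. Real code would more likely rely on golang.org/x/net/html/charset or golang.org/x/text/encoding/charmap.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// decodeWindows1251 converts windows-1251 bytes to a UTF-8 string.
// Assumption: only ASCII, the letters at 0xC0-0xFF, and 0xA8/0xB8 are
// mapped; every other byte becomes U+FFFD. This is a partial table.
func decodeWindows1251(data []byte) string {
	var sb strings.Builder
	for _, b := range data {
		switch {
		case b < 0x80:
			sb.WriteByte(b)
		case b >= 0xC0:
			// 0xC0..0xFF map linearly onto U+0410..U+044F.
			sb.WriteRune(rune(0x0410) + rune(b-0xC0))
		case b == 0xA8:
			sb.WriteRune('\u0401')
		case b == 0xB8:
			sb.WriteRune('\u0451')
		default:
			sb.WriteRune('\uFFFD')
		}
	}
	return sb.String()
}

// charsetReader has the signature expected by xml.Decoder.CharsetReader.
// It buffers the rest of the input, which is fine for a small feed.
func charsetReader(label string, input io.Reader) (io.Reader, error) {
	if !strings.EqualFold(label, "windows-1251") {
		return nil, fmt.Errorf("unsupported charset: %s", label)
	}
	raw, err := io.ReadAll(input)
	if err != nil {
		return nil, err
	}
	return strings.NewReader(decodeWindows1251(raw)), nil
}

// feedTitle reads <channel><title> from an RSS document.
func feedTitle(feed []byte) (string, error) {
	var doc struct {
		Title string `xml:"channel>title"`
	}
	dec := xml.NewDecoder(bytes.NewReader(feed))
	dec.CharsetReader = charsetReader
	if err := dec.Decode(&doc); err != nil {
		return "", err
	}
	return doc.Title, nil
}

func main() {
	// A tiny feed whose title is six windows-1251 encoded Cyrillic letters.
	feed := append([]byte(`<?xml version="1.0" encoding="windows-1251"?><rss><channel><title>`),
		0xCF, 0xF0, 0xE0, 0xE2, 0xE4, 0xE0)
	feed = append(feed, []byte(`</title></channel></rss>`)...)
	title, err := feedTitle(feed)
	if err != nil {
		panic(err)
	}
	fmt.Println(title)
}
```

The same decoded string could be wrapped in strings.NewReader and handed to html.Parse, which covers the Read mode case as well.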
encoding parsing has been fully reworked (in https://github.com/nkanaev/yarr/commit/b78c8bf8bf417f719e7c3ea6a0fc98e014a6f3c1 & https://github.com/nkanaev/yarr/commit/52cc8ecbbd7d6e35f595909ddf07d182276d09f9) following @fserb's suggestion here.
expect the changes in the upcoming release.
Fixed encoding in parsing new feed; Fixed encoding in parsing html;