scsibug / feedparser-clj

Atom/RSS Feed Parsing for Clojure

Parser does not respect xml encoding. #9

Closed: Gonzih closed this issue 5 years ago

Gonzih commented 10 years ago

Hi, I'm using your amazing lib in my feeds2imap.clj project. Recently someone reported an issue with this feed: http://ibash.org.ru/rss.xml. It looks like the feed uses the windows-1251 encoding (which is horrible, but still). Is there any way to make the parser respect the encoding specified in the XML and convert everything to Unicode?

Thanks!

gfrivolt commented 8 years ago

I have the same issue with another feed, http://www.hirek.sk/rss/hirek.xml. How would you suggest converting the source to the proper encoding? Does feedparser-clj have support for that, or should another lib be used for preprocessing?

Gonzih commented 8 years ago

No, feedparser-clj does not do anything about encoding. I could fetch the data on my own, try to detect the encoding, and then convert it to Unicode. I'm still surprised that there are websites not using Unicode. I will take a look at that once I have some spare time. Thanks for providing another example.
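
One possible way to do the "detect encoding" step is to read the encoding name from the XML declaration before decoding the feed. The following is only a minimal Java sketch, assuming the feed bytes have already been fetched. `XmlEncodingSniffer` and `declaredCharset` are hypothetical names (not part of feedparser-clj), and the UTF-8 fallback and the 200-byte lookahead are assumptions. It ignores byte order marks, UTF-16 feeds, and any charset given in the HTTP `Content-Type` header.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of feedparser-clj.
class XmlEncodingSniffer {
    // Matches encoding="..." or encoding='...' inside the XML declaration.
    private static final Pattern ENCODING = Pattern.compile(
        "<\\?xml[^>]*\\bencoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    // Returns the charset named in the XML declaration, or UTF-8 (an assumed
    // default) when no usable declaration is found.
    static Charset declaredCharset(byte[] xml) {
        // For ASCII-compatible encodings the declaration itself is plain ASCII,
        // so reading the first bytes as ISO-8859-1 is enough to find it.
        int len = Math.min(xml.length, 200);
        String head = new String(xml, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: use the fallback below.
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        byte[] sample = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><rss/>"
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(declaredCharset(sample));
    }
}
```

The returned `Charset` can then be used to decode the raw bytes into a string before the feed is handed to the parser.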

gfrivolt commented 8 years ago

I'm also surprised, but that's reality. I checked the referenced Java libraries, and it seems only a few encodings (ASCII, UTF-8, UTF-16, ...) are supported.

Maybe it's not feedparser-clj's job to do the conversion. Maybe a recommendation on how to pre-process the feeds is sufficient. Unicode is probably enough for most feedparser-clj users.

What would you recommend using for the conversion?

Gonzih commented 8 years ago

I would say Java interop can do the trick.

gfrivolt commented 8 years ago

Which Java to interop with? :) Which library? Do you know of a resource or document where the encoding conversion is documented?

Gonzih commented 8 years ago

http://stackoverflow.com/questions/5729806/encode-string-to-utf-8
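
As a rough illustration of the conversion itself, the JDK's own charset support can decode the raw bytes with the source charset and re-encode them as UTF-8, so no extra library should be needed. This is a minimal sketch, not the approach feedparser-clj takes. `FeedTranscoder` and `toUtf8` are hypothetical names, and it assumes the source charset is already known and supported by the running JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of feedparser-clj.
class FeedTranscoder {
    // Decodes the bytes using the source charset, then re-encodes as UTF-8.
    static byte[] toUtf8(byte[] raw, Charset source) {
        String text = new String(raw, source);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The Russian word "privet" as windows-1251 bytes (one byte per letter).
        byte[] raw = {(byte) 0xEF, (byte) 0xF0, (byte) 0xE8,
                      (byte) 0xE2, (byte) 0xE5, (byte) 0xF2};
        // Assumes the JVM ships the windows-1251 charset.
        byte[] utf8 = toUtf8(raw, Charset.forName("windows-1251"));
        System.out.println(raw.length + " bytes in, " + utf8.length + " bytes out");
    }
}
```

From Clojure, the decoding step would presumably be a one-liner via interop, along the lines of `(String. raw "windows-1251")`. One caveat: after transcoding, the XML declaration inside the feed still names the old encoding, so it may be safer to pass the decoded string (characters, not bytes) on to the parser, or to rewrite the declaration.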
