rometools / rome

Java library for RSS and Atom feeds
https://rometools.github.io/rome
Apache License 2.0

Fetcher: Encoding problem #87

Closed PatrickGotthard closed 11 years ago

PatrickGotthard commented 11 years ago
=== This issue was migrated from JIRA ===
Type: Bug
Priority: Major
Status: Resolved
Resolution: Incomplete
Reported by: ejain
Assigned to: ROME Jira Lead
Created: Thu Jul 05 17:34:21 CEST 2007
Updated: Mon Jan 12 21:39:55 CET 2009
Resolved: Mon Jan 12 21:39:55 CET 2009
Version: current
Fix version: milestone 1
JIRA Link: https://rometools.jira.com/browse/ROME-86
=========================================

If I retrieve this feed (see URL), which uses ISO-8859-1 encoding, with the
FeedFetcher, the contents are not converted to normal Java UTF-16 strings; the
conversion has to be done manually:

FeedFetcher fetcher = new HttpURLFeedFetcher(cache);
SyndFeed feed = fetcher.retrieveFeed(url);
...
String desc = entry.getDescription().getValue(); // desc is not UTF-16!
desc = new String(desc.getBytes(Charset.forName("ISO-8859-1"))); // now OK

PatrickGotthard commented 11 years ago
=== This comment was migrated from JIRA ===
Author: snoopdave
Created: Mon Aug 13 21:32:50 CEST 2007
===========================================

Indicating that this is (probably) a Fetcher bug.

PatrickGotthard commented 11 years ago
=== This comment was migrated from JIRA ===
Author: nlothian
Created: Mon Jan 12 04:44:04 CET 2009
===========================================

Not a fetcher bug. Here is some code to replicate it (run against
http://www.expasy.org/spotlight/atom.xml):

public static void main(String[] args) {
    boolean ok = false;
    if (args.length == 1) {
        try {
            URL feedUrl = new URL(args[0]);
            SyndFeedInput input = new SyndFeedInput();

            // Force the content type so XmlReader treats the stream as ISO-8859-1.
            SyndFeed feed = input.build(new XmlReader(feedUrl.openStream(),
                    "text/html; charset=ISO-8859-1", true));

            List entries = feed.getEntries();
            for (Iterator iterator = entries.iterator(); iterator.hasNext();) {
                SyndEntry entry = (SyndEntry) iterator.next();
                String desc = entry.getDescription().getValue();
                // Re-encode as ISO-8859-1, then decode with the platform default charset.
                String encodedDesc = new String(desc.getBytes(Charset.forName("ISO-8859-1")));
                System.out.println(desc);
                System.out.println(encodedDesc);
            }

            ok = true;
        } catch (Exception ex) {
            ex.printStackTrace();
            System.out.println("ERROR: " + ex.getMessage());
        }
    }

    if (!ok) {
        System.out.println();
        System.out.println("FeedReader reads and prints any RSS/Atom feed type.");
        System.out.println("The first parameter must be the URL of the feed to read.");
        System.out.println();
    }
}

XmlReader seems to correctly detect that the feed is declared as ISO-8859-1,
but entry.getDescription().getValue() and the string rebuilt from
desc.getBytes(Charset.forName("ISO-8859-1")) come out different (I'm not
entirely sure what should happen here).
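
For reference, here is a minimal sketch of what that round trip does, using only the JDK (no ROME involved). The class name RoundTripSketch and the sample bytes are made up for illustration. It relies on two documented JDK behaviours: ISO-8859-1 maps every byte to the character with the same value, so getBytes with that charset recovers the original bytes unchanged, and new String(byte[]) without a charset decodes with the platform default charset. That second point would explain why the two printed strings can differ, and why the result depends on the machine running the code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripSketch {
    // Recovers the bytes behind a string that was decoded as ISO-8859-1.
    static byte[] rawBytes(String decodedAsLatin1) {
        return decodedAsLatin1.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Hypothetical sample: 0x93 and 0x94 stand in for bytes found in a feed.
        byte[] original = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        String desc = new String(original, StandardCharsets.ISO_8859_1);

        // Lossless: every byte maps to the char with the same value and back.
        System.out.println(Arrays.equals(original, rawBytes(desc)));

        // new String(byte[]) is the same as decoding with the default charset,
        // so the outcome of the workaround varies between platforms.
        String implicit = new String(rawBytes(desc));
        String explicit = new String(rawBytes(desc), Charset.defaultCharset());
        System.out.println(Charset.defaultCharset() + ": " + implicit.equals(explicit));
    }
}
```

Passing an explicit charset to the String constructor makes the behaviour reproducible across machines.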

PatrickGotthard commented 11 years ago
=== This comment was migrated from JIRA ===
Author: nlothian
Created: Mon Jan 12 21:39:55 CET 2009
===========================================

Thanks to Martin Kurz for taking a look at this.

It appears that the RSS feed referenced isn't ISO-8859-1 encoded - it's probably WINDOWS-1252.

ROME can't detect or correct that kind of problem - it needs to be fixed on the
server side.
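
To illustrate the difference between the two encodings, here is a small JDK-only sketch. The class name MislabelledCharsetSketch and the sample bytes are invented for the example, and it assumes the JDK in use ships the windows-1252 charset, which is not among the guaranteed standard charsets. The two encodings disagree only for bytes 0x80 to 0x9F: ISO-8859-1 turns them into C1 control characters, while windows-1252 assigns printable characters such as curly quotes to most of them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class MislabelledCharsetSketch {
    // Decodes raw bytes with the given charset.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    // Renders a string as space-separated hex code units, avoiding console encoding issues.
    static String hex(String s) {
        return s.chars().mapToObj(Integer::toHexString).collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available in this JDK.
        Charset cp1252 = Charset.forName("windows-1252");

        // Hypothetical payload: "OK" wrapped in windows-1252 curly quotes.
        byte[] raw = {(byte) 0x93, 'O', 'K', (byte) 0x94};

        // Decoded as declared (ISO-8859-1): the quotes become control characters.
        System.out.println(hex(decode(raw, StandardCharsets.ISO_8859_1))); // 93 4f 4b 94

        // Decoded as windows-1252: the quotes become U+201C and U+201D.
        System.out.println(hex(decode(raw, cp1252))); // 201c 4f 4b 201d
    }
}
```

A parser that trusts the declared encoding has no reliable way to tell which of the two readings the publisher intended.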