Closed PatrickGotthard closed 11 years ago
=== This comment was migrated from JIRA === Author: snoopdave Created: Mon Aug 13 21:32:50 CEST 2007 ===========================================
Indicating that this is (probably) a Fetcher bug.
=== This comment was migrated from JIRA === Author: nlothian Created: Mon Jan 12 04:44:04 CET 2009 ===========================================
Not a fetcher bug. Here is some code to replicate it (run against
http://www.expasy.org/spotlight/atom.xml):
public static void main(String[] args) {
    boolean ok = false;
    if (args.length == 1) {
        try {
            URL feedUrl = new URL(args[0]);
            SyndFeedInput input = new SyndFeedInput();
            // Content-Type is passed explicitly; lenient mode is enabled
            SyndFeed feed = input.build(new XmlReader(feedUrl.openStream(),
                    "text/html; charset=ISO-8859-1", true));
            List entries = feed.getEntries();
            for (Iterator iterator = entries.iterator(); iterator.hasNext();) {
                SyndEntry entry = (SyndEntry) iterator.next();
                String desc = entry.getDescription().getValue();
                // Re-encode as ISO-8859-1, then decode with the platform default charset
                String encodedDesc = new String(desc.getBytes(Charset.forName("ISO-8859-1")));
                System.out.println(desc);
                System.out.println(encodedDesc);
            }
            ok = true;
        } catch (Exception ex) {
            ex.printStackTrace();
            System.out.println("ERROR: " + ex.getMessage());
        }
    }
    if (!ok) {
        System.out.println();
        System.out.println("FeedReader reads and prints any RSS/Atom feed type.");
        System.out.println("The first parameter must be the URL of the feed to read.");
        System.out.println();
    }
}
XmlReader seems to detect ISO-8859-1 correctly, but
entry.getDescription().getValue() and the string rebuilt from
desc.getBytes(Charset.forName("ISO-8859-1")) print differently (I'm not
entirely sure what should happen here).
=== This comment was migrated from JIRA === Author: nlothian Created: Mon Jan 12 21:39:55 CET 2009 ===========================================
Thanks to Martin Kurz for taking a look at this.
It appears that the RSS feed referenced isn't ISO-8859-1 encoded - it's probably WINDOWS-1252.
ROME can't detect or correct that kind of problem - it needs to be fixed on the
server side.
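
To illustrate why a WINDOWS-1252 feed labelled as ISO-8859-1 goes wrong, here is a minimal sketch using only the JDK, not ROME. The class name, the helper hasC1Controls and the sample bytes are made up for illustration, and it assumes the JRE ships the windows-1252 charset, which is common but not guaranteed. Bytes 0x80 to 0x9F are printable characters in WINDOWS-1252 (curly quotes, dashes and so on) but decode to C1 control characters under ISO-8859-1, so finding those control characters in a description is a rough hint that the declared encoding is wrong.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MislabelledCharsetDemo {

    // Assumption: this JRE provides windows-1252 (not one of the guaranteed standard charsets).
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Returns true if the text contains C1 control characters (U+0080 to U+009F). */
    static boolean hasC1Controls(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are the curly double quotes in windows-1252.
        byte[] raw = {(byte) 0x93, 'h', 'i', (byte) 0x94};

        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        String asCp1252 = new String(raw, CP1252);

        System.out.println(hasC1Controls(asLatin1));  // true: 0x93 became U+0093
        System.out.println(hasC1Controls(asCp1252));  // false: 0x93 became U+201C
        System.out.println((int) asCp1252.charAt(0)); // 8220
    }
}
```

This is only a heuristic: a feed that really is ISO-8859-1 could in principle contain C1 controls, although that is rare in practice.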
If I fetch this feed (see URL), which uses ISO-8859-1 encoding, with the
FeedFetcher, the contents are not converted to normal Java UTF-16 strings;
this has to be done manually!
FeedFetcher fetcher = new HttpURLFeedFetcher(cache);
SyndFeed feed = fetcher.retrieveFeed(url);
...
String desc = entry.getDescription().getValue(); // desc is not UTF-16!
desc = new String(desc.getBytes(Charset.forName("ISO-8859-1"))); // now OK
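
A note on the workaround above: new String(byte[]) without a charset argument decodes with the platform default charset, so whether it "fixes" the text depends on the JVM and OS configuration. Below is a hedged sketch of the same round trip with both charsets spelled out. The class name and the redecode helper are hypothetical, it uses only the JDK, and it assumes the feed is really windows-1252 (as suggested later in this thread) and that the JRE provides that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DescriptionRepair {

    /**
     * Recovers the original bytes of text that was decoded as ISO-8859-1
     * and decodes them again with the charset the feed really uses.
     * ISO-8859-1 maps every byte to exactly one char, so the first step is lossless.
     */
    static String redecode(String misdecoded, Charset actual) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, actual);
    }

    public static void main(String[] args) {
        // Simulates a description whose byte 0x92 (right single quote in
        // windows-1252) was read as ISO-8859-1 and became U+0092.
        String desc = "it\u0092s";
        String fixed = redecode(desc, Charset.forName("windows-1252"));
        System.out.println(fixed.equals("it\u2019s")); // true
    }
}
```

With explicit charsets the result is the same on every platform, but this remains a client-side patch; the declared encoding of the feed is still wrong at the source.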