rometools / rome

Java library for RSS and Atom feeds
https://rometools.github.io/rome
Apache License 2.0

encoding problem in XmlReader.getXmlProlog() #72

Closed PatrickGotthard closed 10 years ago

PatrickGotthard commented 10 years ago
=== This issue was migrated from JIRA ===
Type: Bug
Priority: Major
Status: Resolved
Resolution: Fixed
Reported by: denverdino
Assigned to: ROME Jira Lead
Created: Thu Apr 26 12:51:37 CEST 2007
Updated: Thu Sep 15 20:55:33 CEST 2011
Resolved: Sat Jun 02 01:57:01 CEST 2007
Version: current
Fix version: milestone 1
JIRA Link: https://rometools.jira.com/browse/ROME-71
=========================================

com.sun.syndication.io.XmlReader.getXmlProlog() has an encoding problem.
It uses a buffer of size PUSHBACK_MAX_SIZE to guess the XML encoding, and the
buffer may end in the middle of a character; br.readLine() then fails on the
truncated character.

The problem occurs very commonly with non-Western feeds, e.g.
http://muwayne.spaces.live.com/feed.rss

PatrickGotthard commented 10 years ago
=== This comment was migrated from JIRA ===
Author: tucu
Created: Fri Apr 27 02:17:15 CEST 2007
===========================================

The provided feed is parsed successfully (using the FeedReader example).

PatrickGotthard commented 10 years ago
=== This comment was migrated from JIRA ===
Author: denverdino
Created: Tue May 15 05:23:34 CEST 2007
===========================================

When the end of the buffer (size PUSHBACK_MAX_SIZE) falls inside a character
(part of the character is in the buffer, part of it is not), an exception is
thrown. Please try http://hi.baidu.com/huisemumuxi/rss

sun.io.MalformedInputException
at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:262)
at sun.nio.cs.StreamDecoder$ConverterSD.convertInto(StreamDecoder.java:314)
at sun.nio.cs.StreamDecoder$ConverterSD.implRead(StreamDecoder.java:364)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:250)
at java.io.InputStreamReader.read(InputStreamReader.java:212)
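
The failure mode can be reproduced without a live feed. The following is a minimal sketch, not ROME code: the class name `SplitCharDemo`, the helper `decodesCleanly`, and the sample feed text are all made up for illustration. It decodes a byte prefix with a strict UTF-8 decoder and reports whether the prefix ends on a character boundary. A strict `CharsetDecoder` is used because a plain `InputStreamReader` on current JDKs substitutes a replacement character instead of throwing as the old `sun.io` converter did.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class SplitCharDemo {

    /** Returns true if the first {@code limit} bytes of {@code data} are valid, complete UTF-8. */
    static boolean decodesCleanly(byte[] data, int limit) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data, 0, limit));
            return true;
        } catch (CharacterCodingException e) {
            // A prefix that stops inside a multibyte sequence is reported as malformed.
            return false;
        }
    }

    public static void main(String[] args) {
        String prolog = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
        // \u4e2d and \u6587 each take three bytes in UTF-8.
        byte[] feed = (prolog + "<title>\u4e2d\u6587</title>").getBytes(StandardCharsets.UTF_8);
        // Everything before the title text is ASCII, so char count equals byte count here.
        int textStart = (prolog + "<title>").length();
        for (int cut = textStart; cut <= textStart + 3; cut++) {
            System.out.println("cut at byte " + cut + ": "
                    + (decodesCleanly(feed, cut) ? "ok" : "malformed"));
        }
    }
}
```

Cutting exactly at the start or end of the three-byte character decodes fine; cutting one or two bytes into it does not, which is the situation a fixed-size read-ahead buffer can hit.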

PatrickGotthard commented 10 years ago
=== This comment was migrated from JIRA ===
Author: tucu
Created: Tue May 15 07:43:05 CEST 2007
===========================================

I'm not seeing the problem on Mac OS X, but I think I understand the issue
(a multibyte char being split on the buffer boundary).

I think I have a fix for it, but I cannot test it.

Attached is the XmlReader source with the changes. I've also attached a ROME jar
with the compiled changes, please test and let me know.

PatrickGotthard commented 10 years ago
=== This comment was migrated from JIRA ===
Author: tucu
Created: Tue May 15 07:44:24 CEST 2007
===========================================

Created an attachment (id=10)
ROME JAR with the proposed fix

PatrickGotthard commented 10 years ago
=== This comment was migrated from JIRA ===
Author: tucu
Created: Tue May 15 07:45:07 CEST 2007
===========================================

Created an attachment (id=11)
XmlReader with fix

PatrickGotthard commented 10 years ago
=== This comment was migrated from JIRA ===
Author: tucu
Created: Sat Jun 02 01:57:01 CEST 2007
===========================================

Changed the logic of the XML prolog read-ahead to avoid this case.
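
The attached change itself is not reproduced in this thread. One way such a read-ahead could avoid the split, offered only as a sketch under assumptions, is to decode no further than the first `>` byte. In ASCII-compatible encodings such as UTF-8 that byte never occurs inside a multibyte sequence, so the cut always lands on a character boundary. The names `PrologSniffer`, `sniffEncoding`, and `MAX_PROLOG_BYTES` are hypothetical, and UTF-16 input would need different handling.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrologSniffer {

    // Hypothetical read-ahead limit; not the value of ROME's PUSHBACK_MAX_SIZE.
    static final int MAX_PROLOG_BYTES = 4096;

    static final Pattern ENCODING =
            Pattern.compile("<\\?xml[^>]*encoding\\s*=\\s*[\"']([^\"']+)[\"']");

    /**
     * Returns the encoding declared in the XML prolog, or null if none is found.
     * The stream must support mark/reset and is rewound before returning.
     */
    static String sniffEncoding(InputStream in, Charset guessed) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(MAX_PROLOG_BYTES);
        byte[] buf = new byte[MAX_PROLOG_BYTES];
        int len = 0;
        int end = -1; // index just past the first '>' byte, once found
        while (end < 0 && len < buf.length) {
            int n = in.read(buf, len, buf.length - len);
            if (n < 0) {
                break; // end of stream
            }
            for (int i = len; i < len + n && end < 0; i++) {
                if (buf[i] == '>') {
                    end = i + 1;
                }
            }
            len += n;
        }
        in.reset();
        if (end < 0) {
            return null; // no complete tag within the read-ahead window
        }
        // Decode only up to the first '>', never a partial trailing character.
        String head = new String(buf, 0, end, guessed);
        Matcher m = ENCODING.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws IOException {
        byte[] feed = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><title>\u4e2d\u6587</title>"
                .getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(feed);
        System.out.println(sniffEncoding(in, StandardCharsets.UTF_8));
    }
}
```

Because the stream is reset after sniffing, the caller can hand the same stream to a reader built with the detected encoding and parse the feed from its first byte.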

PatrickGotthard commented 10 years ago
=== This comment was migrated from JIRA ===
Author: anonymous
Created: Thu Sep 15 20:55:32 CEST 2011
===========================================

Attachment rome-0.9.1-dev.jar has been added with description: http://java.net/jira/secure/attachment/26734/rome-0.9.1-dev.jar

PatrickGotthard commented 10 years ago
=== This comment was migrated from JIRA ===
Author: anonymous
Created: Thu Sep 15 20:55:33 CEST 2011
===========================================

Attachment XmlReader.java has been added with description: http://java.net/jira/secure/attachment/26735/XmlReader.java