prof18 / RSS-Parser

A Kotlin Multiplatform library to parse a RSS Feed
Apache License 2.0
516 stars 128 forks

Parsing feed fails if it has html encoded characters #204

Open shtolik opened 3 months ago

shtolik commented 3 months ago

Describe the bug

I tried to parse the feed https://myrskyla.fi/feed/, but one of its title tags contains the HTML entity &auml; instead of the literal character ä. This raises an exception, and parsing fails on both Android and iOS.

Android:

RssParsingException(message=Something went wrong during the parsing of the feed. Please check if the XML is valid, cause=org.xmlpull.v1.XmlPullParserException: unresolved: &auml; (position:TEXT @11:22 in java.io.InputStreamReader@4290534) )
at com.prof18.rssparser.internal.AndroidXmlParser$parseXML$2.invokeSuspend(AndroidXmlParser.kt:67)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:104)
at kotlinx.coroutines.internal.LimitedDispatcher$Worker.run(LimitedDispatcher.kt:111)
at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:99)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:585)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:802)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:706)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:693)
Caused by: org.xmlpull.v1.XmlPullParserException: unresolved: &auml; (position:TEXT @11:22 in java.io.InputStreamReader@4290534)
at com.android.org.kxml2.io.KXmlParser.checkRelaxed(KXmlParser.java:305)
at com.android.org.kxml2.io.KXmlParser.readEntity(KXmlParser.java:1285)
at com.android.org.kxml2.io.KXmlParser.readValue(KXmlParser.java:1402)
at com.android.org.kxml2.io.KXmlParser.next(KXmlParser.java:393)
at com.android.org.kxml2.io.KXmlParser.next(KXmlParser.java:313)
at com.android.org.kxml2.io.KXmlParser.nextText(KXmlParser.java:2077)
at com.prof18.rssparser.internal.XmlPullParser_Kt.nextTrimmedText(XmlPullParser+.kt:5)
at com.prof18.rssparser.internal.rss.RssParserKt.extractRSSContent(RssParser.kt:289)
at com.prof18.rssparser.internal.AndroidXmlParser$parseXML$2.invokeSuspend(AndroidXmlParser.kt:54)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) 
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:104) 
at kotlinx.coroutines.internal.LimitedDispatcher$Worker.run(LimitedDispatcher.kt:111) 
at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:99) 
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:585) 
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:802) 
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:706) 
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:693) 

iOS:

0   composeui                           0x10e50c5d7        kfun:kotlin.Throwable#<init>(){} + 95 (/opt/buildAgent/work/b2e1db4d8d903ca4/kotlin/kotlin-native/runtime/src/main/kotlin/kotlin/Throwable.kt:32:28)
1   composeui                           0x10e50589f        kfun:kotlin.Exception#<init>(){} + 87 (/opt/buildAgent/work/b2e1db4d8d903ca4/kotlin/kotlin-native/runtime/src/main/kotlin/kotlin/Exceptions.kt:21:35)
2   composeui                           0x110063c33        kfun:com.prof18.rssparser.exception.RssParsingException#<init>(kotlin.String?;kotlin.Throwable?){} + 107 (/Users/runner/work/RSS-Parser/RSS-Parser/rssparser/src/commonMain/kotlin/com/prof18/rssparser/exception/RssParsingException.kt:12:5)
3   composeui                           0x11008ed37        kfun:com.prof18.rssparser.internal.IosXmlParser.parseXML$lambda$3$lambda$1#internal + 299 (/Users/runner/work/RSS-Parser/RSS-Parser/rssparser/src/iosMain/kotlin/com/prof18/rssparser/internal/IosXmlParser.kt:32:33)
4   composeui                           0x11008fc37        kfun:com.prof18.rssparser.internal.IosXmlParser.$parseXML$lambda$3$lambda$1$FUNCTION_REFERENCE$2.invoke#internal + 103 (/Users/runner/work/RSS-Parser/RSS-Parser/rssparser/src/iosMain/kotlin/com/prof18/rssparser/internal/IosXmlParser.kt:26:13)

The link of the RSS Feed https://myrskyla.fi/feed/

I was able to work around it by manually replacing this entity (the other accented-character entities listed at http://www.javascripter.net/faq/accentedcharacters.htm are likely to cause the same failure):

val feedString = xmlFetcher.fetchXmlAsString(url)
// Replace HTML named entities, which the XML parser cannot resolve, with literal characters.
val feedStringFixed = feedString
    .replace("&auml;", "ä")
    .replace("&Ouml;", "Ö")
val channel = parser.parse(feedStringFixed)

However, I had to fetch the feed myself because the built-in XmlFetcher is an internal class. It would be good to:

  1. try unescaping such characters if parsing fails, and/or make the XmlFetcher interface accessible
  2. add the possibility to override or use XmlFetcher.
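
The manual replacement in this comment can be generalized with a small lookup table. The following is a minimal sketch, assuming the feed is already available as a string. The names replaceHtmlEntities, htmlEntities and namedEntity are hypothetical and not part of RSS-Parser, and the table lists only a few entities for illustration.

```kotlin
// Hypothetical helper, not part of RSS-Parser.
// Maps a few HTML named entities, which XML does not predefine, to literal characters.
val htmlEntities = mapOf(
    "auml" to "ä", "Auml" to "Ä",
    "ouml" to "ö", "Ouml" to "Ö",
    "aring" to "å", "Aring" to "Å",
    "nbsp" to "\u00A0",
)

// Matches named entity references such as &auml; and captures the name.
val namedEntity = Regex("&([A-Za-z][A-Za-z0-9]*);")

fun replaceHtmlEntities(xml: String): String =
    namedEntity.replace(xml) { match ->
        // Names missing from the table, including XML's own amp, lt, gt,
        // quot and apos, are left exactly as they were.
        htmlEntities[match.groupValues[1]] ?: match.value
    }
```

Note that this operates on the raw string, so it would also rewrite entity-like text inside CDATA sections, where such text is meant literally. Whether that matters depends on the feed.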

kbios commented 1 month ago

This also affects RSS feeds which fail to escape the ampersand when it's used in the text, like the arstechnica one (as of now): https://feeds.arstechnica.com/arstechnica/index

(Attached below for posterity) arstechnica.txt
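
An unescaped ampersand of this kind can be escaped before the string is handed to the parser. The following is a minimal sketch, assuming the feed has been fetched as a string. The name escapeBareAmpersands is hypothetical and not part of RSS-Parser.

```kotlin
// Hypothetical helper, not part of RSS-Parser.
// Matches an ampersand that does NOT start an entity or character reference,
// i.e. one not followed by "name;", "#digits;" or "#xhex;".
val bareAmpersand = Regex("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")

fun escapeBareAmpersands(xml: String): String =
    bareAmpersand.replace(xml, "&amp;")
```

The same caveat as for any string-level fix applies: ampersands inside CDATA sections are legal as they are, and this sketch would escape them too.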

prof18 commented 1 month ago

Thanks for reporting this issue. The "right" way would be to have the feed owner add the proper CDATA escape.

I've done some research and there's no "smart" way to fix that.

I'll consider adding some settings in the builder to allow replacing some strings, but for now, the suggested way is manually fetching the feed as a string and parsing it with the parse method.
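
The suggested workaround can be wrapped so that well-formed feeds are parsed untouched and the clean-up runs only after a failure. The following is a minimal sketch. The name parseWithFallback is hypothetical and not part of RSS-Parser; the parse parameter stands in for a call such as parser.parse(feedString) from the workaround earlier in this thread, and sanitize for whatever string replacement the feed needs.

```kotlin
// Hypothetical helper, not part of RSS-Parser.
// Declared inline so that the lambdas may call suspend functions
// when this helper is itself called from a suspend function.
inline fun <T> parseWithFallback(
    rawFeed: String,
    sanitize: (String) -> String,
    parse: (String) -> T,
): T =
    try {
        // First attempt: parse the feed exactly as it was fetched.
        parse(rawFeed)
    } catch (e: kotlin.coroutines.cancellation.CancellationException) {
        // Never swallow coroutine cancellation.
        throw e
    } catch (e: Exception) {
        // Second attempt on the cleaned-up string; a failure here propagates to the caller.
        parse(sanitize(rawFeed))
    }
```

Parsing twice costs extra time on broken feeds, but it avoids altering feeds that are already valid XML.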