prof18 / RSS-Parser

A Kotlin Multiplatform library to parse a RSS Feed
Apache License 2.0
503 stars 129 forks source link

Parsing feed fails if it has html encoded characters #204

Open shtolik opened 2 months ago

shtolik commented 2 months ago

Describe the bug I tried to parse the feed https://myrskyla.fi/feed/ but it contains in a title tag Ä instead of Ä which then leads to exceptions and failing to parse feed both on android and ios side. android:

RssParsingException(message=Something went wrong during the parsing of the feed. Please check if the XML is valid, cause=org.xmlpull.v1.XmlPullParserException: unresolved: ä (position:TEXT @11:22 in java.io.InputStreamReader@4290534) )
at com.prof18.rssparser.internal.AndroidXmlParser$parseXML$2.invokeSuspend(AndroidXmlParser.kt:67)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:104)
at kotlinx.coroutines.internal.LimitedDispatcher$Worker.run(LimitedDispatcher.kt:111)
at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:99)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:585)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:802)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:706)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:693)
Caused by: org.xmlpull.v1.XmlPullParserException: unresolved: ä (position:TEXT @11:22 in java.io.InputStreamReader@4290534)
at com.android.org.kxml2.io.KXmlParser.checkRelaxed(KXmlParser.java:305)
at com.android.org.kxml2.io.KXmlParser.readEntity(KXmlParser.java:1285)
at com.android.org.kxml2.io.KXmlParser.readValue(KXmlParser.java:1402)
at com.android.org.kxml2.io.KXmlParser.next(KXmlParser.java:393)
at com.android.org.kxml2.io.KXmlParser.next(KXmlParser.java:313)
at com.android.org.kxml2.io.KXmlParser.nextText(KXmlParser.java:2077)
at com.prof18.rssparser.internal.XmlPullParser_Kt.nextTrimmedText(XmlPullParser+.kt:5)
at com.prof18.rssparser.internal.rss.RssParserKt.extractRSSContent(RssParser.kt:289)
at com.prof18.rssparser.internal.AndroidXmlParser$parseXML$2.invokeSuspend(AndroidXmlParser.kt:54)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) 
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:104) 
at kotlinx.coroutines.internal.LimitedDispatcher$Worker.run(LimitedDispatcher.kt:111) 
at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:99) 
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:585) 
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:802) 
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:706) 
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:693) 

ios:

0   composeui                           0x10e50c5d7        kfun:kotlin.Throwable#<init>(){} + 95 (/opt/buildAgent/work/b2e1db4d8d903ca4/kotlin/kotlin-native/runtime/src/main/kotlin/kotlin/Throwable.kt:32:28)
1   composeui                           0x10e50589f        kfun:kotlin.Exception#<init>(){} + 87 (/opt/buildAgent/work/b2e1db4d8d903ca4/kotlin/kotlin-native/runtime/src/main/kotlin/kotlin/Exceptions.kt:21:35)
2   composeui                           0x110063c33        kfun:com.prof18.rssparser.exception.RssParsingException#<init>(kotlin.String?;kotlin.Throwable?){} + 107 (/Users/runner/work/RSS-Parser/RSS-Parser/rssparser/src/commonMain/kotlin/com/prof18/rssparser/exception/RssParsingException.kt:12:5)
3   composeui                           0x11008ed37        kfun:com.prof18.rssparser.internal.IosXmlParser.parseXML$lambda$3$lambda$1#internal + 299 (/Users/runner/work/RSS-Parser/RSS-Parser/rssparser/src/iosMain/kotlin/com/prof18/rssparser/internal/IosXmlParser.kt:32:33)
4   composeui                           0x11008fc37        kfun:com.prof18.rssparser.internal.IosXmlParser.$parseXML$lambda$3$lambda$1$FUNCTION_REFERENCE$2.invoke#internal + 103 (/Users/runner/work/RSS-Parser/RSS-Parser/rssparser/src/iosMain/kotlin/com/prof18/rssparser/internal/IosXmlParser.kt:26:13)

The link of the RSS Feed https://myrskyla.fi/feed/

I was able to fix it by replacing this (and some more likely offending chars http://www.javascripter.net/faq/accentedcharacters.htm) manually:

val feedString = xmlFetcher.fetchXmlAsString(url)
val feedStringFixed = feedString
            .replace("& auml;", "Ä")
            .replace("& Ouml;", "Ö")
val channel = parser.parse(feedStringFixed)

But i needed to fetch the feed myself because built-in XmlFetcher is internal class. So would be good to

  1. try unescaping chars if parsing files or/and making XmlFetcher interface accessible
  2. add possibility to override or use XmlFetcher.
kbios commented 4 days ago

This also affects RSS feeds which fail to escape the ampersand when it's used in the text, like the arstechnica one (as of now): https://feeds.arstechnica.com/arstechnica/index

(Attached below for posterity) arstechnica.txt