prof18 / RSS-Parser

A Kotlin Multiplatform library to parse a RSS Feed
Apache License 2.0

SAXParseException trying to parse a feed #205

Closed orelvis15 closed 1 month ago

orelvis15 commented 3 months ago

I'm trying to parse this feed, https://cu.usembassy.gov/es/feed/, but it fails with the error below. I looked for an error in the XML but couldn't find one.

Something went wrong during the parsing of the feed. Please double check if the XML is valid
2024-08-09T06:49:57.147232+00:00 app[web.1]: com.prof18.rssparser.exception.RssParsingException: Something went wrong during the parsing of the feed. Please double check if the XML is valid
2024-08-09T06:49:57.147233+00:00 app[web.1]:    at com.prof18.rssparser.internal.JvmXmlParser$parseXML$2.invokeSuspend(JvmXmlParser.kt:37)
2024-08-09T06:49:57.147233+00:00 app[web.1]:    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
2024-08-09T06:49:57.147234+00:00 app[web.1]:    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:104)
2024-08-09T06:49:57.147235+00:00 app[web.1]:    at kotlinx.coroutines.internal.LimitedDispatcher$Worker.run(LimitedDispatcher.kt:111)
2024-08-09T06:49:57.147236+00:00 app[web.1]:    at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:99)
2024-08-09T06:49:57.147236+00:00 app[web.1]:    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:584)
2024-08-09T06:49:57.147237+00:00 app[web.1]:    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:811)
2024-08-09T06:49:57.147237+00:00 app[web.1]:    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:715)
2024-08-09T06:49:57.147237+00:00 app[web.1]:    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:702)
2024-08-09T06:49:57.147238+00:00 app[web.1]: Caused by: org.xml.sax.SAXParseException: The element type "meta" must be terminated by the matching end-tag "</meta>".
2024-08-09T06:49:57.147239+00:00 app[web.1]:    at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:204)
2024-08-09T06:49:57.147249+00:00 app[web.1]:    at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:178)
2024-08-09T06:49:57.147250+00:00 app[web.1]:    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
2024-08-09T06:49:57.147250+00:00 app[web.1]:    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
2024-08-09T06:49:57.147250+00:00 app[web.1]:    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1465)
2024-08-09T06:49:57.147252+00:00 app[web.1]:    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1695)
2024-08-09T06:49:57.147252+00:00 app[web.1]:    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2899)
2024-08-09T06:49:57.147252+00:00 app[web.1]:    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
2024-08-09T06:49:57.147253+00:00 app[web.1]:    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:542)
2024-08-09T06:49:57.147253+00:00 app[web.1]:    at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:889)
2024-08-09T06:49:57.147253+00:00 app[web.1]:    at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:825)
2024-08-09T06:49:57.147254+00:00 app[web.1]:    at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
2024-08-09T06:49:57.147254+00:00 app[web.1]:    at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1224)
2024-08-09T06:49:57.147254+00:00 app[web.1]:    at java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:637)
2024-08-09T06:49:57.147254+00:00 app[web.1]:    at java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:326)
2024-08-09T06:49:57.147255+00:00 app[web.1]:    at java.xml/javax.xml.parsers.SAXParser.parse(SAXParser.java:197)
2024-08-09T06:49:57.147255+00:00 app[web.1]:    at com.prof18.rssparser.internal.JvmXmlParser$parseXML$2.invokeSuspend(JvmXmlParser.kt:32)
2024-08-09T06:49:57.147256+00:00 app[web.1]:    ... 8 common frames omitted
prof18 commented 1 month ago

Hi,

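The parser is not working because the website blocks "non-browser" user agents and returns an error page instead of the feed. That page is presumably HTML rather than XML, which would explain the SAXParseException about the unterminated "meta" element.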
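To illustrate how an HTML page can surface as this kind of SAXParseException, here is a minimal sketch that feeds a small HTML document to the JDK's SAX parser directly. The `htmlErrorPage` string and the `parseErrorOrNull` helper are hypothetical and made up for illustration; the actual page returned by the site is not reproduced in this thread.

```kotlin
import org.xml.sax.InputSource
import org.xml.sax.SAXParseException
import org.xml.sax.helpers.DefaultHandler
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory

// Hypothetical stand-in for an HTML error page (assumption: the real page is similar).
val htmlErrorPage = """
    <html>
      <head>
        <meta charset="utf-8">
        <title>Access Denied</title>
      </head>
      <body>Request blocked</body>
    </html>
""".trimIndent()

// Returns the parser's error message, or null if the input is well-formed XML.
fun parseErrorOrNull(xml: String): String? =
    try {
        SAXParserFactory.newInstance()
            .newSAXParser()
            .parse(InputSource(StringReader(xml)), DefaultHandler())
        null
    } catch (e: SAXParseException) {
        e.message
    }

fun main() {
    // The unclosed <meta> tag is valid HTML but not well-formed XML.
    println(parseErrorOrNull(htmlErrorPage))
}
```

HTML allows `<meta>` without a closing tag, but XML does not, so a strict XML parser stops at the first closing tag that does not match the open element.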


You can resolve that by customizing the User-Agent header and sending the same User-Agent that a browser uses:

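import com.prof18.rssparser.RssParserBuilder
import okhttp3.OkHttpClient

// Build the parser on top of an OkHttp client that overrides the User-Agent
// header of every outgoing request with a browser-like value.
val rssParser = RssParserBuilder(
    OkHttpClient.Builder()
        .addNetworkInterceptor { chain ->
            chain.proceed(
                chain.request()
                    .newBuilder()
                    .header(
                        "User-Agent",
                        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
                    )
                    .build()
            )
        }
        .build()
).build()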
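When debugging a similar failure, it can help to check whether the response body is a feed at all before blaming the XML. The following is a rough sketch of such a check; `looksLikeFeed` is a hypothetical helper, not part of RSS-Parser, and the regex is only a heuristic that looks at the first element name in the body.

```kotlin
// Hypothetical helper (not part of RSS-Parser): guesses whether a response body
// is an RSS, Atom, or RDF feed by inspecting the first element name it finds.
// The XML declaration, comments, and doctype are skipped because they do not
// start with "<" followed by a letter.
fun looksLikeFeed(body: String): Boolean {
    val rootTag = Regex("""<([A-Za-z][\w:.-]*)""").find(body)?.groupValues?.get(1)
    return rootTag != null && rootTag.lowercase() in setOf("rss", "feed", "rdf:rdf")
}

fun main() {
    val feed = """<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel/></rss>"""
    val html = """<!DOCTYPE html><html><head><meta charset="utf-8"></head></html>"""
    println(looksLikeFeed(feed))
    println(looksLikeFeed(html))
}
```

If the check fails, logging the first few hundred characters of the body usually shows whether the server returned a block page.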

Closing because it's not a bug.