rbren / rss-parser

A lightweight RSS parser, for Node and the browser
MIT License

Error (sax) reading one particular RSS 2.0 MediaRSS feed #205

yPhil-gh closed this 3 years ago

yPhil-gh commented 3 years ago

Hi, thanks a lot for this project, it rocks.

This feed: https://exode.me/feeds/videos.xml?videoChannelId=484

errors:

Error: Status code 406
0|petrolette|at ClientRequest.<anonymous> (node_modules/rss-parser/lib/parser.js:88:25)
0|petrolette|at Object.onceWrapper (events.js:422:26)
0|petrolette|at ClientRequest.emit (events.js:315:20)
0|petrolette|at HTTPParser.parserOnIncomingClient (_http_client.js:641:27)
0|petrolette|at HTTPParser.parserOnHeadersComplete (_http_common.js:126:17)
0|petrolette|at TLSSocket.socketOnData (_http_client.js:509:22)
0|petrolette|at TLSSocket.emit (events.js:315:20)
0|petrolette|at addChunk (internal/streams/readable.js:309:12)
0|petrolette|at readableAddChunk (internal/streams/readable.js:284:9)
0|petrolette|at TLSSocket.Readable.push (internal/streams/readable.js:223:10)

The validator reports warnings, but other readers, such as node-feedParser (1), can read the feed, as one can see here.

Apparently the problem lies in unrecognized tags, like <media:community>. I tried various options:


const Parser = require('rss-parser');

let parser = new Parser({
  headers: {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml'
  },
  maxRedirects: 100,
  requestOptions: {
    rejectUnauthorized: false
  },
  defaultRSS: 2.0,
  xml2js: {
    emptyTag: 'media:community',
  },
  customFields: {
    item: [
      ['media:community', 'media:content', {keepArray: true}],
    ]
  }
});

But none of them seems to work for this feed. Can this be (fingers crossed) sorted using an option, or is it a #bug?
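For what it's worth, HTTP 406 means "Not Acceptable": the server could not produce a response matching the request's Accept header. Below is a minimal sketch, assuming the server negotiates on Accept (not confirmed for this feed), of building a `headers` option that advertises feed MIME types instead of `text/html`. `FEED_ACCEPT` and `withFeedAccept` are hypothetical names, not part of rss-parser.

```javascript
// Hypothetical helper: returns a copy of the parser options whose Accept
// header lists feed types. Assumption: the 406 comes from content negotiation.
const FEED_ACCEPT = [
  'application/rss+xml',
  'application/atom+xml',
  'application/xml;q=0.9',
  'text/xml;q=0.8',
  '*/*;q=0.1',
].join(', ');

function withFeedAccept(options = {}) {
  const headers = {};
  // Copy every header except an existing Accept (header names are case-insensitive).
  for (const [name, value] of Object.entries(options.headers || {})) {
    if (name.toLowerCase() !== 'accept') headers[name] = value;
  }
  headers.accept = FEED_ACCEPT;
  return { ...options, headers };
}

// Usage (needs rss-parser installed):
// const Parser = require('rss-parser');
// const parser = new Parser(withFeedAccept({ maxRedirects: 100 }));
```

The other options are passed through untouched, so this can wrap the existing configuration.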

EDIT: so far I have traced it to a known sax error:

Error: Non-whitespace before first tag.
Line: 0
Column: 1
Char: {
at error (node_modules/sax/lib/sax.js:651:10)
at strictFail (node_modules/sax/lib/sax.js:677:7)
at beginWhiteSpace (node_modules/sax/lib/sax.js:951:7)
at SAXParser.write (node_modules/sax/lib/sax.js:1006:11)
at Parser.exports.Parser.Parser.parseString (node_modules/xml2js/lib/parser.js:323:31)
at Parser.parseString (node_modules/xml2js/lib/parser.js:5:59)
at node_modules/rss-parser/lib/parser.js:33:22
at new Promise (<anonymous>)
at Parser.parseString (node_modules/rss-parser/lib/parser.js:32:16)
at IncomingMessage.<anonymous> (node_modules/rss-parser/lib/parser.js:96:23)
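The `Char: {` at line 0, column 1 suggests that the body handed to sax started with `{`, which looks more like a JSON payload than XML. Here is a minimal sketch, assuming the body is available as a string, of a check to run before parsing. `classifyBody` is a hypothetical helper, not part of rss-parser or sax.

```javascript
// Hypothetical helper: guess what kind of document a response body is
// by looking at its first non-whitespace character.
function classifyBody(body) {
  // Drop a UTF-8 byte order mark, then any leading whitespace.
  const text = body.replace(/^\uFEFF/, '').trimStart();
  if (text.startsWith('<')) return 'xml';
  if (text.startsWith('{') || text.startsWith('[')) return 'json';
  return 'unknown';
}
```

If this reports 'json' for the body the parser received, the XML parser is probably not at fault; the server answered with something other than the feed.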

When downloaded via wget, the feed doesn't have anything before the first tag. It does, however, end with a % after the last tag, so maybe that's the problem:

head -2 videos.xml\?videoChannelId=484
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/">

tail -2 videos.xml\?videoChannelId=484
    </channel>
</rss>%

Feeds that work don't have that trailing char.
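One way to check whether the % is really in the file is to look at its last byte. Some shells (zsh by default, for instance) print a marker such as % when output does not end with a newline, so the character may come from the terminal rather than the feed; whether that applies here is an assumption. In this sketch, `describeTail` is a hypothetical helper.

```javascript
// Hypothetical helper: report what the last byte of a downloaded file is.
function describeTail(buf) {
  if (buf.length === 0) return 'empty';
  const last = buf[buf.length - 1];
  if (last === 0x0a) return 'newline'; // file ends with "\n"
  if (last === 0x25) return 'percent'; // file really ends with "%"
  return 'other';                      // e.g. ends right after "</rss>"
}

// Usage:
// const fs = require('fs');
// console.log(describeTail(fs.readFileSync('videos.xml?videoChannelId=484')));
```

A result of 'other' would mean the file simply lacks a final newline and the % is not part of the feed.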

At first I thought "Oh well, rss-parser is not MediaRSS-compatible", but:

  1. I have at least one MediaRSS feed here that it reads fine
  2. The Atom version of this very RSS feed (the server offers MediaRSS 2.0, Atom 1.0 and JSON 1.0), which, by the way, validates, also fails with error 406 and also has the trailing %:
# head -2 videos.atom\?videoChannelId=484                    
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
# tail -2 videos.atom\?videoChannelId=484
    </entry>
</feed>% 

Can sax errors be ignored somehow, as they seem to be "generally non critical" (I read this now and then, and this seems to be just the case right now)?

(1) The opposite can also be true: I have one feed here that only rss-reader can read, thanks to the rejectUnauthorized: false trick :)

IlyaDiallo commented 3 years ago

at strictFail (node_modules/sax/lib/sax.js:677:7)

What's the result of the strict: false option of xml2js ? https://github.com/Leonidas-from-XIV/node-xml2js#options
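For reference, a minimal sketch of what that would look like, assuming the `xml2js` key of the rss-parser options is forwarded to xml2js as in the configuration posted above. Only the options object is shown; the variable name `options` is arbitrary.

```javascript
// Options object only; pass it to `new Parser(options)` from rss-parser.
const options = {
  xml2js: {
    // xml2js option (default: true). false puts the underlying sax parser
    // into non-strict mode.
    strict: false,
  },
};
```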

yPhil-gh commented 3 years ago

What's the result of the strict: false option of xml2js ?

Same :(

rbren commented 3 years ago

You can try downloading the feed yourself, sanitizing the string, and then passing it to parseString.
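A minimal sketch of the sanitizing step, assuming the junk sits before the first tag or after the last one. `sanitizeFeedXml` is a hypothetical helper, not part of rss-parser, and what actually needs stripping for this feed is unknown.

```javascript
// Hypothetical helper: keep only the text between the first "<" and the
// last ">" of a downloaded body, dropping a BOM and any stray characters
// around the document.
function sanitizeFeedXml(raw) {
  const text = raw.replace(/^\uFEFF/, '');
  const start = text.indexOf('<');
  const end = text.lastIndexOf('>');
  if (start === -1 || end < start) {
    // Nothing tag-like at all, e.g. a JSON error body.
    throw new Error('Response does not look like XML');
  }
  return text.slice(start, end + 1);
}

// Usage (needs rss-parser installed; `body` is the string you downloaded):
// const Parser = require('rss-parser');
// const feed = await new Parser().parseString(sanitizeFeedXml(body));
```

This only trims the edges; it will not repair malformed markup inside the document.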

rbren commented 3 years ago

Closing this as it seems like an issue w/ the underlying feed.