rbren / rss-parser

A lightweight RSS parser, for Node and the browser
MIT License

Reddit changed their RSS format, (partially) invalidating the README #199

Closed Pomax closed 3 years ago

Pomax commented 3 years ago

I have a simple Reddit image board catch-up program that grabs the Reddit RSS feed for a subreddit (much like the README shows), and then downloads all images posted "since some date" by looking at the <pubDate> of each entry. However, Reddit changed its RSS format and no longer reports datetimes using <pubDate> (so the parsed items have no pubDate or isoDate to work with either). It now uses <published> and <updated>, and those fields are not automatically parsed.

It might be worth updating the README, or even changing rss-parser to parse all nodes by default (in a v4, to avoid breaking codebases that rely on v3's default behaviour), and to parse only the listed fields when the user needs more control and manually specifies the fields they need.

For example, https://www.reddit.com/r/redpandas/new.rss?limit=1 yields the following RSS:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
    xmlns:media="http://search.yahoo.com/mrss/">
    <category term="redpandas" label="r/redpandas"/>
    <updated>2021-04-02T16:06:59+00:00</updated>
    <icon>https://www.redditstatic.com/icon.png/</icon>
    <id>/r/redpandas/new.rss?limit=1</id>
    <link rel="self" href="https://www.reddit.com/r/redpandas/new.rss?limit=1" type="application/atom+xml" />
    <link rel="alternate" href="https://www.reddit.com/r/redpandas/new?limit=1" type="text/html" />
    <logo>https://a.thumbs.redditmedia.com/SU2rJah4uwVYZrBB.png</logo>
    <subtitle>The place for all things red panda!</subtitle>
    <title>newest submissions : redpandas</title>
    <entry>
        <author>
            <name>/u/BarrsTool</name>
            <uri>https://www.reddit.com/user/BarrsTool</uri>
        </author>
        <category term="redpandas" label="r/redpandas"/>
        <content type="html">&amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/BarrsTool&quot;&gt; /u/BarrsTool &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://www.facebook.com/52096703878/posts/10159036808223879/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/redpandas/comments/mi6jan/vote_for_ben_the_red_panda_he_needs_your_help/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content>
        <id>t3_mi6jan</id>
        <link href="https://www.reddit.com/r/redpandas/comments/mi6jan/vote_for_ben_the_red_panda_he_needs_your_help/" />
        <updated>2021-04-01T22:20:00+00:00</updated>
        <published>2021-04-01T22:20:00+00:00</published>
        <title>Vote for Ben the Red Panda! He needs your help!</title>
    </entry>
</feed>

So rss-parser won't see any datetimes associated with entries.
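
To make the "parse everything by default, only the listed fields otherwise" suggestion above concrete, here is a rough sketch of that rule. It is purely illustrative: `copyFields` is a hypothetical helper, not part of rss-parser, and `entry` stands in for an already-parsed entry as a plain object.

```javascript
// Hypothetical sketch of the proposed default; this is NOT rss-parser's code.
// `node` is assumed to be an already-parsed entry represented as a plain object.
function copyFields(node, fields) {
  // No field list given: copy every child node.
  // Field list given: copy only the names the caller asked for.
  const names = fields === undefined ? Object.keys(node) : fields;
  const out = {};
  for (const name of names) {
    if (Object.prototype.hasOwnProperty.call(node, name)) {
      out[name] = node[name];
    }
  }
  return out;
}

// Values taken from the entry in the feed above.
const entry = {
  id: "t3_mi6jan",
  updated: "2021-04-01T22:20:00+00:00",
  published: "2021-04-01T22:20:00+00:00",
};

console.log(copyFields(entry));                // every field is copied
console.log(copyFields(entry, ["published"])); // only `published` is copied
```

Requested names that the entry does not have are simply skipped, so a caller can pass one list for feeds with different shapes.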

Pomax commented 3 years ago

I also tried the following, but that doesn't seem to work:

const Parser = require("rss-parser");
const parser = new Parser({
  customFields: {
    item: ["updated", "published"],
    entry: ["updated", "published"],
  },
});

Whether I use item or entry, the resulting parsed object does not contain an .updated or .published property to work with.
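
Independent of how the fields end up on the parsed object, the "since some date" check can be made less brittle by trying several date fields in turn. A minimal sketch, assuming the items are plain objects and that any of the field names mentioned in this thread (isoDate, pubDate, published, updated) might be present; `entryDate` and `postedSince` are hypothetical helpers, not rss-parser functions:

```javascript
// Hypothetical helpers; the field names come from this thread, and which of
// them are actually present depends on the feed and the parser version.
const DATE_FIELDS = ["isoDate", "pubDate", "published", "updated"];

// Return the first field that parses as a valid date, or null if none does.
function entryDate(item) {
  for (const field of DATE_FIELDS) {
    const value = item[field];
    if (typeof value !== "string" || value.trim() === "") continue;
    const date = new Date(value);
    if (!Number.isNaN(date.getTime())) return date;
  }
  return null;
}

// Keep only the items dated at or after `cutoff`; undated items are dropped.
function postedSince(items, cutoff) {
  return items.filter((item) => {
    const date = entryDate(item);
    return date !== null && date >= cutoff;
  });
}

// Item-shaped objects using timestamps from the feeds in this thread.
const items = [
  { title: "Vote for Ben the Red Panda! He needs your help!", published: "2021-04-01T22:20:00+00:00" },
  { title: "Sha-lei was ready for her glamour shots", isoDate: "2021-04-03T15:15:30.000Z" },
  { title: "no date at all" },
];
const recent = postedSince(items, new Date("2021-04-02T00:00:00Z"));
console.log(recent.map((item) => item.title));
```

With this, a change in which date field the feed exposes does not silently drop every entry, although undated entries are still skipped.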

Pomax commented 3 years ago

Hmmm... so it's been broken for the last few days, but I tried again this morning and, even though the entry XML doesn't look any different, things work again.

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
    xmlns:media="http://search.yahoo.com/mrss/">
    <category term="redpandas" label="r/redpandas"/>
    <updated>2021-04-03T18:18:30+00:00</updated>
    <icon>https://www.redditstatic.com/icon.png/</icon>
    <id>/r/redpandas/new.rss?limit=1</id>
    <link rel="self" href="https://www.reddit.com/r/redpandas/new.rss?limit=1" type="application/atom+xml" />
    <link rel="alternate" href="https://www.reddit.com/r/redpandas/new?limit=1" type="text/html" />
    <logo>https://a.thumbs.redditmedia.com/SU2rJah4uwVYZrBB.png</logo>
    <subtitle>The place for all things red panda!</subtitle>
    <title>newest submissions : redpandas</title>
    <entry>
        <author>
            <name>/u/li_the_great</name>
            <uri>https://www.reddit.com/user/li_the_great</uri>
        </author>
        <category term="redpandas" label="r/redpandas"/>
        <content type="html">&lt;table&gt; &lt;tr&gt;&lt;td&gt; &lt;a href=&quot;https://www.reddit.com/r/redpandas/comments/mjag8c/shalei_was_ready_for_her_glamour_shots_roger/&quot;&gt; &lt;img src=&quot;https://external-preview.redd.it/oQ-V3Hv71lKdiVS3m0Jvnr36R1ZkM6WkeJhzQL6lTRc.jpg?width=640&amp;amp;crop=smart&amp;amp;auto=webp&amp;amp;s=67b5d2cf8a74165ffab2186585c958af5dd70817&quot; alt=&quot;Sha-lei was ready for her glamour shots - Roger Williams Park Zoo, Rhode Island&quot; title=&quot;Sha-lei was ready for her glamour shots - Roger Williams Park Zoo, Rhode Island&quot; /&gt; &lt;/a&gt; &lt;/td&gt;&lt;td&gt; &amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/li_the_great&quot;&gt; /u/li_the_great &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://imgur.com/Zhp1TLA&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/redpandas/comments/mjag8c/shalei_was_ready_for_her_glamour_shots_roger/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt; &lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</content>
        <id>t3_mjag8c</id>
        <media:thumbnail url="https://external-preview.redd.it/oQ-V3Hv71lKdiVS3m0Jvnr36R1ZkM6WkeJhzQL6lTRc.jpg?width=640&amp;crop=smart&amp;auto=webp&amp;s=67b5d2cf8a74165ffab2186585c958af5dd70817" />
        <link href="https://www.reddit.com/r/redpandas/comments/mjag8c/shalei_was_ready_for_her_glamour_shots_roger/" />
        <updated>2021-04-03T15:15:30+00:00</updated>
        <published>2021-04-03T15:15:30+00:00</published>
        <title>Sha-lei was ready for her glamour shots - Roger Williams Park Zoo, Rhode Island</title>
    </entry>
</feed>

The only difference I see is the new media:thumbnail element, but that has nothing to do with figuring out the pubDate/isoDate, so... I have no idea what happened. I'll mark this as invalid and refile if it happens again.