lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Spurious feed updates due to RSS feed lastBuildDate #231

Closed by lemon24 3 years ago

lemon24 commented 3 years ago

Some feeds, like the www.scpwiki.com example in #225, also change their .updated even when the content has not changed (the value is based on a timestamp). As a result, the feed always appears to have been updated within the past hour, even if nothing actually changed.

Note: This was happening before #179 as well.

For the specific feed below, .updated seems to come from the RSS lastBuildDate element, even though the feedparser docs say it comes from dc:date. The documentation never mentions lastBuildDate, but the source refers to it in many places, which suggests the behavior is by design.

>>> import feedparser
>>> url = 'http://www.scpwiki.com/feed/pages/created_by/qntm/t/SCP%20Foundation%3A%20qntm'
>>> f = feedparser.parse(url)
>>> f.feed.updated
'Mon, 29 Mar 2021 18:38:46 +0000'

The feed XML (items omitted):

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wikidot="http://www.wikidot.com/rss-namespace">
    <channel>
        <title>SCP Foundation: qntm</title>
        <link>http://scp-wiki.wikidot.com</link>
        <description>REDACTED</description>
        <copyright></copyright>
        <lastBuildDate>Mon, 29 Mar 2021 18:38:46 +0000</lastBuildDate>

        <!-- <item> elements -->

    </channel>
</rss>
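
To make the problem concrete, here is a minimal sketch using only the standard library. It is not reader's or feedparser's actual code. It compares two snapshots of a feed while ignoring lastBuildDate. The helper name content_fingerprint, the VOLATILE_TAGS set, and the sample documents (including the example item) are made up for illustration; the second snapshot differs from the first only in its lastBuildDate.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical: channel-level elements that change on every build.
VOLATILE_TAGS = {"lastBuildDate"}


def content_fingerprint(rss_text: str) -> str:
    """Hash the <channel> element with the volatile elements removed."""
    root = ET.fromstring(rss_text)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: missing <channel>")
    # Iterate over a copy, since we remove children while looping.
    for child in list(channel):
        if child.tag in VOLATILE_TAGS:
            channel.remove(child)
    return hashlib.sha256(ET.tostring(channel, encoding="utf-8")).hexdigest()


# Illustrative sample feed; only the date is filled in per snapshot.
TEMPLATE = """<rss version="2.0">
    <channel>
        <title>SCP Foundation: qntm</title>
        <link>http://scp-wiki.wikidot.com</link>
        <lastBuildDate>{date}</lastBuildDate>
        <item><title>An example item</title></item>
    </channel>
</rss>"""

first = TEMPLATE.format(date="Mon, 29 Mar 2021 18:38:46 +0000")
second = TEMPLATE.format(date="Mon, 29 Mar 2021 19:38:46 +0000")

# The raw documents differ, but nothing other than the timestamp changed.
raw_changed = first != second
content_changed = content_fingerprint(first) != content_fingerprint(second)
print(raw_changed, content_changed)
```

With these two snapshots the raw text differs while the fingerprint does not, which is the situation described above: the feed looks updated although only the timestamp moved.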

lemon24 commented 3 years ago

So, the simplest solution is to:

We could also limit the number of consecutive feed updates, but that doesn't seem necessary.
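
For reference, capping consecutive updates could look roughly like the sketch below. This is an assumption about what such a limit might mean, not something implemented in reader; the class name UpdateLimiter and the limit of 3 are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UpdateLimiter:
    """Hypothetical cap on how many fetches in a row may update a feed."""

    limit: int = 3  # arbitrary value, for illustration only
    streak: int = 0

    def should_update(self, changed: bool) -> bool:
        if not changed:
            # An unchanged fetch ends the streak.
            self.streak = 0
            return False
        if self.streak >= self.limit:
            # Too many updates in a row; treat further changes as noise.
            return False
        self.streak += 1
        return True


limiter = UpdateLimiter(limit=2)
decisions = [limiter.should_update(c) for c in (True, True, True, False, True)]
print(decisions)
```

Once the cap is reached, further changes are ignored until a fetch comes back unchanged, at which point the streak resets.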

lemon24 commented 3 years ago

Time spent: 2.5h (1h on the initial investigation, 1.5h on the implementation).