mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

Add source information to RSS output #29

Closed philbudne closed 4 months ago

philbudne commented 8 months ago

Improve traceability of story source by adding feed/source information to generated RSS file.

The RSS 2.0 specification says this about the <source> element:

4.1.1.20.9 source

An item's source element indicates the fact that the item has been republished from another RSS feed (optional). The element must have a url attribute that identifies the URL of the source feed.

The value of the source could be the title of the source feed.

<source url="http://la.example.com/rss.xml">Los Angeles Herald-Examiner</source>

The original feed URL alone would be sufficient; the feed title need not be included. We could also add feed_id and sources_id attributes to the source element.
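For illustration, here is a minimal sketch of what emitting such an element could look like with Python's standard xml.etree.ElementTree. This is not the rss-fetcher code: the helper name add_source and the attribute names feed_id and sources_id are assumptions used only to make the idea concrete. Only the url attribute is required by the RSS 2.0 spec.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def add_source(item: ET.Element, feed_url: str,
               feed_id: Optional[int] = None,
               sources_id: Optional[int] = None) -> ET.Element:
    """Append an RSS 2.0 <source> element to an <item> (hypothetical helper)."""
    # "url" is the one attribute the RSS 2.0 spec requires on <source>.
    source = ET.SubElement(item, "source", {"url": feed_url})
    # The two attributes below are non-standard and their names are assumed.
    if feed_id is not None:
        source.set("feed_id", str(feed_id))
    if sources_id is not None:
        source.set("sources_id", str(sources_id))
    return source


# Build a one-item example and serialize it.
item = ET.Element("item")
ET.SubElement(item, "link").text = "http://la.example.com/story.html"
add_source(item, "http://la.example.com/rss.xml", feed_id=1234, sources_id=4321)
xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
```

The element is left without text content here, since the feed title is optional; a title could be set through source.text if it were wanted.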

philbudne commented 5 months ago

@rahulbot @pgulley this is the rss-fetcher issue I mentioned regarding support for https://github.com/mediacloud/story-indexer/issues/216

Opinions on what data we should add? The feed URL in the url attribute? feed_id and/or sources_id in custom attributes? Do we need to consult with other users of our RSS files?

rahulbot commented 5 months ago

I'd say the url attribute of source should be the RSS feed URL, perhaps with some custom attributes alongside it. An idea:

<item>
    ...
    <source url="https://url.to.rss/feed" mcFeedId="1234" mcSourceId="4321" />
    ...
</item>

(I don't think having bigger RSS files is much of a concern)
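On the consuming side, a reader of the RSS file could recover the feed and source for each story roughly as sketched below. This is a sketch under assumptions: the attribute names mcFeedId and mcSourceId are taken from the idea above and are not confirmed as what shipped, and the function name story_sources and the sample document are hypothetical. Items without a source element yield None values, so older files would still parse.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the second item has no <source>, as in older files.
SAMPLE = """<rss version="2.0"><channel>
<item>
  <link>https://example.com/story</link>
  <source url="https://url.to.rss/feed" mcFeedId="1234" mcSourceId="4321" />
</item>
<item>
  <link>https://example.com/older-story</link>
</item>
</channel></rss>"""


def story_sources(rss_text):
    """Yield (link, feed_url, feed_id, source_id) per item; None where absent."""
    root = ET.fromstring(rss_text)
    for item in root.iter("item"):
        source = item.find("source")
        attrs = source.attrib if source is not None else {}
        # Attribute names are assumed from the proposal in this thread.
        feed_id = attrs.get("mcFeedId")
        source_id = attrs.get("mcSourceId")
        yield (
            item.findtext("link"),
            attrs.get("url"),
            int(feed_id) if feed_id is not None else None,
            int(source_id) if source_id is not None else None,
        )


rows = list(story_sources(SAMPLE))
for row in rows:
    print(row)
```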

rahulbot commented 4 months ago

Closing - this got rolled out to production already.