gugray / rss-parrot

Notifies Mastodon accounts about new posts in the RSS feeds they follow
https://rss-parrot.net
MIT License
109 stars, 7 forks

Detect duplicate title and body #31

Status: Closed (TheLastBoyScoutUK closed this 7 months ago)

TheLastBoyScoutUK commented 7 months ago

Some feeds (e.g. those provided by nitter.cz) duplicate the article title in the body. Is it possible to detect this and remove one or the other?

Example:

Screenshot_20240120-163311-226
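
For illustration only, here is a rough Python sketch of the kind of check being asked for: treat the body as a duplicate when it says nothing beyond the title once markup, entities, whitespace and letter case are normalized. This is not RSS Parrot code; the function names (`normalize`, `body_duplicates_title`, `post_text`) are hypothetical, and the regex-based tag stripping is a simplification that assumes reasonably tidy feed HTML.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")   # naive tag matcher, fine for simple feed markup
_WS_RE = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Strip tags, decode entities, collapse whitespace, and casefold."""
    no_tags = _TAG_RE.sub(" ", text)
    decoded = html.unescape(no_tags)
    return _WS_RE.sub(" ", decoded).strip().casefold()


def body_duplicates_title(title: str, description: str) -> bool:
    """Return True when the description carries no text beyond the title."""
    t = normalize(title)
    d = normalize(description)
    return bool(t) and t == d


def post_text(title: str, description: str) -> str:
    """Hypothetical helper: drop the body when it merely repeats the title."""
    if body_duplicates_title(title, description):
        return title.strip()
    return f"{title.strip()}\n\n{description.strip()}"
```

An exact match after normalization is deliberately conservative. It would not catch the retweet and reply cases, where the body differs from the title in a patterned way.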

TheLastBoyScoutUK commented 7 months ago

Retweets and replies aren't exactly duplicate content, but they follow a consistent logic that could be handled. Probably best to drop the body and keep the title in these examples?

Screenshot_20240122-140425-193

Screenshot_20240122-140505-992

gugray commented 7 months ago

This is not doable within the scope of RSS Parrot. It needs to stay a small and lean system, and part of that is that it doesn't do any kind of analysis or processing on the feed content. Developing heuristics to detect similar, sort-of-duplicate content within posts, let alone across posts, is beyond the birb's means.

TheLastBoyScoutUK commented 7 months ago

Perhaps just an option to take only the <title> from the feed and exclude the <description>?

E.g. if it were a CLI tool, something like: @birb [feed] --title-only
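
To make the suggestion concrete, a minimal Python sketch of what a title-only mode could look like when reading an RSS 2.0 feed. Everything here is an assumption for illustration: `item_texts`, its `title_only` parameter, and the sample feed are made up, the bot has no such option, and Atom feeds (which use `<entry>` and `<summary>`) are not handled.

```python
import xml.etree.ElementTree as ET


def item_texts(rss_xml: str, title_only: bool = False) -> list[str]:
    """Build one post text per <item>, optionally ignoring <description>."""
    root = ET.fromstring(rss_xml)
    texts = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        description = (item.findtext("description") or "").strip()
        if title_only or not description:
            # Title-only mode, or nothing in the body to add.
            texts.append(title)
        else:
            texts.append(f"{title}\n\n{description}")
    return texts


# Made-up sample feed, used only to exercise the function.
SAMPLE = """<rss version="2.0"><channel><title>Demo</title>
<item><title>First post</title><description>Body text</description></item>
</channel></rss>"""

print(item_texts(SAMPLE, title_only=True))
```

Because this is a plain per-feed switch and not content analysis, it avoids the heuristics the maintainer ruled out above.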