miniflux / v2

Minimalist and opinionated feed reader
https://miniflux.app
Apache License 2.0

Deduplicate Feature #797

Open dslovin opened 3 years ago

dslovin commented 3 years ago

Occasionally, I get duplicate entries for the same article due to reading a feed at the source as well as through an aggregator like Hacker News. I would love to be able to dedupe based on the following fields: 1) Link 2) Title 3) (bonus) Similar titles

(edit for spelling)

moonheart commented 2 years ago

Some sites publish original news and others copy it; when I subscribe to these feeds, I keep seeing duplicate articles across several of them. I would like to deduplicate similar entries across multiple or all feeds.

When adding a new entry, Miniflux could check recent existing entries and calculate their similarity; if an existing entry reaches the configured threshold, the new entry would be marked as removed or read.

For the similarity calculation, maybe we can first split the text into words and use cosine similarity, or simply test for equality. Users could configure how similarity is calculated and whether it applies to the title or the content.
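
A minimal sketch of that idea (everything here is illustrative, not Miniflux code; the 0.8 threshold is an assumption):

from collections import Counter
import math

def cosine_similarity(a: str, b: str) -> float:
    # Build word-count vectors from the two titles
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) \
         * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def is_duplicate(new_title: str, seen_title: str, threshold: float = 0.8) -> bool:
    # A threshold of 1.0 roughly corresponds to the "simply test for equality" case
    return cosine_similarity(new_title, seen_title) >= threshold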

somini commented 2 years ago

Hopefully a "Mark as Read" option would be available; that's what I do manually anyway.

nblock commented 2 years ago

Since Miniflux relies on PostgreSQL, maybe something like the pg_trgm extension would be useful: https://www.postgresql.org/docs/current/pgtrgm.html
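
An untested sketch of how that could be queried from Python (the entries table and column names are assumptions about the Miniflux schema, and pg_trgm must first be enabled with CREATE EXTENSION pg_trgm):

import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=miniflux user=miniflux")
with conn, conn.cursor() as cur:
    # Find pairs of entries in the same feed with similar titles
    cur.execute("""
        SELECT a.id, b.id, similarity(a.title, b.title) AS sim
        FROM entries a
        JOIN entries b ON a.id < b.id AND a.feed_id = b.feed_id
        WHERE similarity(a.title, b.title) > 0.7
    """)
    for a_id, b_id, sim in cur.fetchall():
        print(f"entries {a_id} and {b_id} look similar (score {sim:.2f})")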

ajtatum commented 2 years ago

This would be an awesome feature. A lot of times a writer publishes on their own blog and then reposts somewhere else, and both copies end up in the same Miniflux category with the same title. If Miniflux could remove either entry (preferably keeping the first), that would be great.

Sieboldianus commented 1 year ago

Came here with a slightly different (but related) problem: some of my feeds, largely big newspapers, re-publish the same articles over time. This particularly applies to essays; I think they want to push them a number of times so their website appears "more active", without adding any new information. But it is frustrating to see the same posts popping up again and again, and it wastes my time.

I was wondering whether a deduplication feature could also include a temporal comparison, such as "the same article heading was published 1 month ago, 2 years ago, etc.", so that those entries get hidden from the standard view.

Functionally, it would be pretty similar: one needs a persistent table of headings (and timestamps) in Postgres to check against.
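
A rough sketch of that temporal check (purely illustrative; the in-memory dict stands in for the persistent Postgres table, and the 30-day threshold is an assumption):

from datetime import datetime, timedelta

MIN_AGE_DAYS = 30
first_seen = {}  # heading -> datetime when it was first published

def is_republished(heading: str, published_at: datetime) -> bool:
    # Record the first sighting, then flag any much later re-appearance
    earliest = first_seen.setdefault(heading, published_at)
    return published_at - earliest > timedelta(days=MIN_AGE_DAYS)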

sonor3000 commented 1 year ago

I don't know how it is implemented, but ttrss has such a deduplication feature. Maybe that can help in developing such a feature for Miniflux too!

tagd commented 1 month ago

As a workaround, I created a Python script that uses the API to check for repeated URLs and remove duplicates, borrowing from a similar solution; it can be run as a cron job on your Miniflux server.

I might look into triggering it with webhooks in the future, so if anyone works that out, please comment with the details. And if you tidy up the script, I'd be interested to see it, as I don't use Python often.

Note that if you use the get_feeds_w_dupes function, adding RemoveDuplicates to a feed's blocklist will add it to future scans. If you don't want to use this function, remove_duplicates can be run with a list of feed IDs, like remove_duplicates([1, 2]).

Also, if you want to check titles as well, just create another set and aggregate them with the entry["title"] attribute.

# Licensed under MIT license
# Link to discussion: https://github.com/miniflux/v2/issues/797

# Steps to use:
# - Install the Miniflux python client: https://miniflux.app/docs/api.html#python-client
# - Replace rss.example.com with your Miniflux instance url
# - Replace API_KEY with your Miniflux api key which you can get here https://rss.example.com/keys

import miniflux

# Behaviour: keep the first instance based on URL, mark subsequent ones as removed
# Drawbacks:
#    - Newer versions of an article may have updates,
#      but keeping those would lose read status and other attributes
#    - A repeat article could be so old it has already been removed;
#      if it's been that long, I assume the changes might be worth reading
#    - The oldest instance might not have been the one that was read
# Removal process: https://miniflux.app/faq.html#entries-suppression
def remove_duplicates(feed_ids):
    dupe_ids = []
    for feed_id in feed_ids:
        entries = client.get_feed_entries(feed_id=feed_id, order="id", 
            direction="asc", status=["read","unread"])
        seen_urls = set()
        for entry in entries["entries"]:
            if (entry["url"] in seen_urls):
                dupe_ids.append(entry["id"])
                #print("Duplicate found " + entry["title"])
            else:
                seen_urls.add(entry["url"])
    if dupe_ids: # Repeats found
        client.update_entries(dupe_ids, status="removed")

# Get a list of feeds to check for duplicates based on blocklist text.
# To check a feed add "RemoveDuplicates" to the Blocklist_rules box
def get_feeds_w_dupes(all_feeds):
    feeds_ids = []
    for feed in all_feeds:
        if ("RemoveDuplicates" in feed["blocklist_rules"]):
            feeds_ids.append(feed["id"])
            #print("Feed " + str(feed["id"])+ " has rule RemoveDuplicates")
    return feeds_ids

client = miniflux.Client("https://rss.example.com", api_key="API_KEY")

all_feeds = client.get_feeds()

feeds_w_dupes = get_feeds_w_dupes(all_feeds)
remove_duplicates(feeds_w_dupes)

Sieboldianus commented 1 month ago

Great! But this only works on a per-URL basis. If the URL changed but the text remained the same (with e.g. a slightly changed title), this would not be detected (I am not complaining; this is much better than nothing, thank you so much!). There are a lot of newspaper feeds that re-publish entries under a slightly changed title every x days. Some kind of semantic similarity check would be needed to catch these.

tagd commented 1 month ago

> But this only works on a per-URL basis. If the URL changed but the text remained the same (with e.g. a slightly changed title), this would not be detected. [...] Some kind of semantic similarity check would be needed to catch these.

@Sieboldianus Happy to help! Here's an edit of the main function to pick up on identical titles. I'd also suggest checking whether the articles keep some other attribute that can be matched, like "published_at", which will be a string such as "2024-06-10T15:53:17+01:00" and so should be fairly unique.

def remove_duplicates(feed_ids):
    dupe_ids = []
    for feed_id in feed_ids:
        entries = client.get_feed_entries(feed_id=feed_id, order="id", 
            direction="asc", status=["read","unread"])
        seen_urls = set()
        seen_titles = set()
        for entry in entries["entries"]:
            if ((entry["url"] in seen_urls) or (entry["title"] in seen_titles)):
                dupe_ids.append(entry["id"])
                #print("Duplicate found " + entry["title"])
            else:
                seen_urls.add(entry["url"])
                seen_titles.add(entry["title"])
    if dupe_ids: # Repeats found
        client.update_entries(dupe_ids, status="removed")
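
A variant keyed on the "published_at" idea could look like this (untested sketch, reusing the client from the first script; treat it as illustrative):

def remove_duplicates_by_date(feed_ids):
    # Treat entries sharing the exact published_at timestamp as duplicates
    dupe_ids = []
    for feed_id in feed_ids:
        entries = client.get_feed_entries(feed_id=feed_id, order="id",
            direction="asc", status=["read","unread"])
        seen_dates = set()
        for entry in entries["entries"]:
            if entry["published_at"] in seen_dates:  # e.g. "2024-06-10T15:53:17+01:00"
                dupe_ids.append(entry["id"])
            else:
                seen_dates.add(entry["published_at"])
    if dupe_ids:
        client.update_entries(dupe_ids, status="removed")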

For fuzzy matching I wrote the function below. I've checked that the matching works, but I didn't have any feeds that rename titles, so I haven't validated it against real data; if you could post example feeds, that would be helpful, thanks.

import re
import miniflux
from fuzzywuzzy import fuzz # pip install fuzzywuzzy
                            # Compare strings with https://en.wikipedia.org/wiki/Levenshtein_distance 
import nltk                 # pip install nltk
from nltk.corpus import stopwords # list of words to ignore
nltk.download('stopwords')  # Download stopwords (only needed on first run)

def read_duplicates(feed_ids, sensitivity = 85):
    """
    This function marks articles as read if they are a duplicate of a previously
    seen article, based on similarity of the title.

    Similarity computed by:
    Take the title, remove all punctuation, stop words (words like 'a', 'and', 
        'the', etc) and set to lower case.
    This produces a string of the important words in the title.  
    If the article is read store this processsed title in a list for comparison.
    If unread check its not already in the list if so mark read
        If not check similar based on Levenshtein distance(LD)
            words are sorted to alphabetical order before computing LD

    Args:
      feed_ids: The ids of feeds to check, like ["1", "2"]
      sensitivity: How similar two string must be to match

    """
    dupe_ids = []
    stop_words = set(stopwords.words('english')) # words to ignore
    for feed_id in feed_ids:
        seen_titles = set()
        entries = client.get_feed_entries(feed_id=feed_id, order="id", 
            direction="asc", status=["read","unread"])
        """ initially assumed we could see all read then check unread, 
            but then if we see titles like 
                '1.bad thing happens, 2.things bad happens'
            when 2. is marked read for being similar to 1. next run of 
            program first sees 2. in read, then sees 1. which is similar
            so marks 1. as read, even though neither was ever read. 
            So we have to go through by id then check read status
        """
        for entry in entries["entries"]:
            # processed title is title without joining words and lowercase
            processed_title = re.sub(r'\W+', ' ', entry["title"]) # replace non alphanumeric chars with space
            processed_title = ' '.join([word for word in processed_title.lower().split() 
                                   if word not in stop_words])
            if (entry["status"]=="read"):
                seen_titles.add(processed_title)
            else:  # only check entries that are still unread
                if (processed_title in seen_titles):
                    print("Duplicate found "+ processed_title)
                    dupe_ids.append(entry["id"])
                else:
                    for title in seen_titles:
                        print("checking: '"+ processed_title + "' against '" + title + "'")
                        # 1-100, higher = closer matching
                        # token sort will sort words into order before comparing titles
                        if (fuzz.token_sort_ratio(processed_title, title) > sensitivity):
                            dupe_ids.append(entry["id"])
                            seen_titles.add(processed_title) 
                            ''' See matches too; so one article may have multiple seen titles 
                                and we can match between them eg
                                    The white whale > Big white whale > Big whale
                                where the initial title may not match the later title but 
                                by keeping the intermediate we recognise the final version
                            '''
                            break # match found stop checking
                    seen_titles.add(processed_title) 
    if dupe_ids: # Repeats found
        client.update_entries(dupe_ids, status="read")
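
Usage would mirror the first script, e.g. (illustrative):

client = miniflux.Client("https://rss.example.com", api_key="API_KEY")
read_duplicates(get_feeds_w_dupes(client.get_feeds()), sensitivity=85)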

Sieboldianus commented 1 month ago

Nice, thank you! I will definitely test it.

didn0t commented 2 weeks ago

> As a workaround, I created a Python script that uses the API to check for repeated URLs and remove duplicates [...]

Thanks for your script. I did notice that I had to add a limit parameter to client.get_feed_entries, as it defaults to 100.
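
For example, inside the scripts above (the value 10000 is arbitrary; pick something larger than your biggest feed):

entries = client.get_feed_entries(feed_id=feed_id, order="id",
    direction="asc", status=["read","unread"], limit=10000)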