lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Automatic .dedupe.once.title, sometimes #322

Open lemon24 opened 1 year ago

lemon24 commented 1 year ago

I got a feed with duplicate entries because the ids for all the entries changed; content dedupe didn't work for (most of?) them, likely because the content formatting/suffixes changed (todo: check).

I fixed it with .dedupe.once.title, checking beforehand that:

There's no reason the plugin can't do these checks in code.

davidag commented 11 months ago

Hello @lemon24! 👋🏼

I'd like to help with this issue if possible. I could use a bit of help though :)

Taking a look at the problematic feed, I don't see content/summary fields, but you mentioned they probably had changed. Maybe they are gone now? Am I missing something?

Beyond that, I'm thinking about what the solution would look like:

  1. On the after_feed_update hook, if there are no dedupe-specific tags, check whether every old entry is duplicated by a new one (comparing titles only).
    1. Get all entries and split them into old and new, using entry.added == entry.last_updated to identify new entries.
    2. Check if all entries in the old set have a corresponding one with the same title in the new set.
  2. If the check in step 1 is positive, run the code for .dedupe.once.title already present in the aforementioned hook.
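
As a rough sketch of the check in step 1 (not code from the plugin: `EntryStub` and `all_old_titles_reappear` are hypothetical names, and the stub only mirrors the `title`, `added` and `last_updated` attributes mentioned above), I imagine something like:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class EntryStub:
    # Hypothetical stand-in for an entry; only the fields the check needs.
    id: str
    title: str
    added: datetime
    last_updated: datetime


def all_old_titles_reappear(entries):
    """Return True if every old entry has a new entry with the same title."""
    new_titles = set()
    old_titles = []
    for entry in entries:
        # Step 1.1: treat an entry as new if it was never updated after being added.
        if entry.added == entry.last_updated:
            new_titles.add(entry.title)
        else:
            old_titles.append(entry.title)
    # Step 1.2: require at least one old entry, and a same-title match for each.
    return bool(old_titles) and all(title in new_titles for title in old_titles)
```

Returning False when there are no old entries is my assumption; it keeps a brand-new feed from triggering the title dedupe.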

What do you think?

Thanks 🙏🏼 and great project 💯

lemon24 commented 11 months ago

Hi @davidag, thank you for your interest!

Taking a look at the problematic feed [...]

I checked a backup and the old entries didn't have content/summary either, so the pairs were not deduped because the body of these for loops never got a chance to run (and wouldn't have, unless both entries in a pair had content).
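
To illustrate why nothing runs in that case, here is a minimal sketch (not the plugin's actual code; `content_matches` is a hypothetical helper, and plain equality stands in for the real similarity check) of a comparison that loops over pairs of content values:

```python
from itertools import product


def content_matches(one_contents, two_contents):
    """Return True if any pair of content values compares equal (illustrative only)."""
    # If either list is empty, product() yields no pairs,
    # so the loop body never runs and the pair is not deduped.
    for a, b in product(one_contents, two_contents):
        if a == b:
            return True
    return False
```

With no content on either side, this always returns False, which is the "keep both" outcome.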

This is partly by design: the current code tries very hard not to delete data ("when in doubt, keep both").

I'm thinking about what the solution would look like:

Indeed, most of the logic should happen in after_feed_update (the stuff in after_entry_update should probably have been there from the start).

Here's what I believe the complete logic may look like; it matches your outline (with one difference noted below):

def after_entry_update_hook():
    tag new entries with '.dedupe._new'

def after_feed_update_hook():
    # optimization, not possible at the moment;
    # would require the hook to receive the UpdatedFeed,
    # or get_entries(tags='.dedupe._new') (filtering by entry tags)
    if there are no new entries:
        return

    collect all entry ids and titles
    group collected entries by title
    exclude groups with no more than 1 entry
    if feed does not have any '.dedupe.once*' tag:
        exclude groups that do not have new entries

    # optimization
    if there are no groups:
        clear '.dedupe._new' tag from entries
        return

    # select how strict we are about what we consider duplicates
    if feed has '.dedupe.once.title' tag:
        # user said so
        is_duplicate = _is_duplicate_title
    elif (
        none of the old entries have duplicate titles
        and none of the new entries have duplicate titles
        and most new entries have old entries with the same title
    ):
        # reasonably safe to dedupe by title alone
        is_duplicate = _is_duplicate_title
    else:
        # similarity dedupe
        is_duplicate = _is_duplicate_full

    run _dedupe_entries for each group (original logic)
    clear '.dedupe._new' tag from entries
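
The grouping and the "reasonably safe" condition above could be sketched in plain Python roughly as follows. This is only an illustration under assumptions: `group_by_title` and `title_only_is_safe` are hypothetical names, entries are reduced to a list of `(id, title)` tuples, new entries are given as a set of ids rather than read from tags, and "most" is taken to mean more than half (the `threshold` default).

```python
from collections import defaultdict


def group_by_title(entries, new_ids, has_once_tag=False):
    """Group (id, title) pairs by title, keeping only candidate duplicate groups."""
    groups = defaultdict(list)
    for entry_id, title in entries:
        groups[title].append(entry_id)
    result = {}
    for title, ids in groups.items():
        # exclude groups with no more than 1 entry
        if len(ids) <= 1:
            continue
        # without a '.dedupe.once*' tag, exclude groups that have no new entries
        if not has_once_tag and not any(i in new_ids for i in ids):
            continue
        result[title] = ids
    return result


def title_only_is_safe(entries, new_ids, threshold=0.5):
    """Check the three conditions for deduping by title alone."""
    old_titles = [title for entry_id, title in entries if entry_id not in new_ids]
    new_titles = [title for entry_id, title in entries if entry_id in new_ids]
    if not new_titles:
        return False
    # none of the old entries have duplicate titles
    if len(set(old_titles)) != len(old_titles):
        return False
    # none of the new entries have duplicate titles
    if len(set(new_titles)) != len(new_titles):
        return False
    # most new entries have old entries with the same title (assumed: more than half)
    old_set = set(old_titles)
    matched = sum(1 for title in new_titles if title in old_set)
    return matched / len(new_titles) > threshold
```

For example, with old entries titled A and B and new entries titled A, B and C, both A and B form groups, and two of the three new titles have an old counterpart, so the title-only check passes.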

Some notes:

Once again, thank you, and don't hesitate to ask any follow-up questions if needed.