Open lemon24 opened 1 year ago
Hello @lemon24! 👋🏼
I'd like to help with this issue if possible. I could use a bit of help though :)
Taking a look at the problematic feed, I don't see content/summary fields, but you mentioned they probably had changed. Maybe they are gone now? Am I missing something?
Beyond that, I'm thinking about what the solution would look like:
- In an after_feed_update hook, if there are no dedupe-specific tags, check if all old entries are duplicated with new ones (using only titles).
- Use entry.added == entry.last_updated to distinguish new entries.
- Reuse the .dedupe.once.title logic already present in the aforementioned hook.
What do you think?
Thanks 🙏🏼 and great project 💯
Hi @davidag, thank you for your interest!
> Taking a look at the problematic feed [...]
I checked a backup and the old entries didn't have content/summary either, so the pairs were not deduped because the body of these for loops never got a chance to run (and wouldn't have, unless both entries in a pair had content).
This is partly by design: the current code tries very hard not to delete data ("when in doubt, keep both").
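To illustrate the "when in doubt, keep both" behavior, here is a minimal sketch of a conservative pairwise check. The Entry class, the is_duplicate_full_sketch name, and the whitespace/case normalization are all assumptions made for illustration; this is not the plugin's actual comparison code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a feed entry; not the reader Entry class."""
    id: str
    title: Optional[str] = None
    text: Optional[str] = None  # content or summary, if any


def _normalize(value: str) -> str:
    # Collapse whitespace and case so trivial formatting changes don't matter.
    return " ".join(value.lower().split())


def is_duplicate_full_sketch(one: Entry, two: Entry) -> bool:
    """Return True only when both title and text are present and match."""
    if not one.title or not two.title:
        return False
    if _normalize(one.title) != _normalize(two.title):
        return False
    # When in doubt, keep both: a missing body on either side means
    # there is not enough evidence to call the pair a duplicate.
    if not one.text or not two.text:
        return False
    return _normalize(one.text) == _normalize(two.text)


old = Entry("old-1", "Hello world")
new = Entry("new-1", "Hello world")
print(is_duplicate_full_sketch(old, new))  # False: no text to compare
```

With a check shaped like this, two entries that share a title but have no content or summary are never treated as duplicates, which is the situation described above.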
> I'm thinking about what the solution would look like:
Indeed, most of the logic should happen in after_feed_update (the stuff in after_entry_update should have probably been there from the start).
Here's what I believe the complete logic may look like; it matches your outline (with one difference noted below):
def after_entry_update_hook:
    tag new entries with '.dedupe._new'

def after_feed_update_hook:
    # optimization, not possible at the moment;
    # would require the hook to receive the UpdatedFeed,
    # or get_entries(tags='.dedupe._new') (filtering by entry tags)
    if there are no new entries:
        return

    collect all entry ids and titles
    group collected entries by title
    exclude groups with no more than 1 entry
    if feed does not have any '.dedupe.once*' tag:
        exclude groups that do not have new entries

    # optimization
    if there are no groups:
        clear '.dedupe._new' tag from entries
        return

    # select how strict we are about what we consider duplicates
    if feed has '.dedupe.once.title' tag:
        # user said so
        is_duplicate = _is_duplicate_title
    elif (
        none of the old entries have duplicate titles
        and none of the new entries have duplicate titles
        and most new entries have old entries with the same title
    ):
        # reasonably safe to dedupe by title alone
        is_duplicate = _is_duplicate_title
    else:
        # similarity dedupe
        is_duplicate = _is_duplicate_full

    run _dedupe_entries for each group (original logic)
    clear '.dedupe._new' tag from entries
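The grouping and strictness-selection steps above can be sketched in plain Python. Everything here is an assumption made for illustration: the Entry dataclass, the is_new flag standing in for the '.dedupe._new' tag, the helper names, and the 0.5 threshold used to interpret "most". None of it is the plugin's actual code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a feed entry."""
    id: str
    title: str
    is_new: bool = False  # stands in for the '.dedupe._new' entry tag


def group_by_title(entries, only_with_new=True):
    """Group entries by title, keeping only groups of 2+ entries."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry.title].append(entry)
    result = {}
    for title, group in groups.items():
        if len(group) < 2:
            continue
        # Without a '.dedupe.once*' tag, only groups touched by this update matter.
        if only_with_new and not any(e.is_new for e in group):
            continue
        result[title] = group
    return result


def has_unique_titles(entries):
    titles = [e.title for e in entries]
    return len(titles) == len(set(titles))


def title_only_is_safe(entries, threshold=0.5):
    """Heuristic: is it reasonably safe to dedupe by title alone?"""
    old = [e for e in entries if not e.is_new]
    new = [e for e in entries if e.is_new]
    if not new or not has_unique_titles(old) or not has_unique_titles(new):
        return False
    old_titles = {e.title for e in old}
    matched = sum(1 for e in new if e.title in old_titles)
    # "most" is read as "more than `threshold` of new entries" (assumption).
    return matched / len(new) > threshold


entries = [
    Entry("old-1", "First post"),
    Entry("old-2", "Second post"),
    Entry("new-1", "First post", is_new=True),
    Entry("new-2", "Second post", is_new=True),
    Entry("new-3", "Third post", is_new=True),
]
groups = group_by_title(entries)
print(sorted(groups))               # ['First post', 'Second post']
print(title_only_is_safe(entries))  # True: 2 of 3 new entries match old titles
```

If either the old or the new entries contain repeated titles, title_only_is_safe returns False and the stricter similarity check would be used instead.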
Some notes:
- The one difference from your outline: a .dedupe._new entry tag is used to tell new entries apart. This way, if the plugin fails for some reason, we can pick new entries up on some future run.
Once again, thank you, and don't hesitate to ask any follow-up questions if needed.
I got a feed with duplicate entries because the ids for all the entries changed; content dedupe didn't work for (most of?) them, likely because the content formatting/suffixes changed (todo: check).
I fixed it with .dedupe.once.title, checking beforehand that:
There's no reason the plugin can't do these checks in code.