lwindolf / liferea

Liferea (Linux Feed Reader), a news reader for GTK/GNOME
https://lzone.de/liferea
GNU General Public License v2.0

Is it really that difficult to stop duplicate downloads? #1361

Closed AndyM48 closed 1 week ago

AndyM48 commented 1 month ago

A picture is worth a thousand words:

[Screenshot: 2024-05-13_07-22]

These are from the BBC Feed (http://feeds.bbci.co.uk/news/rss.xml). Isn't it just a question of comparing titles and times?

lwindolf commented 1 month ago

No, I believe it isn't that simple. Please check out the complexity of the item comparison code in src/itemset.c. There is already a lot of logic there for eliminating duplicates.

The BBC feed in question provides unique identifiers for feed items. If those are present, a difference between them is taken as an indication of different items. If such a feed provider issues the same content with a new UID, the RSS spec says it is to be considered new content.

There are use cases where you want this behaviour, and your suggestion would kill them. For example, a feed alerting on something and providing the same content at different times to show you that a problem persists.
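To illustrate the rule described above, here is a minimal sketch. It is not Liferea's actual logic from src/itemset.c; the `Item` class, the `is_same_item` function and the title fallback are all hypothetical and only show why two entries with identical titles but different GUIDs count as distinct items.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    title: str
    guid: Optional[str] = None  # value of the RSS <guid> element, if the feed provides one


def is_same_item(a: Item, b: Item) -> bool:
    """Hypothetical comparison: a GUID, when present on both items, decides identity."""
    if a.guid is not None and b.guid is not None:
        return a.guid == b.guid
    # Assumed fallback for feeds without GUIDs: compare titles only.
    return a.title == b.title


first = Item("Match highlights", "https://www.bbc.com/sport/football/videos/cx88ezex0jzo#5")
second = Item("Match highlights", "https://www.bbc.com/sport/football/videos/cx88ezex0jzo#6")
print(is_same_item(first, second))  # the GUIDs differ, so these are treated as two items
```

Under this rule, comparing titles and times never comes into play for a feed that supplies GUIDs, which is why the reissued BBC entries show up twice.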

AndyM48 commented 1 month ago

Thank you for the explanation. I understand what you have said. Could there be an option, or maybe a plugin, to hide "apparent" duplicates, i.e. to ignore the UID when displaying the feeds?

lwindolf commented 1 month ago

Such an option would be possible. Maintaining the feature is the problem. This is a one-man project, and all code paths that the maintainer does not use daily tend to rot :-(

AndyM48 commented 1 month ago

This is really very frustrating. Many, many feeds have apparently duplicated items, especially those from the BBC. The only difference in the SQL database (items) seems to be in the source_id, where a number is appended to the string, e.g.:

https://www.bbc.com/sport/football/videos/cx88ezex0jzo#5
https://www.bbc.com/sport/football/videos/cx88ezex0jzo#6

Are these the "unique identifiers" you referred to above?

There is an informative article here

AndyM48 commented 1 month ago

So I solved this problem, which seems to mainly affect the BBC feeds. Thanks to DanQ for the info.

The answer was to intercept the BBC feed and remove the "#nn" numbers which the BBC had helpfully added to each guid. Unfortunately I could not get the Ruby script that DanQ offered to work, so I rewrote it in Tcl, and it works fine.
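For readers who want to try the same approach, the following is a rough Python sketch of the idea, not the Tcl or Ruby script mentioned above. The function name `strip_guid_suffix` and the sample feed are made up for illustration, and it assumes the suffix is always a `#` followed by digits at the end of the `<guid>` text.

```python
import re
import xml.etree.ElementTree as ET


def strip_guid_suffix(rss_xml: str) -> str:
    """Return the feed with a trailing '#<digits>' removed from every <guid>."""
    root = ET.fromstring(rss_xml)
    for guid in root.iter("guid"):
        if guid.text:
            # Drop only a numeric fragment at the very end of the identifier.
            guid.text = re.sub(r"#\d+$", "", guid.text.strip())
    return ET.tostring(root, encoding="unicode")


# Made-up sample feed using the guid pattern quoted earlier in this thread.
sample = """<rss version="2.0"><channel><title>Example</title>
<item><title>Clip</title><guid isPermaLink="false">https://www.bbc.com/sport/football/videos/cx88ezex0jzo#5</guid></item>
<item><title>Clip</title><guid isPermaLink="false">https://www.bbc.com/sport/football/videos/cx88ezex0jzo#6</guid></item>
</channel></rss>"""

cleaned = strip_guid_suffix(sample)
print(cleaned)
```

After filtering, both entries carry the same guid, so a reader that keys on the guid can recognise them as one item. Note that ElementTree may rewrite namespace prefixes it does not know about when serialising a real feed, so a production filter would probably need to register those namespaces first.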