samuelclay / NewsBlur

NewsBlur is a personal news reader that brings people together to talk about the world. A new sound of an old instrument.
http://www.newsblur.com
MIT License
6.92k stars, 1k forks

Backend: Cluster articles using Elasticsearch MoreLikeThis #1531

Open alin-simionoiuDE opened 3 years ago

alin-simionoiuDE commented 3 years ago

I love NewsBlur and have been a paying customer for a while now. The one feature I would love to have is removing duplicate articles at the folder level, preferably via some sort of checkbox in "folder settings".

I keep my feeds in folders, and I always find duplicated articles inside a folder. That's not surprising, really: if there's a good subject out there, multiple feeds are going to have articles about it.

Looks like Inoreader has this feature.

[Screenshot: Screen Shot 2021-08-10 at 1 09 59 PM]

samuelclay commented 3 years ago

NewsBlur already tries very hard to remove duplicate articles automatically. Unfortunately, some still come through, so let's use this ticket as a reminder to double-check the dupe checker and to write tests that ensure it works as we expect. I see no reason to make this a preference. Duplicates should be (and are) filtered out automatically.

Here's the code that does the work:

https://github.com/samuelclay/NewsBlur/blob/master/apps/rss_feeds/models.py#L2014-L2117

samuelclay commented 3 years ago

Oh, you mean from unrelated feeds in the folder? That's a different kind of check, one that I'm working on now using Elasticsearch's MoreLikeThis query.
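
For readers unfamiliar with MoreLikeThis, here is a minimal sketch of what such a query body might look like. It is not NewsBlur's actual implementation: the function name `build_mlt_query`, the field names `title`, `content`, and `feed_id`, and the threshold values are all assumptions chosen for illustration. Only the `more_like_this` query and its parameters (`fields`, `like`, `min_term_freq`, `max_query_terms`, `minimum_should_match`) come from Elasticsearch itself.

```python
from typing import Any, Dict, List


def build_mlt_query(title: str, content: str, folder_feed_ids: List[int],
                    source_feed_id: int) -> Dict[str, Any]:
    """Build a search body that looks for stories similar to the given one.

    All field names ("title", "content", "feed_id") are hypothetical.
    """
    return {
        "query": {
            "bool": {
                # Score candidates by textual similarity to this story.
                "must": [{
                    "more_like_this": {
                        "fields": ["title", "content"],
                        "like": f"{title}\n{content}",
                        "min_term_freq": 1,
                        "max_query_terms": 25,
                        # Assumed threshold; would need tuning on real data.
                        "minimum_should_match": "60%",
                    }
                }],
                # Only consider stories from feeds in the same folder...
                "filter": [{"terms": {"feed_id": folder_feed_ids}}],
                # ...but not from the feed the story itself came from.
                "must_not": [{"term": {"feed_id": source_feed_id}}],
            }
        }
    }


body = build_mlt_query(
    title="Example headline",
    content="Example story text.",
    folder_feed_ids=[1, 2, 3],
    source_feed_id=1,
)
```

The resulting dictionary could be sent as the body of a search request. Excluding the source feed reflects the distinction drawn above: same-feed duplicates are a separate check, while this one targets similar stories from unrelated feeds.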

alin-simionoiuDE commented 3 years ago

Yes! From unrelated feeds in a folder.

Merkur9 commented 3 years ago

I like the idea very much. It is similar to what Google News does: it clusters the news for a specific event or topic across different sources. See screenshot: 2021-09-30_10-41-41

Merkur9 commented 3 years ago

One idea I had about this: I guess it would already help if NewsBlur could cluster by the trained tags. E.g., I train the word "Amazon" and it clusters everything that has Amazon in the title, independent of the source.
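
A rough sketch of that idea, assuming each story is a plain dictionary with hypothetical `title` and `feed` keys and that the trained words are available as a list of strings. The function name `cluster_by_trained_words` is made up for illustration and does not correspond to anything in NewsBlur's code.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def cluster_by_trained_words(stories: Iterable[dict],
                             trained_words: Iterable[str]) -> Dict[str, List[dict]]:
    """Group stories under every trained word that appears in their title."""
    # Whole-word, case-insensitive match so "Amazon" also matches "amazon".
    patterns = {
        word: re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        for word in trained_words
    }
    clusters: Dict[str, List[dict]] = defaultdict(list)
    for story in stories:
        for word, pattern in patterns.items():
            if pattern.search(story["title"]):
                clusters[word].append(story)
    return dict(clusters)


stories = [
    {"title": "Amazon opens new warehouse", "feed": "Feed A"},
    {"title": "Why amazon is hiring", "feed": "Feed B"},
    {"title": "Local election results", "feed": "Feed C"},
]
clusters = cluster_by_trained_words(stories, ["Amazon"])
```

Here the two Amazon stories end up in one cluster even though they come from different feeds, and the unrelated story is left out. A story that matches several trained words would appear in several clusters, which may or may not be the desired behavior.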