gravitystorm / blogs.osm.org

The new feed aggregator for OpenStreetMap
https://blogs.openstreetmap.org/

Spam diaries are not deleted #17

Closed. AndrewHain closed this issue 5 years ago

AndrewHain commented 7 years ago

The old site would remove diaries by deleted users as soon as it read the updated feed. The new site is currently not doing this: the Quenching oil diary still appears with its link even though the account was deleted yesterday.

harry-wood commented 7 years ago

Yes. 'Quenching oil'. Still there now on the live site. We should definitely figure out why.

I see the chef-managed cron command that runs every 30 mins is here: https://github.com/openstreetmap/chef/blob/master/cookbooks/blogs/templates/default/cron.erb So that pluto build command is evidently not removing deleted posts by itself.

My local copy no longer has the 'Quenching oil' post showing. I'm not sure if that's because pluto build did something different for me compared to the live site, or because maybe I blew away some data in a different way at some point. Guess that proves the spam is gone from the osm diaries feed at least.

tomhughes commented 7 years ago

Well there's no real way to know when a post is deleted, because RSS feeds don't contain deletion markers.

I assume the old one was just removing anything that was no longer in the upstream feed, which means it was also expiring anything that fell off the end of the upstream feeds. That kind of relates to the question I think somebody asked on IRC or by email about how much history there is or will be.

harry-wood commented 7 years ago

I did some experiments fiddling with a test copy of the input RSS XML, and doing a pluto build each time (using a command quite similar to the live site's).

20 seems to be the limit. As it happens there are also 20 items in our osm diary feed RSS. It gets all 20 when we build, but if I add a new item at the top of the input feed (with a fake id and a recent pubDate), then the 21st item is no longer displayed in the HTML. However it does remain in the sqlite database (the 'items' table now has 21 records). Likewise I can add new fake items at the end of the file with earlier pubDates, and these get added to the sqlite DB, but the HTML shows a maximum of 20.

Now if I delete an item, removing it from the input RSS XML, and rebuild, then nothing happens. It lives on in the sqlite db and in the HTML (the spam problem we're seeing). But if I change an item, e.g. blank all the content from it, this change makes its way to the db and to the HTML. If I set an item's pubDate to 1900, it disappears off the bottom.

Spam solution 1. Source RSS fudge: At the OSM end, leave deleted items in the RSS but update them to have a blank description and a pubDate of 1900. Would that be terribly non-standard in an RSS sense? Probably.

Spam solution 2. Delete the DB: delete the sqlite database prior to building every 30 mins. The DB doesn't seem to be caching anything useful (it doesn't prevent any feed requests). [Correction: it does cache; it powers conditional GET.] It's storing a back-catalogue of old items which the site doesn't display at all. rm planet.db

smsm1 commented 7 years ago

Another solution: Take the oldest item in the RSS, and if it exists in the DB, remove any items from the DB that are newer than that, which are no longer in the RSS feed. This solution would possibly need to be implemented upstream, and would be the cleanest.
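
A minimal sketch of how that pruning might look if run against planet.db before each pluto build. It is hedged: the feed URL and the published column on the items table are assumptions (check the real schema with .schema), and the guid pattern is the one used later in this thread.

require 'open-uri'
require 'rss'
require 'sqlite3'

FEED_URL  = 'https://www.openstreetmap.org/diary/rss'        # assumed feed URL
DB_PATH   = 'planet.db'
OSM_GUIDS = 'http://www.openstreetmap.org/user/%/diary/%'

feed = RSS::Parser.parse(URI.open(FEED_URL).read, false)
feed_guids  = feed.items.map { |item| item.guid.content }
oldest_date = feed.items.map(&:pubDate).min
exit if feed_guids.empty?   # nothing to compare against

db = SQLite3::Database.new(DB_PATH)

# Delete cached diary items that are newer than the oldest item still in the
# feed but no longer present in the feed, i.e. deleted or moderated upstream.
# The date format assumes ActiveRecord's default text storage in SQLite.
placeholders = (['?'] * feed_guids.size).join(',')
db.execute(
  "DELETE FROM items
    WHERE guid LIKE ?
      AND published >= ?
      AND guid NOT IN (#{placeholders})",
  [OSM_GUIDS, oldest_date.getutc.strftime('%Y-%m-%d %H:%M:%S'), *feed_guids]
)
db.close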

Keeping the DB allows more advanced searching of old blog posts.

harry-wood commented 7 years ago

Yes. Essentially detecting deletions: if there's something in the database which is now gone from the RSS, and it's not gone due to being old... then delete it. A bit complicated, but I think it'll work.

I think there's a theoretical hole in this logic which would allow spam into the blogs.osm.org database if it was flying through the RSS too quickly... however, in the normal run of things (a combination of new and old stuff in the latest RSS, and only ever displaying recent stuff anyway) it'll work.

So... not sure it's the cleanest. The clean thing about this approach is that it won't require any changes at the OSM end.

grischard commented 6 years ago

Instead of deleting the whole planet.db file, could we drop the osm diaries from the table and force a refresh right before the cron command runs? This forces a refresh for me, and the first line might be superfluous:

delete from items where guid like "http://www.openstreetmap.org/user/%/diary/%";
update feeds set http_etag="delicious", http_last_modified="Mon, 09 Aug 2004 12:00:00 GMT";
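
A hedged sketch of how those two statements might be wired in as a pre-build step, using the sqlite3 gem from Ruby; the exact pluto build invocation is whatever the cron job already runs.

# Hypothetical pre-build step: drop the cached OSM diary items and reset the
# conditional-GET headers so the next pluto build re-fetches the diaries feed.
require 'sqlite3'

db = SQLite3::Database.new('planet.db')

# First statement above: remove the cached diary items.
db.execute(
  "DELETE FROM items WHERE guid LIKE ?",
  ['http://www.openstreetmap.org/user/%/diary/%']
)

# Second statement above: invalidate the cached ETag / Last-Modified so the
# feed is re-downloaded instead of being answered with 304 Not Modified.
db.execute(
  "UPDATE feeds
      SET http_etag = 'delicious',
          http_last_modified = 'Mon, 09 Aug 2004 12:00:00 GMT'"
)
db.close

# Then run the usual cron command, i.e. pluto build with the live site's config.
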
tomhughes commented 6 years ago

The side effect is that anything that is no longer in the most recent entries on osm.org will be removed from the planet, right? I wonder how often (if at all) that happens?

grischard commented 6 years ago

Yes, the planet will then only contain the last x osm diary entries. As far as I can tell, old entries are purged anyway - they don't appear on https://blogs.openstreetmap.org in any case.

Another more complicated approach would be to trigger a delete on that guid in the sqlite, possibly followed by a rebuild, when an osm diary entry gets moderated.

Guessing which entries have been moderated by looking at consecutive rss feeds isn't very practical.

geraldb commented 6 years ago

trigger a delete on that guid in the sqlite

Hello, I'm the pluto dev. If I may think out loud:

Adding the "custom" delete script for guids in ruby should be fairly easy (just a couple of lines). If that's workable I'm happy to put together a sample script to get you started. If I may advertise pluto - that's why pluto is (way) better than, let's say, planet.py :-) - pluto has an SQL database (management system)! That makes it easier to manage the data. Happy new year. Prosit 2018! Greetings from Vienna. Cheers.

grischard commented 5 years ago

Happy new year @geraldb ;). Yes, a small script that would somehow force reimporting everything from https://www.openstreetmap.org/diary/ on every run, without affecting all the other sources, would be fantastic. Would it be as simple as a sql delete where source like 'OpenStreetMap User's Diaries'?

geraldb commented 5 years ago

@grischard Happy New Year 2019! (almost). All the data (and posts) gets stored in a single-file SQLite database, e.g. planet.db - it should not be too hard to use a good old SQL script to delete spam posts - you can use any tables and any columns; it's just SQL. For the SQL schema, see https://github.com/feedreader/pluto/blob/master/pluto-models/lib/pluto/schema.rb or, better, open up the planet.db with SQLite Studio or the sqlite console and type .schema, for example. (Of course, you can also use Ruby with the ActiveRecord ORM for a spam-deleting script.) Happy to work on making pluto better in the new year (in 2019).
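
A hedged sketch of the ActiveRecord route mentioned there: point ActiveRecord at planet.db, map a throwaway model onto the items table, and delete spam posts by guid. The Item class and the example guid are illustrative, not pluto's own model.

require 'active_record'

# Connect ActiveRecord to pluto's single-file SQLite database.
ActiveRecord::Base.establish_connection(
  adapter:  'sqlite3',
  database: 'planet.db'
)

# A local model mapped onto the existing items table (not pluto's own class).
class Item < ActiveRecord::Base
  self.table_name = 'items'
end

# Guids of moderated/deleted diary entries (hypothetical example value).
spam_guids = [
  'http://www.openstreetmap.org/user/SomeSpammer/diary/12345'
]

Item.where(guid: spam_guids).delete_all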

harry-wood commented 5 years ago

I'm making some progress with this issue upstream: https://github.com/feedreader/pluto/pull/16 Essentially I've implemented @smsm1's suggestion:

Take the oldest item in the RSS, and if it exists in the DB, remove any items from the DB that are newer than that, which are no longer in the RSS feed.

harry-wood commented 5 years ago

I didn't realise, but we went live with my fix 6 days ago (the site was updated to the latest pluto gem, which includes it). So...

FIXED!

From now on, the only spam you see on https://blogs.openstreetmap.org will be either spam which is still present in the diaries RSS, or spam which was present in the previous update run and will be zapped on the next update run (in <30 min).