openstreetmap / openstreetmap-website

The Rails application that powers OpenStreetMap
https://www.openstreetmap.org/
GNU General Public License v2.0

Omit diaries containing Chinese characters from RSS feed #2218

Closed AndrewHain closed 5 years ago

AndrewHain commented 5 years ago

https://blogs.openstreetmap.org is currently useless and WeeklyOSM/Wochennotiz has disappeared within hours.

tomhughes commented 5 years ago

Clearly this is a ridiculous suggestion.

mmd-osm commented 5 years ago

I think that's really a duplicate of https://github.com/gravitystorm/blogs.osm.org/issues/17

tomhughes commented 5 years ago

Well yes, but my point is that censoring diary entries based on what language they are written in is clearly not something we could countenance.

mmd-osm commented 5 years ago

This isn't going to end soon, says alexkemp: https://www.openstreetmap.org/user/alexkemp/diary/338244

"The easiest method to defeat {put in automated spam tool} is to simply require the first post of any new forum member or blog poster to be approved before it can appear."

tomhughes commented 5 years ago

Yes, because I am really looking forward to having 5000 posts to approve when I get up every morning.

I mean obviously that is an option, but it requires significant engineering and is not a quick fix, even if it is practical.

mmd-osm commented 5 years ago

Right. Usually there is only a very small number of non-spam blog posts, and only those would need approval, and then only the very first time someone posts. The rest can be purged automatically after a few days if no one has approved them or complained that their post is still not showing up on the page.
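The first-post approval idea can be sketched roughly as follows. This is only an illustration, not how the website works: the `Post` and `ModerationQueue` classes and the three-day `PURGE_AFTER` window are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

PURGE_AFTER = timedelta(days=3)  # assumed value for "a few days"


@dataclass
class Post:
    author: str
    body: str
    created_at: datetime
    approved: bool = False


@dataclass
class ModerationQueue:
    trusted_authors: set = field(default_factory=set)
    pending: list = field(default_factory=list)
    published: list = field(default_factory=list)

    def submit(self, post):
        """Publish directly if the author was approved before, else hold the post."""
        if post.author in self.trusted_authors:
            post.approved = True
            self.published.append(post)
        else:
            self.pending.append(post)

    def approve(self, post):
        """Approve a held post; later posts by the same author skip the queue."""
        self.pending.remove(post)
        post.approved = True
        self.trusted_authors.add(post.author)
        self.published.append(post)

    def purge(self, now):
        """Drop held posts older than PURGE_AFTER and return how many were dropped."""
        keep = [p for p in self.pending if now - p.created_at < PURGE_AFTER]
        purged = len(self.pending) - len(keep)
        self.pending = keep
        return purged
```

In this sketch only an author's first post ever waits for a human, so the approval workload scales with the number of new legitimate posters rather than with the number of posts.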

Some really low hanging activities could be:

tomhughes commented 5 years ago

Is there any evidence they are actually getting indexed in the few hours before they are removed?

Obviously we can ban posting by new users, but that falls into the category of "collateral damage" that I have just discussed on Alex's latest rant.

mmd-osm commented 5 years ago

Yes, they are showing up in Google's index fairly soon. I tried this yesterday with some random Chinese spam snippets.

tomhughes commented 5 years ago

There is also of course no reason to believe that they wouldn't just add a delay between creating the account and posting.

Frankly I think a more reasoned response would be that it is ridiculous for us to be running a blog system that is entirely unrelated to our primary purpose and just ditch the diaries altogether.

mmd-osm commented 5 years ago

Ah yes, that's kind of the "nuclear option". I was also thinking about shutting down the blog system altogether.

SomeoneElseOSM commented 5 years ago

"Yes, because I am really looking forward to having 5000 posts to approve when I get up every morning."

On that specific point, why does it have to be just you that does it? Lots of people have been banging on about diary spam and surely some of those would be appropriate to have as "diary approvers" with the specific job to approve valid posts (and only that). Sure, the system to allow that won't write itself, but there's no reason that any extra effort once set up needs to sit explicitly on the admins.

mmd-osm commented 5 years ago

I think alexkemp was right about the nature of those posts: they are currently training their spam bots by posting random news articles. Let's see how it goes.

mmd-osm commented 5 years ago

Ah, there's an issue with the robots.txt change: you need to remove the trailing slash in

Disallow: /user/*/diary/

i.e. replace this line with

Disallow: /user/*/diary

Otherwise, the user's blog list (e.g. https://www.openstreetmap.org/user/TomH/diary ) still gets indexed.

Test tool I used: https://webmaster.yandex.com/tools/robotstxt/?hostName=https%3A%2F%2Fwww.openstreetmap.org%2Frobots.txt
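To see why the trailing slash matters, here is a minimal sketch of wildcard matching for robots.txt path rules. It assumes the simplified semantics that major crawlers document ('*' matches any run of characters, a trailing '$' anchors the end, otherwise the rule is a prefix match). The `rule_matches` helper and the entry id 12345 are made up for illustration; real crawlers apply further rules (Allow precedence, percent-encoding) that are not modelled here.

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern matches the URL path.

    Simplified wildcard semantics: '*' matches any run of characters,
    a trailing '$' anchors the end, and everything else is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal parts and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, which gives prefix matching.
    return re.match(regex, path) is not None


# Compare both rule variants against a diary list page and a single entry.
for rule in ("/user/*/diary/", "/user/*/diary"):
    for path in ("/user/TomH/diary", "/user/TomH/diary/12345"):
        print(rule, path, rule_matches(rule, path))
```

Under these assumptions the rule with the trailing slash does not match /user/TomH/diary, because the path has no slash after "diary", while the rule without it matches both the list page and individual entries.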

mvglasow commented 5 years ago

For the moment, if we cannot contain the spam, I would consider removing the user diaries from the feed until we find a way to fix this.

However, looking at the topmost spam post, the related user seems to have been deleted already. So solving #17 (ensuring that diary entries disappear from the blog when the user is deleted) might also solve this issue.
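As a rough illustration of that idea, an aggregator could filter its entries against the set of deleted accounts each time it rebuilds the page. The entry layout (dicts with an "author" key), the user names and the `drop_deleted_authors` helper are all assumptions made for this sketch; how the aggregator would actually learn which users were deleted is not covered.

```python
def drop_deleted_authors(entries, deleted_users):
    """Return the feed entries whose author is not in deleted_users."""
    return [entry for entry in entries if entry["author"] not in deleted_users]


# Hypothetical feed with one legitimate entry and one from a deleted account.
feed = [
    {"author": "mapper_a", "title": "Mapping my village"},
    {"author": "spammer_b", "title": "Buy cheap things"},
]
print(drop_deleted_authors(feed, deleted_users={"spammer_b"}))
```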

Nakaner commented 5 years ago

https://github.com/gravitystorm/blogs.osm.org/issues/40 is related to this issue.

I agree that block rules based on the character sets used are too simplistic, but just ignoring the problem is not an option either.

tomhughes commented 5 years ago

We're not ignoring anything - we are making ongoing efforts to fight the spam and to add new features to help control it.