publiclab / plots2

a collaborative knowledge-exchange platform in Rails; we welcome first-time contributors! :balloon:
https://publiclab.org
GNU General Public License v3.0
960 stars · 1.83k forks

Spam account detection/reduction planning #5450

Open skilfullycurled opened 5 years ago

skilfullycurled commented 5 years ago

Hi,

I've looked at a lot of notes about spam, and it seems there are a few solutions in progress, but I couldn't see any full implementation. If there is one, please close this. This comment could also be merged with https://github.com/publiclab/plots2/issues/3798, https://github.com/publiclab/plots2/issues/2819, https://github.com/publiclab/plots2/issues/4321, https://github.com/publiclab/plots2/issues/4323, https://github.com/publiclab/plots2/issues/4966, or https://github.com/publiclab/plots2/issues/2321.

Short story: I've been working with the user data from the new stats download functionality and, needing to remove as much spam as possible, I did a bit of exploratory data analysis. While it's obviously ideal to have a more robust solution (see issues above), the breakdown below indicates that we could probably significantly reduce the amount of user-registration spam (maybe as much as 82%) if we leverage the fact that putting a URL in the bio text is a huge indicator of spam. See below for the derivation, but given the following assumptions, of 850 users with a URL in their bio out of a total of 1048 registered users, only one appears to be an actual user.

One idea would be to add a separate text box for a single URL. If, upon validation, there is either more than one URL or any non-URL text, the registration is rejected. To avoid having to add a field to the database, we could simply append the URL to the end of the bio text, sort of like a signature, before it gets sent to the database. Or perhaps a text box in which they have to retype the URL they've added to their bio. In both cases, it limits the user to just one URL, but I think that's a reasonable trade-off.
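A rough sketch of that single-URL validation idea, in Python for illustration (the real check would live in a Rails validation; the regex and function name here are hypothetical): accept the field only if it contains exactly one URL and nothing else.

```python
import re

# Hypothetical check for the proposed single-URL registration field:
# accept only if the field is exactly one URL with no surrounding text.
URL_RE = re.compile(r'https?://\S+')

def valid_url_field(text):
    text = text.strip()
    urls = URL_RE.findall(text)
    # reject zero URLs, multiple URLs, or any non-URL text around the URL
    return len(urls) == 1 and urls[0] == text
```

Under this rule, "https://example.com" passes, while "visit https://example.com" or two URLs would be rejected.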

PREMISES TO DETECT SPAM ON REGISTRATION:

  1. Spam users have bios (we wouldn't know otherwise unless they post).
  2. Real users are less likely to have URLs in their bio.
  3. If a real user has a URL, it is less likely that the bio begins with it.
  4. Real users are less likely to have a total bio length greater than 1000 (based on some distributions I'll post to the site eventually).

STARTING USERS: 1048
NOTE: probably not very randomized; just the head and tail of the entire set. Also, not sure how large a sample I actually need.

USERS W/ TEXT IN BIO: 941
  HAS URL IN BIO: 850
  NO URL IN BIO: 91

HAS URL IN BIO: 850
  BEGINS WITH URL: 522 (assume low probability of humans beginning a bio with a URL)
  DOES NOT BEGIN W/ URL: 328

DOES NOT BEGIN W/ URL: 328
  SPAM: 327 (read through all)
  ACTUAL USER: 1

NO URL IN BIO: 91
  SPAM: 60 (read through all 91; they were all about fencing, the house type not the sport)
  ACTUAL USERS: 31
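As a sanity check, the premises above can be written down as a rule-of-thumb flag (Python for illustration only; the thresholds and the exact URL regex are my assumptions, not anything that would ship as-is):

```python
import re

URL_RE = re.compile(r'https?://\S+')

def likely_spam(bio):
    """Rule-of-thumb flag based on the premises above; thresholds are guesses."""
    bio = (bio or "").strip()
    if not bio:
        return False               # premise 1: without a bio there's no signal
    if URL_RE.match(bio):
        return True                # premise 3: bio begins with a URL (522 cases)
    if URL_RE.search(bio) and len(bio) > 1000:
        return True                # premises 2 + 4: URL present and very long bio
    return False

# Per the sample above, 849 of the 850 bios with URLs were spam
# (522 beginning with a URL plus 327 others); that's roughly 81% of
# the 1048 sampled accounts a URL-based rule could address.
```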

jywarren commented 5 years ago

This is great, thanks so much! I wanted to ask a couple things that needn't block but could ease the way forward potentially --

  1. is there a way we could warn real people in a nice way that they've been banned and can appeal it, so if we know that we're likely to have mis-banned, for example, 5 of 5000 (or whatever) accounts, those 5 people have recourse? Because I think that'd potentially be worth the trouble, though we can discuss.
  2. Are there any additional rules we might apply to narrow things even more? Like, a rule we can empirically derive from the analysis you did above -- anything that 1 user had that otherwise distinguishes them from the rest?
  3. For some of these stats, had anyone posted anything besides their profile? Any comments or notes? (maybe this is hard to tell from the data you have)
  4. Would you be comfortable sharing a https://gist.github.com/ list of filtered UIDs so we can use that in future analyses?

Thanks, Benjamin!!!!!


skilfullycurled commented 5 years ago
  1. is there a way we could warn real people in a nice way that they've been banned and can appeal it...

I'm sure, especially since if they haven't posted any content, there isn't any loss of contribution. Off the cuff, maybe we just say the account is on hold until we hear from them. Something like: "Our system has flagged your profile as potentially being spam. We have placed a temporary hold on your account. We apologize if this is in error. If it is, to remove the hold, please reply to this email and [something to do]. If we do not hear from you in X days, your account will be deleted." Then maybe we send them two emails?

  2. Are there any additional rules we might apply to narrow things even more?

Ummmm...well, there was that fencing thing...but seriously folks, yes, I'll need to do a bigger sample, but I'll run some n-gram counts another time to see if there are one- or two-word groupings that are very popular. Upon first glance, I think there are some subtle tiny words: we, us, he, she, our(s), they, your, has, have, is, been. There's a general tendency to not talk in the first person and not talk about the present.

"We have the best..." "He will give you great service" "We have been around for..." "Our service can give you..."

When there is writing about one's self, it doesn't seem to include things such as "I am" or "I have" (I'll need to test a bigger sample):

"Kite aerial photographer with, and co-founder of, West Lothian Archaeology, a non-commercial community group: http://www.armadale.org.uk/phototech.htm" or just "spectrography enthuziast"
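The sliding-window n-gram count described above is a few lines with the standard library (a sketch; the example bios in the usage note are made up in the style of the quotes above):

```python
from collections import Counter
from itertools import islice

def ngram_counts(texts, n=2):
    """Count sliding-window word n-grams (default: pairs) across a set of bios."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        # zip the word list against itself shifted by 1..n-1 to form windows
        counts.update(zip(*(islice(words, i, None) for i in range(n))))
    return counts
```

For example, `ngram_counts(["We have the best deals", "We have been around for years"]).most_common(1)` surfaces `("we", "have")` with a count of 2, matching the third-person pattern described above.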

  3. For some of these stats, had anyone posted anything besides their profile? Any comments or notes?

That's the next step for me anyway. I'll be merging the users and other tables and I've added columns to the sample users (and can add to larger samples) for the presence of a URL and the length of the bio.

  4. Would you be comfortable sharing a https://gist.github.com/ list of filtered UIDs so we can use that in future analyses?

Of course! The dev team is really the group I'd want to be comfortable with it, and soon the data will be available anyway, so absolutely.

grvsachdeva commented 5 years ago

First of all, really awesome analysis @skilfullycurled!

Ummmm...well, there was that fencing thing...but seriously folks, yes, I'll need to do a bigger sample, but I'll run some n-gram counts another time to see if there are one- or two-word groupings that are very popular. Upon first glance, I think there are some subtle tiny words: we, us, he, she, our(s), they, your, has, have, is, been. There's a general tendency to not talk in the first person and not talk about the present. "We have the best..." "He will give you great service" "We have been around for..." "Our service can give you..."

I was given moderator privileges a year back for testing purposes, but after some time I had to redirect all the moderation emails to the bin, as they were just filling up my inbox, and also due to time constraints. But I observed that most of the research notes which enter moderation have some common keywords like sex, casino, escort, etc. in them. We could also consider banning profiles on the basis of occurrences of such keywords.
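A minimal version of that keyword check (illustrative only; the keyword list comes from the moderation observation above and would need tuning so legitimate uses don't get caught):

```python
# Keywords taken from the moderation observation above; purely illustrative.
SPAM_KEYWORDS = {"sex", "casino", "escort"}

def contains_spam_keyword(text):
    """True if any whole word in the text matches the keyword list."""
    return not SPAM_KEYWORDS.isdisjoint(text.lower().split())
```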

Filtering on the basis of URLs seems a good idea, as most businesses drop their contact info in the bio, but at the same time, genuine users might also get targeted. Would you be willing to come to the site again, or would you feel offended, if you were directly banned? Yes!

So, I like this idea of sending emails:

apologize if this is in error. If it is, to remove the hold, please reply to this email and [something to do]. If we do not hear from you in X days, your account will be deleted." Then maybe we send them two emails?

We can automate this process maybe with mailman?

Thanks!

skilfullycurled commented 5 years ago

@gauravano, sorry, for some reason I didn't see this response. That's good info, and it would be easy enough to break up the spam into individual words and sliding window word pairs to gather a list of the top most frequent words. Interestingly, in the tag graph I posted, I removed a clustering which was about escorts. I assumed that was not one of our new DIY initiatives.

I have a question about what happens during the moderation process: when someone clicks on the garbage can, does it delete the user as well as the spam posting? Right now, I'm still able to go to the spam user's profile, but I didn't know if that would be the case after a moderator clicks on the garbage can.

I ask because if I had 1000 postings for which I also knew the user, that would really help. The postings I can keep because I'll get the emails and can parse those later.

grvsachdeva commented 5 years ago

No issue!

Ok, so you can explore all the things at admin_controller.rb. The garbage icon refers to delete, I guess, as we use another icon for spam. Can you post an image so I can tell exactly? There's no option to delete a user for now. But yes, there's an option to delete a note, so you'd be deleting the spam note on clicking the garbage icon.


skilfullycurled commented 5 years ago

@gauravano that's perfect actually. We can continue deleting spam content, but since I'll have the email, I'll still be able to associate them with a user later.

skilfullycurled commented 5 years ago

@jywarren here's the gist you requested.

jywarren commented 5 years ago

Let's start to narrow towards an initial pass, starting with the easiest to deal with. I think the order might be, easiest to hardest:

...who else? I'm considering factors like:

skilfullycurled commented 5 years ago

I think the factors you're considering are really smart. Perhaps it would be good for me to confirm their implications by getting some figures. At the extreme, if, say, item 2 only contains 5 users, then it won't be a good test group.

Some follow-up questions/thoughts:

1) I'm not clear how criteria 1 and 3 are different. My interpretation is that 1 is a subset of 3. If you're looking for feedback on what percentage of spam is accounted for by users who never log in, that could be ascertained afterwards.

2) Can you say more about the 6-month time constraint on spam (item 2)? Is it that if we did it for longer, we'd be sending out a massive amount of email and, in the worst case, getting a lot of appeals?

3) Do you know offhand what indicates spam content (e.g., a "1" in the "is_spam" field)?

jywarren commented 5 years ago

Hi @skilfullycurled -- thanks! Yeah, if it requires confirmation from a person, we can't write a query for it, true. I guess the additional factor could prove useful further down the list as we get all the low-hanging fruit, and if we get past the point where we can easily write discrete queries... maybe. We could imagine a time window where we say leave a comment on this post before _____ date and your account won't be "trimmed", which turns an ask for confirmation into a query-able factor (i.e. once they leave a comment -- or subscribe to a list -- we won't try to trim them any more).

6 months, I dunno. We have ~10 years of accounts. I figure we needn't feel bad at 6 months, and then we could tighten that up iteratively to closer timeframes. But it's a bit arbitrary.

  3. For "is_spam" I just meant it's been marked as spam by a moderator, so that would miss a lot, but it's easy to query for. This is thinking about how there are lots of accounts that have generated spam that we still maintain records for. The downside: we lose easy access to the email record and content of the spam. Solution: we could archive the spam for analysis purposes, just not keep it in the database.

jywarren commented 5 years ago

I think we've attempted to collect stats on some of these variations in the past but am not sure where. In an issue. If we could pull the Rails queries for these that'd be useful... hmm.

https://github.com/publiclab/plots2/search?p=2&q=spam+accounts&type=Issues maybe?

Here's some useful ones: https://github.com/publiclab/plots2/issues/4 (classic!) and https://github.com/publiclab/plots2/issues/974 (comprehensive!)

skilfullycurled commented 5 years ago

Thanks for the clarifications! One remaining: I think one of us is misunderstanding either my first question or your answer to it, although either way your answer brings up an excellent point.

To clarify my first question, I was just wondering whether items 1 and 3 differ, or whether by just doing item 3 we will also be doing item 1.

To me they read:

1) Not A, B, C, or D
3) Not A, B, C, and D; or not D

skilfullycurled commented 5 years ago

Oh, on your third point responding to my "is_spam" question, what I meant was do you know off hand what field in our database connotes that something was marked as spam?

jywarren commented 5 years ago

Ah yes, for nodes it's status == 0, and for users too, I believe!

jywarren commented 5 years ago

@skilfullycurled maybe your question is about the checkbox checklist, and not the criteria? I think that may be my confusion? If so, then yeah you are correct, (3) contains (1) but (1) is easier to determine as it's only one column, and we may get an immediate win from that. Hope that's what you're asking!

jywarren commented 5 years ago

Whereas (3) is a join!

skilfullycurled commented 5 years ago

@jywarren, yes, exactly we're in Q&A sync now. : )

Okay. I'll add this to a list of things. If the month of June were a server, it was recently moved to the subdomain "unstable" so I'll try to work on it as I can.

phoenixbird357 commented 5 years ago

Hey there!

First of all, let's clarify some points.

Why do we care about spam users?

Because we don't want them to share spam content anywhere on our portal. If someone creates a spam account with a spam bio or anything but doesn't share spam content, then it won't hurt us. We can think of them as just another fake account and ignore them.

But if they share any spam content, that's a red flag. We must develop some way to block them.

I'm not at all aware of the size of the community and all those stats about Public Lab. I saw that whenever someone clicks on the "Flag as spam" button on a post, it directly asks them to send an email to the moderators. This is a really bad way to do it; instead, the flag should be stored internally in the database.

I wanna ask the moderators: over what time period do we get spam flags for spam posts, and how many?

Let's say we get 10 spam flags in a period of like 1 hr; then it's quite easy to develop an automatic spam-blocking system which will have a low false-positive rate.

If we try to build our own spam detection model based on inspection of content, I would say it will probably have high false-positive as well as false-negative rates. Instead, I would recommend using existing services like Akismet anti-spam to evaluate the content.

skilfullycurled commented 5 years ago

@phoenixbird357, these are very good considerations. Thank you and @jywarren for taking a step back and asking how the implications of these questions will affect the design.

Note: The original post does place an emphasis on future detection, but this issue is also about past detection for reasons I cover below.

Why do we care about spam users?

The stats page provides user data in raw terms and over time, and that data might be used for things such as writing grants or presentations. Spam users drastically inflate that data by at least 300,000 registrations, an extremely conservative estimate.

I haven't moderated spam in a while (just collecting data), so I don't recall the feature you mentioned, but I've always assumed moderator communication was for ambiguous content/reevaluation...?

My understanding is that content can be marked in the database or it can be deleted.

The amount of spam varies; here are the numbers of "first time posts" per day which are moderated. Once a first-time post is approved, subsequent posts by that user are not seen by a moderator.

Thursday, June 6, 2019: 52
Friday, June 7, 2019: 13
Saturday, June 8, 2019: 16
Sunday, June 9, 2019: 3
Monday, June 10, 2019: 17
Tuesday, June 11, 2019 (today): 55 already!

Let's say we get 10 spam flags in a period of like 1 hr; then it's quite easy to develop an automatic spam-blocking system which will have a low false-positive rate.

Absolutely. We did this on another site I work on; I think the main qualification is that a user posts some number of posts in the span of a minute, and our spam load there is now nearly zero.

I didn't mean to imply developing any models, although I admit there are hints of this : ). I think the best strategy is to prevent spam by preventing registration of spammers. However, ReCaptcha wasn't working (#4323) so I thought a simple substitute would be to ask registrants to retype a URL since 850 spam users out of 1048 use one.

I completely agree that we should not re-invent the wheel. Although, I disagree with the accuracy rates of trained spam/ham models. : )

EDIT: explained moderation stats

skilfullycurled commented 5 years ago

For future spam detection, I found this Ruby gem, invisible_captcha, which seems pretty easy to implement.

skilfullycurled commented 5 years ago

I saw that whenever someone clicks on the "Flag as spam" button on a post, it directly asks them to send an email to the moderators. This is a really bad way to do it; instead, the flag should be stored internally in the database.

@phoenixbird357, I see what you're saying now. You're not talking about moderators; you're talking about when users mark something as spam. Hold on, I want to see this in action...

skilfullycurled commented 5 years ago

Okay, I just tried it out but haven't received the email yet. I'm not sure how often things have been flagged in the past, but I imagine it's exceedingly rare, since in order for a post to be in a position to be flagged, it first must be approved by a moderator, and its spam-like nature has to be so subtle as to be mistaken for a legitimate post. So it's possible that's an edge case for which it is a better use of development time to simply use a mailto and let the moderators manually mark it as spam (which does go in the database) rather than implement a more robust flagging feature.

phoenixbird357 commented 5 years ago

The stats page provides user data in raw terms and over time, and that data might be used for things such as writing grants or presentations. Spam users drastically inflate that data by at least 300,000 registrations, an extremely conservative estimate.

Oh, I'm not sure what's used for "writing grants or presentations", but I feel that should be based on weekly active users through some analytics service, not on the total number of users, because apart from the spam users there are fake and inactive accounts.

I think the main qualification is that a user posts some number of posts in the span of a minute

By this, do you mean "users report some number of posts as spam in a span of a minute" ?

I think the best strategy is to prevent spam by preventing registration of spammers. However, ReCaptcha wasn't working (#4323) so I thought a simple substitute would be to ask registrants to retype a URL since 850 spam users out of 1048 use one. For future spam detection, I found this ruby gem invisible captcha which seems pretty easy to implement.

Oh? I thought this was about blocking spam from humans, not bots. If we have to block the bots, then that's quite easy; any kind of captcha works effectively. ReCaptcha should work fine; I may try to debug it some time.

Although, I disagree with the accuracy rates of trained spam/ham models. : )

Yes, if the accuracy of a trained model is good, it's a win for everyone :)

I see what you're saying now. You're not talking about moderators, you're talking about when users mark something as spam.

Yep

I'm not sure how often things have been flagged in the past, but I imagine it's exceedingly rare, since in order for a post to be in a position to be flagged, it first must be approved by a moderator

Oh, I didn't know it must be approved by a moderator to post anything. Isn't that a large amount of work for moderators? I mean, how many posts are created per day? And all those posts are manually evaluated by moderators, which involves a lot of effort.

skilfullycurled commented 5 years ago

Grants/presentations was just an example; other reasons will likely surface. It also strains the server to go through unnecessary users, which is an ongoing problem (#5524). Either way, we still need to remove roughly six years of past spam users.

Some of the following responses may no longer be relevant now that we've clarified flagging/pre-moderated and bots/humans but I'll respond to them anyway.

By "a user posts some number of posts in the span of a minute" I meant that it "catches" users who post five distinct research notes within less than a minute.
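That burst rule is easy to state precisely: flag a user if any n of their post timestamps fall within a short window (a sketch; n=5 and the one-minute window are just the example values from this thread, not settled parameters):

```python
from datetime import datetime, timedelta

def bursty(timestamps, n=5, window=timedelta(minutes=1)):
    """True if any n posts (by timestamp) fall within the given window."""
    ts = sorted(timestamps)
    # compare each timestamp with the one n-1 positions later
    return any(ts[i + n - 1] - ts[i] <= window for i in range(len(ts) - n + 1))
```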

Oh ? I thought this was about blocking spam from the humans, not bots.

It's both, but I think the idea is that if we block the bot users, then that will cut down on the need to moderate spam made by humans. You bring up an excellent point, though: we do not know how much spam is by bots. Implementing any "prove you're human" intervention may do nothing at all.

Oh, I didn't know it must be approved by a moderator to post anything... And all those posts are manually evaluated by moderators...

No, no. I wasn't clear in the stats above. Once your first post is approved, none of your subsequent posts are subject to moderation in advance. The stats above indicate how many pieces of content a moderator must attend to. I've updated the stats above to be more specific.

...which involves lot of effort.

Despite the clarifications above, yes, I still think it's a lot of effort. It's certainly not great to have your attention interrupted by it.

skilfullycurled commented 4 years ago

Update: I now have enough spam that we could probably make progress on this if anyone is interested.

steviepubliclab commented 4 years ago

Interested!

skilfullycurled commented 4 years ago

Bringing in @Uzay-G, who has expressed interest in this domain. @Uzay-G, take a read, and if it's of any interest, even if you just want to play around, let us know. It's something that would be very useful not only for having correct stats, but also so that the Public Lab data could be used for future ML projects. And since it's not integrated with the website, you can use whatever tools you want. You mentioned in the ML thread that you've worked with spaCy, which I've also been interested in for named entity recognition. It seems like a really awesome library!

Uzay-G commented 4 years ago

Yeah, this seems like an interesting subject and I would like to participate. I have used spaCy for Named Entity Recognition and I think I could apply it to this problem, but I don't have much experience with data analysis. I am searching for other characteristics we could identify that are found in many of the spam messages.

skilfullycurled commented 4 years ago

Wow, I’m super impressed you did some NER already.

When I was going to attempt this issue, I found out that the exercise of building a spam detector is called "spam vs. ham". Apparently, it's a pretty canonical first introduction to NLP. There are a lot of tutorials on how to build a spam/ham classifier from scratch. There are also tutorials that use ones people already built, but that's probably not as fun ; ).
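For a sense of what those from-scratch tutorials build, here is a toy word-count Naive Bayes spam/ham classifier (pure standard library; the tiny training set is invented for illustration and nothing here is tuned for real use):

```python
import math
from collections import Counter

class NaiveBayes:
    """Toy word-count Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = {label: Counter() for label in set(labels)}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        scores = {}
        for label, counts in self.word_counts.items():
            total = sum(counts.values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / (total + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

# Invented miniature training set, echoing patterns noted in this thread.
clf = NaiveBayes().fit(
    ["we have the best casino deals", "our escort service is great",
     "kite aerial photography notes", "water quality testing results"],
    ["spam", "spam", "ham", "ham"])
```

With this toy data, `clf.predict("best casino service")` comes out "spam" and `clf.predict("kite photography")` comes out "ham"; a real classifier would of course train on the collected moderation data instead.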

Alternatively, if you wanted to build one as though you were the first human to ever attempt the project, you can experiment by looking at this flow chart which I’ve used to choose the most appropriate ML approach given my goals and amount of data.

For the data, there are two parts:

The emails I’ve collected don’t have the text of the spam, only the username and the fact that the content was marked as spam. So, the emails need to be parsed for the email and username which can then you can do a query on the data which is in a csv.

I have the emails, which I can save as an MBOX. Then what I've done is use an MBOX parser; there's one in the standard Python library. I have an example from a different project if you run into problems.

It’s possible the current data will not classify the older spam, and if that’s the case (or you just want to) we could get more data by filtering the older data for features that are common to all spam. There’s a list of them above in this thread (I think?) or one of the spam threads you commented on. I can try to find it later if you can’t.

Uzay-G commented 4 years ago

Yes I think it would be a good idea for me to check out those tutorials and then try to build a spam vs ham classifier that would be personalized to detect spam symptoms moderators have noticed on Publiclab. Once I have a working prototype that can classify spam, I will try to integrate my classifier with Publiclab data and see how it works!

I will get started when I have some time :+1:

skilfullycurled commented 4 years ago

Actually, now that I think about it, that was before I learned there was a moderation dashboard which has links to the actual comments/profiles that were marked as spam. If we got a CSV of that query, you might not have to parse anything; you'd only need to cross-reference the identifier with the site's content data to get the text.

@jywarren, is it possible to run the same query for the moderation spam list page, but save it to a csv instead of running it through the template?

Uzay-G commented 4 years ago

Wait, so how could I access the spam profile text in my program? Do I have to do web-scraping to get it?

skilfullycurled commented 4 years ago

Oh, sorry, @Uzay-G, I made an assumption in my explanation. We would give you all of the data you would need.

jywarren commented 4 years ago

Hi, it would be even easier to provide the feed in json, if that could work? Csv is also possible but takes some work to encode. Thanks!


skilfullycurled commented 4 years ago

Yes, whatever is easiest for you so long as it’s in a format which is machine readable, preferably without having to write our own parser!

Oh, and no classless HTML.

Don’t think I didn’t see your hand moving for that button @jywarren!

: P

SidharthBansal commented 4 years ago

@keshavsethi is interested in spam. Please take a look at this issue, Keshav.

skilfullycurled commented 4 years ago

Hey @keshavsethi,

That’s great! @Uzay-G is interested as well. There’s plenty to do. Give this thread a read, it has a lot of good resources you’ll need.

There are a few things that need to be done or could be done:

It would be really helpful to experiment with a quick way to sort spam from the past users (see above). It doesn't have to involve machine learning. We'll also need to send out letters to those past users letting them know, and whom to contact if this is an error.

Then for future spam, you can each try a different spam/ham module (or the same one) and see how well you can get it to classify, then combine your classification features to see if you can improve it.

Or try using Akismet and see how it performs on the old users.

Both of these will eventually need to be integrated into the site as well.

So lots to do, as soon as we get the data from @jywarren (no rush Jeff!), we can decide the next steps.

jywarren commented 4 years ago

Sorry about the stalebot message here, it was a mistake! 😅