publiclab / plots2

a collaborative knowledge-exchange platform in Rails; we welcome first-time contributors! :balloon:
https://publiclab.org
GNU General Public License v3.0

Machine Learning based projects #4660

Open NeuralMonk opened 5 years ago

NeuralMonk commented 5 years ago

Currently, our spam moderation system is completely manual. Instead of manually reviewing similar content and posts over and over, I think we can use machine learning algorithms to ease the task.

jywarren commented 5 years ago

Hi @SKashyapD -- can you help me find your SoC proposal? Did it get posted?

NeuralMonk commented 5 years ago

Hey @jywarren, yes, it was posted on the Public Lab website, and you have already reviewed it: SoC proposal. Is there a problem with it?

Thank you!

skilfullycurled commented 5 years ago

No, no, thank you for taking the initiative on an ML thread, @SKashyapD! I'm not sure I'll be able to take much more initiative on implementing a tag recommendation system, since I don't have a lot of experience programming in Ruby. However, I really want to second your idea of having a server for this.

One thing I would like to do is piggyback on your initiative and eventually start a conversation about how to grow a community around ML and data science, now that the stats downloads page is coming along. More on that later; I have to actually get back to my own data science project!

NeuralMonk commented 5 years ago

@skilfullycurled that would be great!

grvsachdeva commented 5 years ago

Hi @jywarren @SKashyapD @Zengirl2, can we close this issue, or does anyone want to update it? Thanks!

NeuralMonk commented 5 years ago

Hello everyone, not now. I will start working on this after the summer break.

Thank you


grvsachdeva commented 5 years ago

Cool!

NeuralMonk commented 5 years ago

@skilfullycurled @jywarren what other small projects could we start in order to grow the community around data science?

thanks!

budema6 commented 5 years ago

I am a newbie in the ML area. I was going through the problem statement because it looked interesting. Thanks for posting it in detail; that made it much easier to understand. Thanks @skilfullycurled @jywarren @SKashyapD @Zengirl2

skilfullycurled commented 5 years ago

@SKashyapD, thank you for keeping the conversation alive! I'll rejoin as soon as I can. I just started school in a new program, so although PL GitHub conversations are one of my favorite ways to feel like I'm doing programming work while actually just avoiding it (not a joke), I should probably finish my work first. Still, I was too happy about the conversation to not join in. : )

One project comes immediately to mind:

SPAM: There are currently two problems. The first is the spam we currently get from sign-ups and postings, and the second is spam accounts that were made before there was a more robust moderation system. There's a period of time (can't recall what it is) where there are literally ~300,000 accounts. I think Public Lab is awesome, but that seems a tad inflated. ; ) Additionally, I believe that when users are moderated as spam, they are not removed from the database.

Spam isn't the most exciting task, but it'd have a real impact. A) Moderating spam is a huge resource drain. B) Unless we're able to filter out spam accounts, there really can't be good data science because the data won't be good.

This project has two quasi-FTOs. They aren't FTOs according to the actual definition, but the first contains some "hello worlds" of data science that would be good for someone who is comfortable with Ruby (I don't think you have to be awesome at it; I hardly knew Python in my first data science class) but wants to get started in data science. The second is for someone who is comfortable with the fundamental exploratory data analysis tasks and wants to try a simple ML exercise.

I've been collecting a data set of spam.

Project 1: Exploratory data analysis. I started #5450 to discuss non-ML ways to detect spam, and I came up with some guidelines simply by exploring the data. These guidelines could become more robust with more exploratory analysis of a larger dataset. This would be a good way to get familiar with the SciRuby library collection and the fundamentals of data science (using Ruby notebooks, dataframes, selecting data, aggregating results, plotting, etc.). As I said, I've been collecting a dataset of spam, but we also need a way to identify past spam, because I'm sure the markers have changed over time.
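As a rough illustration of the "selecting data, aggregating results" step, here is a minimal sketch in plain Ruby with no gems. Everything in it is an assumption made for illustration: the sample records, the field names (`created_at`, `bio`, `spam`), and the two helper methods are hypothetical and are not taken from the plots2 schema or from the guidelines in #5450.

```ruby
require "date"

# Hypothetical sample records. The field names are assumptions for
# illustration only and do not reflect the real plots2 database.
ACCOUNTS = [
  { name: "alice", created_at: "2013-04-02", bio: "Balloon mapping in Brooklyn",                spam: false },
  { name: "bob",   created_at: "2013-04-15", bio: "Cheap watches http://example.com/buy",       spam: true  },
  { name: "carol", created_at: "2013-04-20", bio: "Best deals https://example.com/deals",       spam: true  },
  { name: "dave",  created_at: "2013-05-03", bio: "Water quality sensors and DIY spectrometry", spam: false },
  { name: "erin",  created_at: "2013-05-09", bio: "Win money now http://example.com/win",       spam: true  },
  { name: "frank", created_at: "2013-05-21", bio: "Notes at https://example.org/frank",         spam: false }
].freeze

# Group accounts by signup month and report how many were flagged as spam.
def signups_by_month(accounts)
  accounts
    .group_by { |a| Date.parse(a[:created_at]).strftime("%Y-%m") }
    .map do |month, rows|
      spam = rows.count { |a| a[:spam] }
      [month, { total: rows.size, spam: spam, spam_share: (spam.to_f / rows.size).round(2) }]
    end
    .sort_by { |month, _stats| month }
    .to_h
end

# Fraction of accounts in one class (spam or not) whose bio contains a URL.
def link_rate(accounts, spam:)
  subset = accounts.select { |a| a[:spam] == spam }
  return 0.0 if subset.empty?

  subset.count { |a| a[:bio].match?(%r{https?://}) }.to_f / subset.size
end

signups_by_month(ACCOUNTS).each { |month, stats| puts "#{month}: #{stats}" }
puts "bios with links, spam: #{link_rate(ACCOUNTS, spam: true).round(2)}"
puts "bios with links, ham:  #{link_rate(ACCOUNTS, spam: false).round(2)}"
```

A month with an unusually high signup count or spam share would be a candidate for the "past spam" problem, and a large gap between the two link rates would suggest a simple non-ML marker worth testing on the real dataset.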

Project 2: Creating a spam/ham classifier. This is why I started the collection actually, so that we'd have enough for the spam part. The harder thing is collecting data for people who are in the ham category. So that's sort of in the Project 1 category, but after we have enough of both, then there are plenty of tutorials for someone to have a nice learning experience.
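To make the spam/ham exercise concrete, here is a minimal sketch of one common approach, a multinomial naive Bayes classifier with add-one smoothing, in plain Ruby. The choice of algorithm is an assumption on my part (nothing in this thread has settled on one), and the `NaiveBayes` class, the `TRAINING` examples, and the `build_classifier` helper are all hypothetical. A real attempt would train on the collected spam set plus a ham set and evaluate on held-out data.

```ruby
# Minimal multinomial naive Bayes with add-one (Laplace) smoothing.
# Illustrative only: the class and the tiny training set are made up.
class NaiveBayes
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # label => word => count
    @doc_counts  = Hash.new(0)                            # label => documents seen
    @vocabulary  = {}                                     # every word seen in training
  end

  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |word|
      @word_counts[label][word] += 1
      @vocabulary[word] = true
    end
  end

  # Returns the label with the highest score (requires at least one trained document).
  def classify(text)
    scores(text).max_by { |_label, score| score }.first
  end

  # Unnormalized log-probability for each label. Words never seen in
  # training are skipped so they do not favor either class.
  def scores(text)
    total_docs = @doc_counts.values.sum.to_f
    words = tokenize(text).select { |w| @vocabulary.key?(w) }

    @doc_counts.keys.map do |label|
      total_words = @word_counts[label].values.sum
      score = Math.log(@doc_counts[label] / total_docs)
      words.each do |word|
        score += Math.log((@word_counts[label][word] + 1.0) / (total_words + @vocabulary.size))
      end
      [label, score]
    end.to_h
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z0-9']+/)
  end
end

# Hypothetical training examples standing in for real spam and ham posts.
TRAINING = {
  spam: ["cheap watches buy now", "win money now click here", "best deals buy cheap pills"],
  ham:  ["balloon mapping workshop this weekend",
         "calibrating my spectrometer with a cfl bulb",
         "water quality sensor data from the river"]
}.freeze

def build_classifier(training)
  classifier = NaiveBayes.new
  training.each { |label, texts| texts.each { |text| classifier.train(label, text) } }
  classifier
end

classifier = build_classifier(TRAINING)
puts classifier.classify("buy cheap watches now")
puts classifier.classify("balloon mapping data from the river")
```

With only six training sentences this proves nothing about accuracy; it only shows the moving parts (tokenizing, counting, smoothing, scoring) that the tutorials mentioned above build on.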

skilfullycurled commented 5 years ago

My pleasure @budema6, I'm excited about developing a community so it's really thanks to you for your interest!

skilfullycurled commented 4 years ago

Update: I now have enough spam if anyone ever wants to take on training a spam/ham classifier for the site. If I recall, I've seen a number of Jupyter notebooks that do this in Ruby. Of course, the data has to be parsed, and we need a ham dataset as well. In any event...

Uzay-G commented 4 years ago

Hey! This topic really interests me, and I have made some natural language processing projects with Python and the spaCy library. I'd love to help out and try applying NLP to spam detection. I'm no expert, but I think I could help :smile:

skilfullycurled commented 4 years ago

@Uzay-G, thanks for reviving this thread. I'm not sure when/how but I'm thinking it might be a good idea to try to have a call. It just seems like there's enough interest in general, and it might be good to just meet each other and see if we can organize ourselves. I'd sort of like to see this become a tool topic just like balloon mapping or spectrometry. And, perhaps at some point even have a separate PL repo for projects the same way mapknitter does.

Anyone at @publiclab/connectors, how are we handling developer open calls these days?

stale[bot] commented 4 years ago

Hi :smile:, this issue has been automatically marked as stale because it has not had recent activity. Don't worry you can continue to work on this and ask @publiclab/reviewers to add "work in progress" label :tada: . Otherwise, it will be closed if no further activity occurs in 5 days -- but you can always re-open it if you like! :100: Thank you for your contributions :raised_hands: :balloon:.

jywarren commented 4 years ago

Sorry about the stalebot message here, it was a mistake! 😅 Can't seem to delete due to a GitHub API issue... strange. Carry on!