Exploring ML for building a more robust and scalable version of Kindly

nathanfletcher commented 3 years ago

Looking into ways to achieve what Kindly #41 does, and to improve on it. This may also lead to solutions that are not vendor-locked. @nathanbaleeta may have a few ideas.

lacabra commented 3 years ago

@nathanfletcher: can you document here some of the findings? Thanks! 🙏

nathanfletcher commented 3 years ago

I will start here with the basics from discussions with @nathanbaleeta.

A number of things I'll be looking into:

Data sourcing. Leverage the Twitter API to get access to training data.
Natural Language Processing/Understanding. Leverage the Natural Language Toolkit (NLTK) or spaCy (open-source Python-based natural language processing libraries) for data preprocessing before fitting the cyberbullying model.
Machine Learning Algorithms. Build on scikit-learn (an open-source Python-based ML library), which ships with implementations of several ML algorithms out of the box, to build and evaluate the cyberbullying model. Explore shallow learning as a proof of concept while we collect enough data, before embarking on deep learning methods to achieve state-of-the-art results in the long run.
AI/ML technology stack: Python, scikit-learn, pandas, NLTK, spaCy, TextBlob, NumPy, Keras, TensorFlow, Jupyter notebooks, Colab, TensorBoard & FastAPI.
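To make the preprocessing step above a bit more concrete, here is a minimal sketch of tweet cleaning. It is only an illustration: it uses Python's standard `re` module as a stand-in for NLTK or spaCy tokenization, and the `preprocess` function, the regular expressions, and the sample tweet are all hypothetical, not taken from the project.

```python
import re

# Hypothetical patterns for illustration; a real pipeline would likely
# rely on NLTK or spaCy tokenizers instead of hand-written regexes.
URL_RE = re.compile(r"https?://\S+")   # links add little signal for this task
MENTION_RE = re.compile(r"@\w+")       # user handles are dropped to avoid learning names
TOKEN_RE = re.compile(r"[a-z']+")      # lowercase words, keeping apostrophes


def preprocess(tweet: str) -> list[str]:
    """Lowercase a tweet, strip URLs and @mentions, and split it into word tokens."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return TOKEN_RE.findall(text)


tokens = preprocess("@someone You are SO annoying!!! https://example.com/x")
print(tokens)
```

The resulting token list is what would then be handed to a vectorizer before fitting a model. Whether to drop mentions, hashtags, or emoji is a design choice that should be checked against the actual training data.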

nathanbaleeta commented 3 years ago

PROBLEM DEFINITION

The use of Twitter and social networking sites (SNS) such as Facebook to communicate with one another and the world has led to increased instances of cyberbullying, especially among teenagers. (Reference)

Twitter is an American microblogging and social networking service on which users post and interact with messages known as "tweets". Registered users can post, like, and retweet tweets, but unregistered users can only read them. (Wikipedia)

Cyberbullying is the use of information and communication technology to harass and harm in a deliberate, repetitive, and hostile manner.

Types of cyberbullying include bullying someone through social media, harassment, sexting, cyberstalking, deception, impersonation, and sending nasty messages via chat rooms and instant messenger. Here are more examples of cyberbullying.

According to Twitter demographics published by www.statista.com, as of April 2021 users under 24 years old made up almost 24 percent of Twitter's audience worldwide, as shown in the graphic below: statistic_id283119_twitter_-distribution-of-global-audiences-2021-by-age-group

SOLUTION

To solve this problem, we will follow the typical machine learning pipeline. We will first import the required libraries and the dataset. We will then do exploratory data analysis to see if we can find any trends in the dataset. Next, we will perform text preprocessing to convert textual data to numeric data that can be used by a machine learning algorithm. Finally, we will use machine learning algorithms to train and test our sentiment analysis models.
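The pipeline described above could be sketched roughly as follows with scikit-learn. This is an assumption-laden illustration, not the project's implementation: the toy sentences and labels are invented for the example (they are not real tweets or project data), and TF-IDF with logistic regression is just one possible shallow-learning baseline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy, hand-written examples (hypothetical): 1 = bullying, 0 = not bullying.
texts = [
    "you are so stupid and ugly",
    "nobody likes you, loser",
    "shut up you idiot",
    "you are a worthless loser",
    "have a great day my friend",
    "thanks for the help today",
    "congratulations on the new job",
    "see you at the game tonight",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out a stratified test split so both classes appear in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# TF-IDF converts text to numeric features; logistic regression is the classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(predictions, accuracy)
```

With only eight sentences the accuracy figure is meaningless; the point is the shape of the workflow (split, vectorize, fit, evaluate), which stays the same once real labelled data is available.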

nathanfletcher commented 3 years ago

@lacabra My files and practical learnings are in this repository: https://github.com/nathanfletcher/ml_text_classification

amreenp7 commented 2 years ago

@nathanfletcher to include this in documentation before closing it.

unicef / publicgoods-roadmap

Exploring ML for building a more robust and scalable version of Kindly #57