Create a vocabulary

vaquierm / RedditCommentTextClassification

💬 Classification of Reddit comments to the subreddit they were posted in

Closed: vaquierm closed this issue 5 years ago

vaquierm commented 5 years ago

Generate a full dictionary with N-grams (probably only up to bigrams)
Remove all stop words from the dictionary
Lemmatize everything with NLTK
Add custom normalization for things like YouTube links: map every YouTube link to a single custom word
Add custom normalization for weird smileys like （╹◡╹） or ᕦ(ò_óˇ)ᕤ or ヽ(＾ω＾)ﾉ. I feel like this stuff is gonna be all over the anime subreddit. We can map all of these to a single token like "weird anime smiley" or something
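The two custom mappings above could look roughly like this. It's only a sketch using the standard library: the token names youtubelink and animesmiley are made up for illustration, and the smiley check is a crude heuristic (any whitespace-separated chunk that has a bracket plus at least one non-ASCII character), so it will misfire on things like "(café)".

```python
import re

# Hypothetical placeholder tokens; the issue does not fix their names.
YOUTUBE_TOKEN = "youtubelink"
SMILEY_TOKEN = "animesmiley"

# Matches common YouTube URL shapes (youtube.com/... and youtu.be/...).
YOUTUBE_RE = re.compile(
    r"(?:https?://)?(?:www\.|m\.)?(?:youtube\.com|youtu\.be)/\S+",
    re.IGNORECASE,
)

# ASCII and full-width parentheses, which most kaomoji are built around.
BRACKETS = "()（）"


def is_kaomoji(token):
    """Crude heuristic (assumption): a bracket plus any non-ASCII character."""
    has_bracket = any(ch in BRACKETS for ch in token)
    has_non_ascii = any(ord(ch) > 127 for ch in token)
    return has_bracket and has_non_ascii


def replace_special_tokens(comment):
    """Map YouTube links and kaomoji-like chunks to single placeholder words."""
    comment = YOUTUBE_RE.sub(YOUTUBE_TOKEN, comment)
    tokens = [SMILEY_TOKEN if is_kaomoji(t) else t for t in comment.split()]
    return " ".join(tokens)
```

This has to run before punctuation is stripped, otherwise the links and smileys are already broken apart by the time we look for them.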

vaquierm commented 5 years ago

Make sure that you remove all stop words first before getting the N-Grams
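A small sketch of why the order matters. The stop-word set here is a tiny hand-written stand-in (an assumption, not NLTK's actual list), and build_terms is just an illustrative name. If the stop words are dropped first, the bigrams join the meaningful words that are left instead of pairing them with "the" or "on".

```python
# Tiny stand-in stop-word set for illustration; the real list would come from a library.
STOP_WORDS = {"the", "on", "a", "is", "of", "and"}


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_terms(comment, max_n=2):
    """Drop stop words first, then emit 1-grams up to max_n-grams."""
    tokens = [t for t in comment.lower().split() if t not in STOP_WORDS]
    terms = []
    for n in range(1, max_n + 1):
        terms.extend(" ".join(gram) for gram in ngrams(tokens, n))
    return terms


print(build_terms("The cat sat on the mat"))
# ['cat', 'sat', 'mat', 'cat sat', 'sat mat']
```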

vaquierm commented 5 years ago

@hmarine when you create the vocabulary, suppose for example that the words 'run', 'running', and 'ran' are all mapped to the single word 'run'.

The vocabulary will then contain only the word 'run'.
At that point, will you be able to replace every 'ran' and 'running' with 'run' in all comments in the raw data and save the result to a different file called reddit_train_cleaned.csv in the data/raw_data folder? Same thing for the raw test dataset.
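Something like the following would do the rewrite. It's a sketch under a few assumptions: the text column is called "comments" (the real column name isn't stated here), LEMMA_MAP is a hand-written stand-in for whatever the lemmatizer produces, and the function takes open file objects so it can be tried without touching the real data.

```python
import csv
import io

# Stand-in for the word -> lemma mapping produced while building the vocabulary.
LEMMA_MAP = {"ran": "run", "running": "run"}


def clean_comment(comment, lemma_map):
    """Replace each whitespace-separated word with its lemma when one is known."""
    return " ".join(lemma_map.get(tok, tok) for tok in comment.split())


def clean_csv(src, dst, lemma_map, text_column="comments"):
    """Copy a CSV from src to dst, rewriting only the text column."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[text_column] = clean_comment(row[text_column], lemma_map)
        writer.writerow(row)


# In-memory demo; with real files, pass handles opened with newline="".
source = io.StringIO("id,comments\n0,i ran while running\n")
cleaned = io.StringIO()
clean_csv(source, cleaned, LEMMA_MAP)
```

For the real run, the output handle would point at data/raw_data/reddit_train_cleaned.csv, and the same call would be repeated for the test set.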

vaquierm commented 5 years ago

Also, we want to remove punctuation. @hmarine This link is super useful: it covers many of the processing steps that are relevant to our project.

Found a step-by-step tutorial on how to do this
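For the punctuation part, a minimal standard-library sketch (not taken from the tutorial, and strip_punctuation is just an illustrative name) could be:

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(comment):
    """Delete ASCII punctuation and collapse any leftover runs of whitespace."""
    return " ".join(comment.translate(_PUNCT_TABLE).split())
```

Note that string.punctuation only covers ASCII, so full-width or other Unicode punctuation would pass through untouched.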

vaquierm commented 5 years ago

This has been started in branch vaquierm/vocabulary

vaquierm commented 5 years ago

This is a super helpful link for regex. For example, it shows how to map elongated words like

lol loool looooool -> lol
no nooo noooooooo -> no
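One possible pattern for this (not necessarily the one from the link) collapses any character repeated three or more times in a row down to a single occurrence. The function name is made up for illustration.

```python
import re

# A character followed by at least two more copies of itself.
_ELONGATED_RE = re.compile(r"(.)\1{2,}")


def collapse_elongated(word):
    """Collapse runs of 3+ identical characters to one, e.g. 'loool' -> 'lol'."""
    return _ELONGATED_RE.sub(r"\1", word)
```

The catch is that it over-collapses words whose normal spelling has a double letter: "cooool" becomes "col" rather than "cool".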

Turns out a lot of what was done here is useless, since libraries that do all of this already exist... oops https://colab.research.google.com/drive/1OBoUxxhxQiCZ72F3_a7z216kADqEFz3u#scrollTo=j2p4yikZy5cg