vaquierm / RedditCommentTextClassification

💬 Classification of Reddit comments to the subreddit they were posted in

Create a vocabulary #3

Closed vaquierm closed 5 years ago

vaquierm commented 5 years ago

Make sure to remove all stop words before extracting the N-grams.
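A minimal sketch of that ordering, in plain Python. The stop-word list and the names `STOP_WORDS`, `remove_stop_words`, and `ngrams` are made up for illustration and are not taken from the repo; a real run would use a much fuller stop-word list.

```python
# Toy stop-word list for illustration only (an assumption, not the project's list).
STOP_WORDS = {"the", "a", "an", "is", "on", "of", "and", "to", "in"}


def remove_stop_words(tokens):
    """Drop every token that appears in STOP_WORDS (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat is on the mat".split()
filtered = remove_stop_words(tokens)   # stop words go first
bigrams = ngrams(filtered, 2)          # then the N-grams are built
print(filtered, bigrams)
```

Note that filtering first means words that were separated only by stop words end up adjacent in the N-grams, which is the intended effect here.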

vaquierm commented 5 years ago

@hmarine when you create the vocabulary, make sure related word forms are collapsed together, so that for example the words 'run', 'running', and 'ran' are all mapped to the single word 'run'
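Roughly what that mapping looks like, as a sketch only. The `LEMMAS` table below is a hypothetical hand-written lookup covering just the example above; in practice a real lemmatizer (for instance NLTK's `WordNetLemmatizer`) would replace it.

```python
# Hypothetical hand-written lemma table; a real lemmatizer covers far more forms.
LEMMAS = {"running": "run", "ran": "run", "runs": "run"}


def lemmatize(token):
    """Lowercase the token and map it to its base form if we know one."""
    token = token.lower()
    return LEMMAS.get(token, token)


def build_vocabulary(tokens):
    """Return the sorted set of base forms seen in tokens."""
    return sorted({lemmatize(t) for t in tokens})


vocab = build_vocabulary(["run", "running", "ran", "Dog"])
print(vocab)
```

All three forms of 'run' land on one vocabulary entry, which keeps the vocabulary smaller.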

vaquierm commented 5 years ago

Also, we want to remove punctuation. @hmarine This link is super useful; it covers many of the processing steps that are relevant to our project

Found a step-by-step tutorial on how to do this
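For the punctuation part, one possible approach using only the standard library is sketched below. This is not taken from the linked tutorial; `strip_punctuation` is a name made up here, and it only handles the ASCII punctuation in `string.punctuation`.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Remove ASCII punctuation from text, leaving letters, digits and spaces."""
    return text.translate(_PUNCT_TABLE)


cleaned = strip_punctuation("Wow!!! That's, like, so cool...")
print(cleaned)
```

Deleting the characters outright glues contractions together ("That's" becomes "Thats"); replacing punctuation with a space instead is the other common choice.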

vaquierm commented 5 years ago

This has been started in branch vaquierm/vocabulary

vaquierm commented 5 years ago

This is a super helpful link for regex. For example, it shows how to map elongated words back to their normal spelling, like

lol loool looooool -> lol
no nooo noooooooo -> no
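A sketch of one regex that produces this mapping. It is an assumption, not necessarily the pattern from the link: it collapses any word character repeated three or more times in a row down to a single occurrence, so ordinary double letters are left alone.

```python
import re

# A word character followed by at least two more copies of itself.
_ELONGATED = re.compile(r"(\w)\1{2,}")


def collapse_elongated(text):
    """Collapse runs of 3+ identical characters to one, e.g. 'nooo' -> 'no'."""
    return _ELONGATED.sub(r"\1", text)


print(collapse_elongated("lol loool looooool"))
print(collapse_elongated("no nooo noooooooo"))
```

The threshold of three is a design choice: collapsing every doubled letter would also turn 'good' into 'god'.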


But it turns out a lot of what was done here is useless, because libraries that do all this already exist... oops
https://colab.research.google.com/drive/1OBoUxxhxQiCZ72F3_a7z216kADqEFz3u#scrollTo=j2p4yikZy5cg