Closed vaquierm closed 5 years ago
Make sure that you remove all stop words before extracting the n-grams.
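A minimal sketch of that ordering, assuming whitespace tokenization and a tiny hand-written stop word list. `STOP_WORDS`, `remove_stop_words`, and `ngrams` are hypothetical names for illustration, not code from the repo:

```python
# Hypothetical, deliberately tiny stop word list (an assumption for this sketch);
# a real project would use a fuller list.
STOP_WORDS = {"the", "a", "an", "is", "on", "of", "and", "to"}


def remove_stop_words(tokens):
    """Drop tokens that appear in STOP_WORDS (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat is on the mat".split()
filtered = remove_stop_words(tokens)  # stop words are removed first
bigrams = ngrams(filtered, 2)         # n-grams are built from what is left
print(bigrams)
```

Filtering first means the n-grams join content words that were separated only by stop words, which is the behaviour the comment above asks for.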
@hmarine when you create the vocabulary, map inflected forms to a single base word, so that for example the words 'run', 'running', and 'ran' are all mapped to the single word 'run'
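As a rough illustration of that mapping, here is a sketch that uses a hand-written lookup table. `LEMMAS`, `normalize`, and `build_vocabulary` are hypothetical names, and the table is an assumption standing in for whatever lemmatizer the project ends up using:

```python
# Hypothetical lookup table for illustration only; a real lemmatizer
# would cover far more words than this.
LEMMAS = {"running": "run", "ran": "run", "runs": "run"}


def normalize(token):
    """Lowercase the token and map it to its base form when one is known."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


def build_vocabulary(tokens):
    """Return the sorted set of normalized tokens."""
    return sorted({normalize(t) for t in tokens})


vocab = build_vocabulary(["run", "running", "ran", "walk"])
print(vocab)
```

With this normalization the three forms of 'run' take up a single vocabulary entry instead of three.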
data/raw_data folder? Same thing for the testing raw dataset.
Also, we want to remove punctuation.
@hmarine This link is super useful: it covers many of the processing steps that are relevant to our project
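For the punctuation removal mentioned above, one possible sketch uses only the standard library. `remove_punctuation` is a hypothetical helper name, and the sketch assumes that stripping ASCII punctuation is enough for this dataset:

```python
import string

# Translation table that deletes every ASCII punctuation character.
# Assumption: non-ASCII punctuation (curly quotes, etc.) is not handled here.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)


cleaned = remove_punctuation("Hello, world! It's great...")
print(cleaned)
```

Note that this also deletes apostrophes, so "It's" becomes "Its"; whether that is acceptable depends on how the vocabulary is built afterwards.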
Found a step-by-step tutorial on how to do this
This has been started in branch vaquierm/vocabulary
This is a super helpful link for regex. For example, it shows how to map all the elongated variants of a word to one form:
lol loool looooool -> lol
no nooo noooooooo -> no
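One way to get that mapping with Python's `re` module is sketched below. The pattern is an assumption rather than the one from the linked page: it collapses any run of three or more identical word characters into one, so normal double letters such as the "oo" in "good" are left alone. `collapse_elongated` is a hypothetical name:

```python
import re

# A word character followed by two or more copies of itself (a run of 3+).
_ELONGATED = re.compile(r"(\w)\1{2,}")


def collapse_elongated(text):
    """Collapse runs of 3+ identical word characters into a single character."""
    return _ELONGATED.sub(r"\1", text)


print(collapse_elongated("lol loool looooool"))
print(collapse_elongated("no nooo noooooooo"))
```

The threshold of three is a design choice: collapsing every repeated letter would also turn "good" into "god".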
Turns out a lot of what was done here is useless because libraries that do all of this already exist... oops https://colab.research.google.com/drive/1OBoUxxhxQiCZ72F3_a7z216kADqEFz3u#scrollTo=j2p4yikZy5cg