minerva-ml / open-solution-toxic-comments

Open solution to the Toxic Comment Classification Challenge

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

MIT License

154 stars, 58 forks

Added Functions #30

Closed: shaz13 closed this 6 years ago

shaz13 commented 6 years ago

Anonymize - removes user names and IP addresses from the comment
Apostrophes - expands contractions, e.g. replaces aren't with are not
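
For readers without the diff at hand, the two functions described above could look roughly like the sketch below. It is not the code from this PR: the names `anonymize` and `expand_apostrophes`, the regular expressions, and the three-entry `APPO` sample are all assumptions made for illustration, and the user-name pattern only covers wiki-style `[[User:...]]` links.

```python
import re

# Hypothetical patterns: IPv4-looking addresses and wiki-style user links.
IP_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
USER_PATTERN = re.compile(r"\[\[User(?: talk)?:[^\]]*\]\]", re.IGNORECASE)

# Tiny illustrative sample; the real APPO dict in the PR is larger.
APPO = {"aren't": "are not", "can't": "cannot", "it's": "it is"}


def anonymize(comment):
    """Strip IP addresses and user links, then collapse whitespace."""
    comment = IP_PATTERN.sub(" ", comment)
    comment = USER_PATTERN.sub(" ", comment)
    return " ".join(comment.split())


def expand_apostrophes(comment, appo=APPO):
    """Replace whitespace-separated tokens found in the mapping."""
    return " ".join(appo.get(word.lower(), word) for word in comment.split())
```

Note that this sketch matches whole whitespace-separated tokens only, so a contraction followed directly by punctuation (for example `aren't,`) would be left unchanged.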

shaz13 commented 6 years ago

Note: The nltk code uses wordnet and the stopwords list, which must be downloaded into the environment beforehand using nltk.download('wordnet') and nltk.download('stopwords'). Are these persistent in the Neptune environment, or does each run require the download again?
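
Whatever the answer about persistence, one way to sidestep the question is to download only when a resource is missing. The sketch below assumes the standard nltk data layout (`corpora/wordnet`, `corpora/stopwords`); the helper name `ensure_nltk_resources` and its injectable `find`/`download` arguments are hypothetical and exist only so the logic can be exercised without network access.

```python
# Resource name -> lookup path under the nltk data directory (assumed layout).
REQUIRED = {"wordnet": "corpora/wordnet", "stopwords": "corpora/stopwords"}


def ensure_nltk_resources(resources=REQUIRED, find=None, download=None):
    """Download each resource only if a lookup fails; return what was fetched."""
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tested with fakes

        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for name, path in resources.items():
        try:
            find(path)  # nltk.data.find raises LookupError when missing
        except LookupError:
            download(name)
            fetched.append(name)
    return fetched
```

Calling `ensure_nltk_resources()` once at the start of a run is then cheap when the data is already present.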

shaz13 commented 6 years ago

@jakubczakon @kamil-kaczmarek Added an option to choose whether stopwords are used, and added the corresponding nltk.download calls. Awaiting a decision on where to place the APPO dict. PTAL :)
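
A stopwords switch of the kind mentioned above might be sketched as follows. The function name `filter_tokens`, the `remove_stopwords` flag, and the five-word sample set are assumptions for illustration; in practice the set would more likely come from `nltk.corpus.stopwords.words('english')`.

```python
# Small stand-in set so the example runs without nltk data (an assumption).
SAMPLE_STOPWORDS = frozenset({"this", "is", "a", "the", "and"})


def filter_tokens(tokens, remove_stopwords=True, stopwords=SAMPLE_STOPWORDS):
    """Return tokens, optionally dropping those found in the stopword set."""
    if not remove_stopwords:
        return list(tokens)
    return [tok for tok in tokens if tok.lower() not in stopwords]
```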

kamil-kaczmarek commented 6 years ago

Hey @shaz13, I have discussed APPO with @jakubczakon, and we have decided to do it in a clean way. That is:

APPO dict goes to external_data/apostrophes.json. Note that it is a JSON file.
In steps/preprocessing.py, you can load it into a dict using the json module.
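
The two steps above could be sketched like this. The file path comes from the comment; the helper name `load_appo`, the type check, and resolving the path relative to the working directory are assumptions, not the code that was eventually merged.

```python
import json
from pathlib import Path

# Path from the comment above, resolved relative to the working directory here.
APPO_PATH = Path("external_data") / "apostrophes.json"


def load_appo(path=APPO_PATH):
    """Load the contraction mapping from a JSON file into a dict."""
    with open(path, encoding="utf-8") as handle:
        appo = json.load(handle)
    if not isinstance(appo, dict):
        raise ValueError(f"expected a JSON object in {path}")
    return appo
```

Keeping the mapping in a data file means new contractions can be added without touching steps/preprocessing.py.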

When this is done, we are ready to merge.

Sorry for iterating on this multiple times - we just want to maintain a clean implementation ;-)

Thanks!

kamil-kaczmarek commented 6 years ago

@shaz13 @jakubczakon merge done.