minerva-ml / open-solution-toxic-comments

Open solution to the Toxic Comment Classification Challenge
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
MIT License
154 stars, 58 forks

Added Functions #30

Closed. shaz13 closed this 6 years ago

shaz13 commented 6 years ago
  1. Anonymize - removes user names and IP addresses from the comment
  2. Apostrophes - expands contractions, e.g. replaces aren't with are not
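
For readers following along, here is a minimal sketch of what these two functions might look like. It is not the code from this PR: the regular expressions, the `[[User:...]]` mention format, the three-entry `APPO` sample, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; the PR's actual expressions may differ.
IP_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")   # IPv4-looking tokens
USER_PATTERN = re.compile(r"\[\[User:[^\]]*\]\]")          # wiki-style user mentions

# Tiny illustrative sample; the real APPO dict is much larger.
APPO = {"aren't": "are not", "can't": "cannot", "isn't": "is not"}


def anonymize(comment):
    """Strip IP addresses and user mentions, then collapse whitespace."""
    comment = IP_PATTERN.sub(" ", comment)
    comment = USER_PATTERN.sub(" ", comment)
    return " ".join(comment.split())


def expand_apostrophes(comment, mapping=APPO):
    """Replace each whitespace-separated contraction found in the mapping."""
    return " ".join(mapping.get(word.lower(), word) for word in comment.split())
```

Note that this sketch only matches contractions that stand alone as a token, so "aren't," with trailing punctuation would be left unchanged.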
shaz13 commented 6 years ago

Note: The nltk code uses the wordnet and stopwords corpora, which must be downloaded into the environment beforehand using nltk.download('wordnet') and nltk.download('stopwords'). Do these persist in the Neptune environment, or does each run require the download again?
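
One way to make this safe either way is to download only what is missing at startup. The sketch below assumes the resource paths `corpora/wordnet` and `corpora/stopwords`; the helper name `ensure_nltk_resources` and its `find`/`download` parameters are hypothetical and exist only so the lookup can be swapped out. It says nothing about how Neptune actually handles persistence.

```python
def ensure_nltk_resources(resources, find=None, download=None):
    """Download each NLTK resource that cannot be found locally.

    `resources` maps a lookup path (e.g. "corpora/wordnet") to the
    package name passed to the downloader (e.g. "wordnet").
    Returns the list of package names that had to be downloaded.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be exercised without nltk
        find = find or nltk.data.find
        download = download or nltk.download

    downloaded = []
    for path, name in resources.items():
        try:
            find(path)          # nltk.data.find raises LookupError if missing
        except LookupError:
            download(name)
            downloaded.append(name)
    return downloaded


# Assumed resource paths for the two corpora mentioned above.
NLTK_RESOURCES = {"corpora/wordnet": "wordnet", "corpora/stopwords": "stopwords"}
```

Calling `ensure_nltk_resources(NLTK_RESOURCES)` once before preprocessing would then work whether or not the environment keeps the data between runs.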

shaz13 commented 6 years ago

@jakubczakon @kamil-kaczmarek Added an option to choose whether or not to use stopwords, and added the corresponding nltk.download calls. Awaiting a decision on where to place the APPO dict. PTAL :)

kamil-kaczmarek commented 6 years ago

Hey @shaz13, I have discussed APPO with @jakubczakon, and we have decided to do it in a clean way. That is:

  1. The APPO dict goes to external_data/apostrophes.json. Note that it is a JSON file.
  2. In steps/preprocessing.py you can load it into a dict using the json module.
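
Roughly, the loading step could look like the sketch below. The function name `load_apostrophes` and the type check are assumptions for illustration; only the file path and the use of the json module come from the points above.

```python
import json


def load_apostrophes(path="external_data/apostrophes.json"):
    """Read the contraction mapping from a JSON file into a dict."""
    with open(path, encoding="utf-8") as f:
        mapping = json.load(f)
    # A JSON file can hold a list or scalar too, so fail early if it is not an object.
    if not isinstance(mapping, dict):
        raise ValueError("expected a JSON object in %s" % path)
    return mapping
```

The file itself would then hold a single JSON object, e.g. `{"aren't": "are not"}`.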

When this is done, we are ready to merge.

Sorry for iterating on this multiple times - we just want to maintain a clean implementation ;-)

Thanks!

kamil-kaczmarek commented 6 years ago

@shaz13 @jakubczakon merge done.