Stratified sampling for cross validation

sophieball commented 4 years ago

Naveen implemented his own version of this, but it was still a bit different:

Since the ratio of 0's to 1's is very skewed, randomly sampling a subset for each cross-validation fold (KFold in sklearn) can sometimes leave us with a fold that contains only 0's. I've replaced KFold with StratifiedKFold so that the ratio of 0's to 1's is preserved in every fold. I've also added back the hyper-parameter tuning.
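To illustrate the difference, here is a minimal sketch (not the code in this PR). It assumes toy labels with the same class counts as the report in this comment (1099 zeros, 103 ones), sorted so that the failure mode of plain KFold is easy to see; the placeholder feature matrix and the helper name `positives_per_fold` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: 1099 zeros followed by 103 ones (sorted on purpose, worst case for KFold).
y = np.array([0] * 1099 + [1] * 103)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here


def positives_per_fold(splitter):
    """Return the number of 1's that land in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]


# Plain KFold without shuffling cuts the data into contiguous blocks,
# so most folds see no positive examples at all.
plain = positives_per_fold(KFold(n_splits=5))

# StratifiedKFold keeps the class ratio roughly constant in every fold.
stratified = positives_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With shuffling, plain KFold would rarely produce an all-zero fold at this sample size, but the number of 1's per fold would still vary from run to run; stratification removes that variation.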

INFO:root:Best score: 0.9182937270873832.
INFO:root:Removing angry words towards oneself and SE words.
INFO:root:Crossvalidation score after adjustment is
              precision    recall  f1-score   support

         0.0       0.94      0.99      0.97      1099
         1.0       0.78      0.34      0.47       103

    accuracy                           0.94      1202
   macro avg       0.86      0.67      0.72      1202
weighted avg       0.93      0.94      0.92      1202
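
As a reading aid for the report above, the following sketch recomputes a few figures from the printed (rounded) values. It is only arithmetic on the numbers shown, so small rounding differences from the report are expected; the variable names are chosen for this example.

```python
# Class supports taken from the report above.
support_0, support_1 = 1099, 103
total = support_0 + support_1

# Accuracy of a trivial classifier that always predicts 0.
majority_baseline = support_0 / total

# F1 for class 1.0 is the harmonic mean of its precision and recall.
precision_1, recall_1 = 0.78, 0.34
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# The macro average gives both classes equal weight, regardless of support.
recall_0 = 0.99
macro_recall = (recall_0 + recall_1) / 2

print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"F1 for class 1.0: {f1_1:.3f}")
print(f"macro-average recall: {macro_recall:.3f}")
```

Because always predicting 0 already gives an accuracy of about 0.91 on this data, the recall and F1 for class 1.0 say more about the classifier than the overall accuracy does.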

sophieball commented 4 years ago

I'm starting to feel a little bit angry at myself now. These were just some small ML tricks.

sophieball commented 4 years ago

OK, I lied. I said I would push to upstream after the previous PR was merged, but I really want to run the code with the stratified split ASAP.

sophieball commented 4 years ago

@CaptainEmerson Can you run the new code in #56 on G's data? It will take longer than before.

CaptainEmerson commented 4 years ago

OK, I just had to reset my virtualenv.

But now:

Requirement already satisfied: en_core_web_md==2.2.5 from https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz#egg=en_core_web_md==2.2.5 in /usr/local/google/home/emersonm/myproject/lib/python3.8/site-packages (2.2.5)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')
Traceback (most recent call last):
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_classifier_g.runfiles/__main__/main/train_classifier_g.py", line 12, in <module>
    from src import suite
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_classifier_g.runfiles/__main__/src/suite.py", line 17, in <module>
    from src import create_features
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_classifier_g.runfiles/__main__/src/create_features.py", line 7, in <module>
    from src import text_modifier
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_classifier_g.runfiles/__main__/src/text_modifier.py", line 12, in <module>
    nlp = spacy.load("en_core_web_md", disable=["parser", "ner"])
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_classifier_g.runfiles/deps_pypi__spacy_2_2_4/spacy/__init__.py", line 30, in load
    return util.load_model(name, **overrides)
  File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_classifier_g.runfiles/deps_pypi__spacy_2_2_4/spacy/util.py", line 169, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_web_md'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
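
One thing that might be worth checking: pip reports the model under the virtualenv's site-packages, while the traceback shows spaCy being loaded from the Bazel runfiles directory (deps_pypi__spacy_2_2_4). If the Bazel-run interpreter does not see the virtualenv's packages, that could explain E050. This is a guess, not a confirmed diagnosis. A small sketch to test it, using only the standard library (the helper name `model_visible` is made up for the example), could be run from inside the same Bazel target:

```python
import importlib.util
import sys


def model_visible(name: str = "en_core_web_md") -> bool:
    """Return True if `name` is importable as a package by this interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Shows which interpreter is running and whether it can find the model package.
    print("interpreter:", sys.executable)
    print("en_core_web_md importable:", model_visible())
```

If this prints False under Bazel but True in the virtualenv, the model is installed somewhere the Bazel target cannot reach.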

CaptainEmerson commented 4 years ago

https://docs.google.com/document/d/1UjaZeOmyhuSlf2D2pxwkpsTRFUK58pUKquzM7P4BISU/edit

sophieball commented 4 years ago

It looks nice =D I've made a PR to Christian's fork.

sophieball / toxicity-detector

Stratified sampling for cross validation #55