sophieball closed this issue 4 years ago
I'm beginning to feel a little angry at myself now - just some small ML tricks.
Ok, I lied - I said I would push upstream after the previous PR was merged. But I really want to run the code with the stratified split ASAP.
@CaptainEmerson Can you run the new code in #56 on G's data? It will take longer than before.
Ok, I just had to reset virtualenv.
But now:
Requirement already satisfied: en_core_web_md==2.2.5 from https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz#egg=en_core_web_md==2.2.5 in /usr/local/google/home/emersonm/myproject/lib/python3.8/site-packages (2.2.5)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')
Traceback (most recent call last):
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_classifier_g.runfiles/__main__/main/train_classifier_g.py", line 12, in <module>
from src import suite
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_classifier_g.runfiles/__main__/src/suite.py", line 17, in <module>
from src import create_features
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_classifier_g.runfiles/__main__/src/create_features.py", line 7, in <module>
from src import text_modifier
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_classifier_g.runfiles/__main__/src/text_modifier.py", line 12, in <module>
nlp = spacy.load("en_core_web_md", disable=["parser", "ner"])
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_classifier_g.runfiles/deps_pypi__spacy_2_2_4/spacy/__init__.py", line 30, in load
return util.load_model(name, **overrides)
File "/usr/local/google/home/emersonm/toxicity-detector/bazel-bin/main/train_classifier_g.runfiles/deps_pypi__spacy_2_2_4/spacy/util.py", line 169, in load_model
raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_web_md'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
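The E050 error means spaCy couldn't resolve `en_core_web_md` as an installed package, shortcut link, or data directory. Note that the pip output above shows the model installed into the virtualenv's site-packages, while the traceback runs spaCy out of Bazel runfiles (`deps_pypi__spacy_2_2_4`), so the model may simply not be on the sys.path that the Bazel-built binary sees. A minimal stdlib sketch of that lookup order (a hypothetical helper, not spaCy's actual code):

```python
import importlib.util
import os

def resolve_model(name):
    """Mimic spaCy's model lookup order (simplified, hypothetical):
    1. an importable Python package with that name,
    2. a path to a model data directory.
    Raises OSError with an E050-style message otherwise.
    """
    if importlib.util.find_spec(name) is not None:
        return "package"
    if os.path.isdir(name):
        return "path"
    raise OSError(
        f"[E050] Can't find model '{name}'. It doesn't seem to be a "
        "Python package or a valid path to a data directory."
    )

print(resolve_model("json"))  # a stdlib module resolves as a package
try:
    resolve_model("en_core_web_md")  # fails unless it's on this sys.path
except OSError as e:
    print(e)
```

This is why passing an absolute path to the model data directory into `spacy.load(...)` can work even when the package name doesn't: the path check doesn't depend on sys.path.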
It looks nice =D I've made a PR to Christian's fork.
Naveen implemented his version of this, but it was still a bit different.
Since the ratio between 0s and 1s is very skewed, if we randomly sample a subset for cross-validation each time (`KFold` in sklearn), we might sometimes end up with only 0s. I've replaced `KFold` with `StratifiedKFold` so the ratio between 0s and 1s is preserved. I've also added back the hyper-parameter tuning.