Closed irinakhismatullina closed 5 years ago
Change the training dataset generation scheme.
Instead of keeping only identifiers whose tokens are in the vocabulary, we now train on a given number of the most frequent identifiers.
This still removes a lot of noise and makes training depend less blindly on the vocabulary.
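
For illustration, the difference between the two filtering schemes might be sketched as follows. This is not the code from this PR: the function names, the `_`-based token splitting, and the reading of the old filter as requiring every token to be in the vocabulary are all assumptions made for the example.

```python
from collections import Counter


def keep_in_vocabulary(identifiers, vocabulary):
    """Old scheme (assumed): keep identifiers whose tokens are all in the vocabulary.

    Splitting on "_" is a stand-in for the real token splitter.
    """
    return [
        name for name in identifiers
        if all(token in vocabulary for token in name.split("_"))
    ]


def keep_most_frequent(identifiers, n):
    """New scheme (sketch): keep occurrences of the n most frequent identifiers."""
    counts = Counter(identifiers)
    kept = {name for name, _ in counts.most_common(n)}
    # Preserve the original order and multiplicity of the surviving identifiers.
    return [name for name in identifiers if name in kept]


identifiers = [
    "get_value", "get_value", "tmp_x", "get_value",
    "set_value", "set_value", "qwe_rty",
]
by_vocabulary = keep_in_vocabulary(identifiers, {"get", "value", "tmp", "x"})
by_frequency = keep_most_frequent(identifiers, 2)
```

In this toy input the frequency-based filter drops the rare `tmp_x` and `qwe_rty` without consulting the vocabulary at all, while the vocabulary-based filter keeps `tmp_x` and drops `set_value` only because `set` happens to be missing from the vocabulary.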
@irinakhismatullina ./lookout/style/typos/preparation.py:17:1: F401 'lookout.style.typos.utils.filter_splits' imported but unused