spacy-pl / utils

Utilities for developing Polish language support for spaCy.

Repro all models on GCP with lemmatization turned off #53

Closed kowaalczyk closed 5 years ago

kowaalczyk commented 5 years ago

Summary of changes:

EDIT: Now that improved train is merged, we only need to disable lemmatization. To reproduce everything, you need the spacy version created by:

git fetch --all
git checkout pl-our-tagmap

followed by a patch to the lemmatizer that effectively disables it:

diff --git a/spacy/lang/pl/lemmatizer/lemmatizer.py b/spacy/lang/pl/lemmatizer/lemmatizer.py
index 7c07ba3a..a069c37b 100644
--- a/spacy/lang/pl/lemmatizer/lemmatizer.py
+++ b/spacy/lang/pl/lemmatizer/lemmatizer.py
@@ -16,6 +16,7 @@ class PolishLemmatizer(object):
         self.lookup_table = lookup if lookup is not None else {}

     def __call__(self, string, univ_pos, morphology=None):
+        return [string.lower()]
         if univ_pos in (NOUN, 'NOUN', 'noun'):
             univ_pos = 'noun'
         elif univ_pos in (VERB, 'VERB', 'verb'):

I'm not pushing this junk to our spacy fork on purpose.
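
For reviewers who want to see the effect of the patch without checking out the fork, here is a minimal sketch. `PatchedLemmatizerSketch` is a hypothetical stand-in written for illustration, not the real `PolishLemmatizer`; it only mirrors the constructor line and the `__call__` signature visible in the diff above, and the lookup entry is made up.

```python
class PatchedLemmatizerSketch(object):
    """Hypothetical stand-in for PolishLemmatizer with the patch applied."""

    def __init__(self, lookup=None):
        # Same initialisation as the context line shown in the diff.
        self.lookup_table = lookup if lookup is not None else {}

    def __call__(self, string, univ_pos, morphology=None):
        # The early return added by the patch: any logic that would follow
        # in the real method is never reached, so the "lemma" is just the
        # lowercased token, regardless of POS tag or lookup table.
        return [string.lower()]


# The lookup entry below is illustrative only; it is ignored after the patch.
lemmatizer = PatchedLemmatizerSketch(lookup={"Psy": "pies"})
lemmas_noun = lemmatizer("Psy", "NOUN")
lemmas_verb = lemmatizer("Biegają", "VERB")
print(lemmas_noun, lemmas_verb)
```

So "disabled" here means that every token maps to its own lowercase form, which is what the reproduced models will see as lemmas.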

@Gizzio I need your review on this, since you need it for the demo backend ASAP; it shouldn't take long. Also, make sure to check whether it closes #48 or not. I am marking it as closing #48, but we will need to do this once again after the new lemmatizer is complete.

kowaalczyk commented 5 years ago

@Gizzio I have fixed everything you wanted and rebased this against the latest spacy and utils master branches. The POS tagger and parser training were left untouched since the review; as for the NER, I have checked that the training completes successfully after these changes.

I am merging this branch now.