spacy-pl / paper

Polishing, benchmarking & other preparation for publishing our work

Lemmatization pipeline moved to new repo, researched use of Token.tag in lemmatizer #6

Open kowaalczyk opened 4 years ago

kowaalczyk commented 4 years ago

Clean up the existing pipeline and move it into the new repository. Convert the lemmatizer rules to the proposed spaCy JSON format and update the PR.

MateuszOlko commented 4 years ago

[Screenshot from 2019-09-24 09-39-32]

spaCy internals pass only the POS tag to the lemmatizer.

To use richer tags, we would have to write our own pipeline component. Suggestion: drop the idea and use POS only.

kowaalczyk commented 4 years ago

While I agree this may be too much work, I'd like to know whether the NKJP information we're trying to put in there is something we could fit into the morphology parameter. The spaCy docs say that this should also be in UD format, which is described here.

Given that most of these features are inflection-related, I believe we should be able to extract at least some of them from NKJP tags. By using more than just the UD POS we would surely get better speed (a shorter tree to search), and if our rule-to-tag assignment is correct, this would also likely give us better accuracy. That is one issue; the other is the complexity of our change:
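As a rough illustration of the extraction, here is a minimal sketch that splits a colon-separated NKJP tag and maps some of its positional values to UD-style features. The `NKJP_TO_UD` table and the `nkjp_to_ud_features` helper are hypothetical names, and the mapping is deliberately partial (number, case, gender only); it is an assumption for illustration, not a vetted conversion table.

```python
# Illustrative, partial mapping from NKJP positional values to UD features.
# Assumption: only number, case and gender are covered here.
NKJP_TO_UD = {
    "sg": {"Number": "Sing"},
    "pl": {"Number": "Plur"},
    "nom": {"Case": "Nom"},
    "gen": {"Case": "Gen"},
    "dat": {"Case": "Dat"},
    "acc": {"Case": "Acc"},
    "inst": {"Case": "Ins"},
    "loc": {"Case": "Loc"},
    "voc": {"Case": "Voc"},
    "m1": {"Gender": "Masc", "Animacy": "Hum"},
    "m2": {"Gender": "Masc", "Animacy": "Nhum"},
    "m3": {"Gender": "Masc", "Animacy": "Inan"},
    "f": {"Gender": "Fem"},
    "n": {"Gender": "Neut"},
}


def nkjp_to_ud_features(nkjp_tag):
    """Return (nkjp_class, ud_features) for a tag such as 'subst:sg:nom:m1'."""
    # The first field is the NKJP grammatical class; the rest are its values.
    nkjp_class, *values = nkjp_tag.split(":")
    features = {}
    for value in values:
        # Values without a known UD counterpart are silently skipped.
        features.update(NKJP_TO_UD.get(value, {}))
    return nkjp_class, features
```

Values that have no entry in the table are skipped rather than guessed, so the result only ever contains features the table explicitly covers.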

We'd have to introduce a class deriving from the base lemmatizer, since the base one uses spacy.lemmatizer.lemmatize, which unfortunately doesn't receive the morphology parameter from spacy.lemmatizer.Lemmatizer.__call__. The right way to do this is to define a PolishLemmatizer class deriving from spacy.lemmatizer.Lemmatizer and just reimplement the __call__ method. Then, in spacy.lang.pl.PolishDefaults, override the classmethod create_lemmatizer from spacy.language.BaseDefaults.
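The shape of that change can be sketched without importing spaCy. In the sketch below, `BaseLemmatizer` and `PolishDefaults` are simplified stand-ins, not the real spaCy classes, and the rule tables, the `morph_rules` argument and the `apply_rules` helper are hypothetical. It only shows the idea: the subclass overrides `__call__` so that morphology can select a narrower rule set, and falls back to the POS-only behaviour otherwise.

```python
def apply_rules(string, rules):
    """Apply the first matching (old_suffix, new_suffix) rule."""
    for old, new in rules:
        if string.endswith(old):
            return [string[: len(string) - len(old)] + new]
    return [string]


class BaseLemmatizer:
    """Stand-in for the base lemmatizer (NOT spaCy's actual class)."""

    def __init__(self, rules):
        # rules: {univ_pos: [(old_suffix, new_suffix), ...]}
        self.rules = rules

    def __call__(self, string, univ_pos, morphology=None):
        # As described in the thread: morphology is accepted but unused.
        return apply_rules(string, self.rules.get(univ_pos, []))


class PolishLemmatizer(BaseLemmatizer):
    """Reimplements __call__ so that morphology narrows the rule set."""

    def __init__(self, rules, morph_rules):
        super().__init__(rules)
        # morph_rules: {(univ_pos, feature, value): [(old, new), ...]}
        self.morph_rules = morph_rules

    def __call__(self, string, univ_pos, morphology=None):
        for feature, value in (morphology or {}).items():
            rules = self.morph_rules.get((univ_pos, feature, value))
            if rules:
                lemmas = apply_rules(string, rules)
                if lemmas != [string]:
                    return lemmas
        # No morphology-specific rule matched: fall back to POS-only rules.
        return super().__call__(string, univ_pos, morphology)


class PolishDefaults:
    """Stand-in for the language defaults; the rule tables are toy data."""

    rules = {"NOUN": [("ami", "")]}
    morph_rules = {("NOUN", "Case", "Ins"): [("em", "")]}

    @classmethod
    def create_lemmatizer(cls, nlp=None):
        return PolishLemmatizer(cls.rules, cls.morph_rules)


lemmatizer = PolishDefaults.create_lemmatizer()
with_morph = lemmatizer("kotem", "NOUN", {"Case": "Ins"})
without_morph = lemmatizer("kotem", "NOUN")
```

With the toy rules above, the instrumental form is only reduced when the morphology is supplied; without it, the POS-only rules leave the word unchanged.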

So, the questions I'm asking are:

kowaalczyk commented 4 years ago

Update: we're leaving the full NKJP POS tags unused for now, but let's move the pipeline to the new repo anyway, so that it can be extended when necessary.