linhd-postdata / spacy-affixes

spaCy support to split affixes for Freeling-like affixes rules and dictionaries
https://spacy-affixes.readthedocs.io
Apache License 2.0

Bug using words like 'Sube' at beginning #18

Open JavierBJ opened 4 years ago

JavierBJ commented 4 years ago

I'm using spacy-affixes as part of the spaCy pipeline, as explained in the usage guide. It had been working properly until I tried the following sentence: "Sube el paro". When calling nlp("Sube el paro.") I get the following error:

Traceback (most recent call last):
  File "/home/usuario/.local/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3319, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-21-751769ff6949>", line 1, in <module>
    nlp("Sube el paro.")
  File "/home/usuario/.local/lib/python3.6/site-packages/spacy/language.py", line 435, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/home/usuario/.local/lib/python3.6/site-packages/spacy_affixes/main.py", line 163, in __call__
    self.apply_rules(retokenizer, token, rule)
  File "/home/usuario/.local/lib/python3.6/site-packages/spacy_affixes/main.py", line 140, in apply_rules
    token, [*rule["affix_text"], token_sub], heads
  File "_retokenize.pyx", line 88, in spacy.tokens._retokenize.Retokenizer.split
ValueError: [E117] The newly split tokens must match the text of the original token. New orths: subSube. Old text: Sube.

From my experience and tries, I can say the bug happens with texts like:

nlp("Sube el paro.")
nlp("Sube")
nlp("Subir")
nlp("Subiendo")

But not with texts like:

nlp("sube el paro.")
nlp("sube")
nlp("Subasta")
nlp("Subimos")

Given the error thrown (new orths "subSube" vs. old text "Sube"), a rule matching the prefix "sub" seems to be messing things up.
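The E117 error comes from spaCy's retokenizer, which requires the new token texts to concatenate exactly to the original token text. A minimal sketch of how the mismatch can arise, assuming the affix rule matches the lowercased form but emits a lowercase prefix alongside the original capitalized token (the variable names `affix_text` and `token_sub` mirror the traceback; the exact internals of `apply_rules` are an assumption here):

```python
# Hypothetical reconstruction of the failing split for "Sube".
# spaCy's Retokenizer.split() demands that the new orths join back
# into the original token text, otherwise it raises E117.
orig = "Sube"
affix_text = ["sub"]   # prefix emitted by the rule (lowercased match)
token_sub = orig       # assumed remainder as passed to the retokenizer
new_orths = [*affix_text, token_sub]

joined = "".join(new_orths)
print(joined)          # "subSube" -- does not equal "Sube", hence E117
assert joined != orig
```

This would explain why lowercase "sube" works (the rule's output joins back to the original text) while capitalized "Sube" fails.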

My configuration

versae commented 4 years ago

Thanks for reporting, @JavierBJ!

In our experience, prefix splitting can cause more trouble than it's worth. We're looking at the problematic Freeling rule (^sub) to figure out a solution. In the meantime, you could try using only suffix rules (e.g., clitics) if that fits your scenario. We use something like this in other projects:

import spacy
from spacy_affixes import AffixesMatcher
from spacy_affixes.utils import AFFIXES_SUFFIX
from spacy_affixes.utils import load_affixes

nlp = spacy.load("es")

# Keep only the suffix rules (e.g., clitics), dropping the prefix
# rules such as the problematic "sub" one
suffixes = {k: v for k, v in load_affixes().items()
            if k.startswith(AFFIXES_SUFFIX)}
affixes_matcher = AffixesMatcher(nlp, split_on=["VERB"], rules=suffixes)
nlp.add_pipe(affixes_matcher, name="affixes", before="tagger")
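The key step above is the dict comprehension that keeps only suffix rules. A minimal sketch of that filter, using a hypothetical rules dict in place of `load_affixes()`; the real key prefix is the `AFFIXES_SUFFIX` constant from `spacy_affixes.utils`, and the value `"suffix"` here is an assumption for illustration only:

```python
# Assumed stand-in for spacy_affixes.utils.AFFIXES_SUFFIX
AFFIXES_SUFFIX = "suffix"

# Hypothetical rules; real ones come from load_affixes()
rules = {
    "suffix_lo": {"affix_text": ["lo"]},    # clitic-style suffix rule
    "prefix_sub": {"affix_text": ["sub"]},  # prefix rule we want to drop
}

# Same filter as in the workaround: keep keys that start with the
# suffix marker, discarding all prefix rules
suffixes = {k: v for k, v in rules.items()
            if k.startswith(AFFIXES_SUFFIX)}
print(sorted(suffixes))  # only the suffix rule survives
```

Passing the filtered dict as `rules=` to `AffixesMatcher` means the component never attempts a prefix split, sidestepping the E117 error entirely.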
JavierBJ commented 4 years ago

Thank you very much @versae for your workaround, it solved the problems I mentioned. I'll keep an eye out for any fix you find for the Freeling rule issue.

Kind regards