pablodms / spacy-spanish-lemmatizer

Spanish rule-based lemmatization for spaCy
MIT License

Inconsistent lemmatization #1

Closed: lmorillas closed this issue 4 years ago

lmorillas commented 4 years ago

With the PRON - VERB structure it works:

In [106]: texto = "Yo compro manzanas"
In [107]: doc = nlp(texto)
In [109]: for k in doc:
     ...:     print(k.text, k.pos_, k.lemma_)

Yo PRON yo
compro VERB comprar
manzanas ADJ manzanas

But not with a VERB + ADJ structure:

In [113]: texto = "compro manzanas"
In [114]: doc = nlp(texto)
In [115]: for k in doc:
     ...:     print(k.text, k.pos_, k.lemma_)
     ...:
compro PROPN compro
manzanas ADJ manzanas

It fails with a more complex structure too:

In [110]: texto = "Yo compro manzanas, pero a veces compro peras"
In [111]: doc = nlp(texto)
In [112]: for k in doc:
     ...:     print(k.text, k.pos_, k.lemma_)
     ...:
Yo PRON yo
compro PROPN compro
manzanas ADJ manzanas
, PUNCT ,
pero CONJ pero
a ADP a
veces NOUN vez
compro NOUN compro
peras ADJ peras

Have you tested it?

pablodms commented 4 years ago

Hello @lmorillas,

Thanks for writing.

The lemmatizer relies on a correctly inferred POS tag to extract the proper lemma. In your first example, "compro" is tagged as VERB, so it is lemmatized correctly. In the second and third examples, it is tagged as PROPN (proper noun) or NOUN, so the lemmatizer cannot derive the right lemma. The same applies to "manzanas" and "peras", which are tagged as ADJ (adjective) although both are nouns. Had they been tagged as NOUN, they would have been lemmatized as "manzana" and "pera", respectively.
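To make the dependency on the tag concrete, here is a minimal, self-contained sketch of a rule-based lemmatizer keyed on the POS tag. It does not use spaCy or this package: `TOY_RULES` and `toy_lemmatize` are hypothetical names, and the two suffix rules are toy assumptions chosen only to cover the words in this thread, not the package's actual rule set.

```python
# Toy illustration only: these suffix rules are hypothetical and are NOT
# the rules shipped with the package. Each POS tag maps to a list of
# (suffix, replacement) pairs.
TOY_RULES = {
    "VERB": [("o", "ar")],  # e.g. "compro" -> "comprar"
    "NOUN": [("s", "")],    # e.g. "manzanas" -> "manzana"
}


def toy_lemmatize(word, pos):
    """Return a lemma for `word` using only the rules registered for `pos`."""
    lowered = word.lower()
    for suffix, replacement in TOY_RULES.get(pos, []):
        if lowered.endswith(suffix):
            return lowered[: len(lowered) - len(suffix)] + replacement
    # No rule applies to this tag: fall back to the surface form unchanged.
    return word


# The same word yields different results depending on the tag it receives.
for word, pos in [("compro", "VERB"), ("compro", "PROPN"),
                  ("manzanas", "NOUN"), ("manzanas", "ADJ")]:
    print(word, pos, toy_lemmatize(word, pos))
```

With a wrong tag such as PROPN or ADJ, no rule matches and the word comes back unchanged, which is the behaviour seen in the outputs above.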

An accurate implementation of the tagger is out of the scope of this package.

I'm open to suggestions.