stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Single words tend to be over-segmented in Spanish, resulting in non-word tokens #1410

Closed adno closed 3 days ago

adno commented 1 month ago

Describe the bug Single words tend to be over-segmented in Spanish, e.g. "abundoso" is split into "abundos" (noun) + "o" (conjunction) when it is the only input to the Spanish tokenize,mwt pipeline. What's particularly problematic is that the first token is typically a non-existent word.

To Reproduce Run the following code:

from collections import Counter
import stanza
pos_nlp = stanza.Pipeline(lang='es', processors='tokenize,mwt,pos')

ws = [
    'abundoso',
    'aceitoso',
    'acetoso',
    'achacoso',
    'afectuoso',
    'afrentoso',
    'airoso',
    'alevoso',
    'algodonoso',
    'algoso',
    'amargoso',
    'amistoso',
    'amoroso',
    'ampuloso',
    'anchuroso',
    'andrajoso',
    'anginoso',
    'anguloso',
    'anheloso',
    'animoso',
    'aparatoso',
    'apestoso',
    'apetitoso',
    'arcilloso',
    'ardoroso',
    'arenoso',
    'argentoso',
    'arterioso',
    'asombroso',
    'asqueroso',
    'avaricioso',
    'azaroso',
    'añoso',
    'baboso',
    'barroso',
    'belicoso',
    'bituminoso',
    'bochornoso',
    'bondadoso',
    'brumoso',
    'bulboso',
    'bullicioso',
    'caballeroso',
    'calamitoso',
    'calimoso',
    'calinoso',
    'calloso',
    'calmoso',
    'caluroso',
    'canceroso',
    'candoroso',
    'canoso',
    'caprichoso',
    'carbonoso',
    'cariñoso',
    'carnoso',
    'cascajoso',
    'caudaloso',
    'cauteloso',
    'cavernoso',
    'caviloso',
    'cenagoso',
    'cerdoso',
    'chismoso',
    'chistoso',
    'clamoroso',
    'cochambroso',
    'comatoso',
    'compendioso',
    'corchoso',
    'cremoso',
    'cuarzoso',
    'cuidadoso',
    'dadivoso',
    'dañoso',
    'deleitoso',
    'desdeñoso',
    'desventajoso',
    'dichoso',
    'dificultoso',
    'disgustoso',
    'dispendioso',
    'doloroso',
    'doloso',
    'donoso',
    'dudoso',
    'edematoso',
    'embarazoso',
    'embrolloso',
    'empachoso',
    'empalagoso',
    'endovenoso',
    'enfadoso',
    'engañoso',
    'engorroso',
    'enjundioso',
    'enojoso',
    'enredoso',
    'escabroso',
    'escandaloso',
    'escatimoso',
    'escrupuloso',
    'espantoso',
    'espirituoso',
    'esplendoroso',
    'esponjoso',
    'espumoso',
    'esquistoso',
    'estorboso',
    'estrepitoso',
    'estropajoso',
    'estruendoso',
    'estudioso',
    'excrementoso',
    'exitoso',
    'extremoso',
    'fabuloso',
    'fachendoso',
    'fachoso',
    'facineroso',
    'fastidioso',
    'fatigoso',
    'ferruginoso',
    'fervoroso',
    'fibroso',
    'filamentoso',
    'flatoso',
    'forzoso',
    'fragoroso',
    'fructuoso',
    'fungoso',
    'gajoso',
    'ganchoso',
    'gangrenoso',
    'ganoso',
    'garboso',
    'gargajoso',
    'gaseoso',
    'gastoso',
    'gelatinoso',
    'generoso',
    'giboso',
    'globuloso',
    'glorioso',
    'glutinoso',
    'gomoso',
    'gorgojoso',
    'gozoso',
    'gramoso',
    'granoso',
    'granuloso',
    'grasoso',
    'gravoso',
    'gredoso',
    'grumoso',
    'guardoso',
    'gusanoso',
    'gustoso',
    'habilidoso',
    'hacendoso',
    'harinoso',
    'herboso',
    'hermoso',
    'herrumbroso',
    'hilachoso',
    'hipogloso',
    'hiposo',
    'hojoso',
    'honroso',
    'hormigoso',
    'horroroso',
    'hoyoso',
    'huesoso',
    'humoso',
    'ignominioso',
    'imperioso',
    'impetuoso',
    'incestuoso',
    'indecoroso',
    'industrioso',
    'infructuoso',
    'ingenioso',
    'injurioso',
    'insidioso',
    'intravenoso',
    'irrespetuoso',
    'jabonoso',
    'jubiloso',
    'jugoso',
    'juncoso',
    'laborioso',
    'lacrimoso',
    'ladrilloso',
    'lagrimoso',
    'lamentoso',
    'lamoso',
    'lastimoso',
    'latoso',
    'lechoso',
    'leguminoso',
    'letargoso',
    'leñoso',
    'libidinoso',
    'licoroso',
    'ligamentoso',
    'limoso',
    'lloroso',
    'loboso',
    'lodoso',
    'luctuoso',
    'lujoso',
    'lujurioso',
    'luminoso',
    'lustroso',
    'lutoso',
    'majestuoso',
    'mamoso',
    'maravilloso',
    'marchoso',
    'mareoso',
    'marmoroso',
    'mañoso',
    'medanoso',
    'medroso',
    'meduloso',
    'melindroso',
    'melodioso',
    'membranoso',
    'memorioso',
    'menesteroso',
    'mentiroso',
    'meticuloso',
    'miedoso',
    'milagroso',
    'mimoso',
    'misterioso',
    'modoso',
    'molestoso',
    'montañoso',
    'montoso',
    'montuoso',
    'morboso',
    'mostachoso',
    'mucilaginoso',
    'mucoso',
    'muermoso',
    'musculoso',
    'musgoso',
    'nauseoso',
    'neblinoso',
    'nebuloso',
    'novedoso',
    'nubloso',
    'nuboso',
    'nudoso',
    'numeroso',
    'numinoso',
    'ojeroso',
    'ojoso',
    'oleoso',
    'olivoso',
    'oloroso',
    'ominoso',
    'oneroso',
    'orgulloso',
    'ostentoso',
    'panoso',
    'pantanoso',
    'pasmoso',
    'pastoso',
    'patoso',
    'pavoroso',
    'pecaminoso',
    'pegajoso',
    'penumbroso',
    'perezoso',
    'pesaroso',
    'piadoso',
    'picajoso',
    'pingajoso',
    'piojoso',
    'piritoso',
    'pizarroso',
    'plomoso',
    'plumoso',
    'poderoso',
    'polvoroso',
    'ponzoñoso',
    'populoso',
    'portentoso',
    'presuntuoso',
    'primoroso',
    'provechoso',
    'pruriginoso',
    'pudoroso',
    'pulgoso',
    'pundonoroso',
    'puntilloso',
    'puntoso',
    'quejicoso',
    'quejoso',
    'quejumbroso',
    'quiloso',
    'quisquilloso',
    'raboso',
    'racimoso',
    'ramoso',
    'rayoso',
    'receloso',
    'rentoso',
    'resbaloso',
    'resinoso',
    'revoltoso',
    'rigoroso',
    'riguroso',
    'rizoso',
    'rocalloso',
    'roñoso',
    'rugoso',
    'ruidoso',
    'ruinoso',
    'rumboso',
    'rumoroso',
    'sabroso',
    'saleroso',
    'salitroso',
    'sarmentoso',
    'sarnoso',
    'sedoso',
    'seroso',
    'silboso',
    'silencioso',
    'silvoso',
    'sombroso',
    'soporoso',
    'sospechoso',
    'sudoroso',
    'sulfuroso',
    'suntuoso',
    'tabacoso',
    'tachoso',
    'talentoso',
    'tedioso',
    'tembloroso',
    'temeroso',
    'tempestuoso',
    'tendinoso',
    'tenebroso',
    'tiñoso',
    'todopoderoso',
    'tormentoso',
    'torrentoso',
    'tortuoso',
    'trabajoso',
    'tremoso',
    'tropezoso',
    'tuberculoso',
    'tuberoso',
    'tubuloso',
    'tumultuoso',
    'ulceroso',
    'undoso',
    'untuoso',
    'vagaroso',
    'valeroso',
    'vanidoso',
    'vaporoso',
    'varicoso',
    'varioloso',
    'vasculoso',
    'veleidoso',
    'velloso',
    'venenoso',
    'ventajoso',
    'ventoso',
    'venturoso',
    'verboso',
    'verdinoso',
    'verdoso',
    'vergonzoso',
    'verrugoso',
    'vertiginoso',
    'vesiculoso',
    'victorioso',
    'vigoroso',
    'vinagroso',
    'vinoso',
    'vistoso',
    'vituperioso',
    'voluminoso',
    'voluntarioso',
    'voluptuoso',
    'yesoso',
    'zumoso'
    ]

w2n = {w: len(pos_nlp(w).sentences[0].words) for w in ws}
print(Counter(w2n.values()))  # prints `Counter({2: 384, 3: 10})`

Expected behavior All of these are single words and should therefore each be segmented into one token. As a result we should get this output: Counter({1: 394}).
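To see which words are affected and how they are split, the counting one-liner in the reproduction script can be expanded into a small report. This is only a minimal sketch: `find_oversegmented` is a hypothetical helper name, and `nlp` is assumed to be any callable returning a Stanza-style document (with `.sentences` and `.words`), such as the `pos_nlp` pipeline defined in the script. Unlike the script, it counts words across all sentences of the result, not just the first one.

```python
def find_oversegmented(nlp, words):
    """Map each over-segmented word to the list of pieces it was split into.

    `nlp` is assumed to behave like a stanza.Pipeline: calling it on a string
    returns a document whose sentences expose a list of words with a `.text`.
    """
    report = {}
    for w in words:
        doc = nlp(w)
        # Collect pieces from every sentence, in case the word was also
        # split into more than one "sentence".
        pieces = [word.text for sent in doc.sentences for word in sent.words]
        if len(pieces) != 1:
            report[w] = pieces
    return report

# Usage with the pipeline and word list from the reproduction script:
#   for word, pieces in find_oversegmented(pos_nlp, ws).items():
#       print(word, '->', ' + '.join(pieces))
```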

Environment (please complete the following information):
OS: Linux and MacOS (reproducible on both)
Python version: python 3.11.9 (hb806964_0_cpython conda-forge)
Stanza version: current (dev)
torch: 2.3.1

Additional context I'm doing an NLP experiment where I need to tokenize/lemmatize words without context. The data are from a psycholinguistic task where no context was provided. I've found that adding a period to the words works as a workaround for most of them, but I believe the tokenizer should work reasonably well for single words as well. (Exceptions among the words listed, where adding a period doesn't help, are 'estruendoso', 'fachendoso', 'hacendoso'.)
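The period workaround described above could be wrapped roughly as follows. This is a sketch under assumptions: `tokenize_with_period` is a hypothetical helper, and `nlp` is assumed to be a Stanza-style pipeline callable such as `pos_nlp` from the reproduction script. As noted in the report, the workaround does not help for every word, so the result can still contain more than one piece.

```python
def tokenize_with_period(nlp, word):
    """Tokenize `word + '.'` and return the word texts without the added period."""
    doc = nlp(word + '.')
    pieces = [w.text for sent in doc.sentences for w in sent.words]
    # Drop only the trailing period that we appended ourselves.
    if pieces and pieces[-1] == '.':
        pieces.pop()
    return pieces

# Usage with the pipeline from the reproduction script:
#   still_split = [w for w in ws if len(tokenize_with_period(pos_nlp, w)) != 1]
```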

This issue affects other single words too, but with adjectives ending in "-oso", it seems very prominent and consistent. Other susceptible words often end with -lo (címbalo, crocodilo), -eo (machaqueo, maniqueo), -la (garla, hortícola), -le (diástole), -me (cuneiforme, adarme), -sa (mayonesa, galactosa). Again the first resulting token is typically a non-existing word (though there are some exceptions, e.g. "machaque" + "o"). Here is a longer list of examples, where the resulting tokens seem to contain a non-word (or at least a very rare word), sorted by the last two characters:

cabelludo
desheredado
abaniqueo
campanilleo
culebreo
filisteo
flanqueo
gorjeo
hebreo
lloriqueo
maniqueo
marisqueo
moqueo
nabateo
pluriempleo
politiqueo
retranqueo
sobajeo
tijereteo
amígdala
arandela
cavernícola
clientela
hipérbola
hortícola
madreperla
oropéndola
diástole
hipérbole
mercachifle
astrágalo
carambolo
carbonilo
cernícalo
chirimbolo
címbalo
clavicémbalo
codicilo
crocodilo
crótalo
dédalo
escándalo
estraperlo
júbilo
libelo
matapalo
murciégalo
níscalo
óvalo
pabilo
pétalo
róbalo
sábalo
sándalo
sépalo
tagalo
tántalo
violoncelo
zócalo
anseriforme
cordiforme
cruciforme
cuneiforme
falciforme
vermiforme
endodermo
barítono
albanesa
desdeñosa
diablesa
galactosa
leguminosa
mayonesa
sudanesa
decimocuarto
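The list above is ordered by the last two characters of each word. A minimal sketch of how such a grouping could be produced is shown here; `group_by_ending` is a hypothetical helper name, not something from the original report.

```python
from collections import defaultdict

def group_by_ending(words, n=2):
    """Group words by their last `n` characters, with endings and words sorted."""
    groups = defaultdict(list)
    for w in words:
        groups[w[-n:]].append(w)
    return {ending: sorted(members) for ending, members in sorted(groups.items())}

# Example with a few words taken from the list above.
sample = ['hebreo', 'gorjeo', 'diástole', 'mayonesa', 'galactosa']
for ending, members in group_by_ending(sample).items():
    print(ending, members)
```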

AngledLuffa commented 1 month ago

Can confirm this is a problem. Thank you for reporting.

The core of this problem is that the tokenizer expects a document to end in sentence-final punctuation, and it's going to produce sentence-final punctuation even if that makes no sense.

I tried to add a mechanism where the tokenizer would sometimes drop the final punctuation at the end of a document. It seemed to help with certain clitic pronouns, but clearly there are cases for which this does not fix the problem.

One thing we can do to ameliorate this is to add the words in question as fake "sentences" specifically to the tokenizer training data. Can you confirm that the words which have endings that resemble a clitic - me, lo, la, le - are not verbs? Anseriforme, for example, seems to be a duck, and carambolo is a starfruit, but I'm far from a Spanish expert.
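To illustrate what a single-word "sentence" might look like as training data, here is a sketch that writes one word as a one-token sentence in CoNLL-U format (ten tab-separated columns, sentences separated by a blank line). This is an assumption about the data format only: the thread does not show how the words were actually added to the Stanza training data, and `single_word_conllu` and its default `upos="X"` are hypothetical.

```python
def single_word_conllu(word, upos="X", sent_id=None):
    """Render one word as a one-token CoNLL-U sentence (illustrative only)."""
    lines = []
    if sent_id is not None:
        lines.append(f"# sent_id = {sent_id}")
    lines.append(f"# text = {word}")
    # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    fields = ["1", word, word, upos, "_", "_", "0", "root", "_", "_"]
    lines.append("\t".join(fields))
    # A blank line terminates the sentence.
    return "\n".join(lines) + "\n\n"

print(single_word_conllu("abundoso", upos="ADJ"), end="")
```

The lemma column simply repeats the word form, which matches the statement later in the thread that the listed words are in their lemma form.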

adno commented 1 month ago

All the words (the whole words before segmentation) are valid words from a linguistic database, although some may be rare: as you have said, anseriforme is an order of birds (actually Latin), carambolo is a star fruit etc.

So the important question is whether the words after segmentation would be valid words or not. I do not know for sure (I'm also not a Spanish expert and I haven't used a dictionary), but the probability of them being valid words is extremely low:

  1. All the words from the second list are valid words from SPALEX, whose authors claim to have "discarded proper nouns and inflected forms of nouns, verbs and adjectives, as well as other compound words", which in my interpretation means that only infinitives of reflexive verbs (ending in "se", e.g. "suicidarse", "resentirse") are included, not infinitive + clitic of non-reflexive verbs (e.g. "verme"). Ambiguous items (e.g. "verme": "intestinal worm" vs. "see me") would not be desirable for lexical decision time experiments.
  2. After segmentation, all the leading tokens (e.g. "anserifor" in "anseriforme") of the words in the second list have zero frequency in a subtitle-based 160+M token corpus I am building (not released yet), which almost guarantees they are not valid verbs. I have discarded cases such as "machaqueo", where the spurious segmentation into "machaque" + "o" results in two words that both occur in the corpus and are indeed both valid words.
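The frequency check in point 2 can be sketched as follows. The corpus itself is not released, so the counts in the example are made-up placeholders, and `leading_piece_is_unattested` is a hypothetical helper name.

```python
from collections import Counter

def leading_piece_is_unattested(pieces, freq):
    """True if a word was split and its first piece never occurs in the corpus."""
    return len(pieces) > 1 and freq.get(pieces[0], 0) == 0

# Placeholder counts for illustration only, NOT real corpus frequencies.
toy_freq = Counter({'machaque': 3, 'o': 1000, 'me': 5000})

print(leading_piece_is_unattested(['anserifor', 'me'], toy_freq))  # True
print(leading_piece_is_unattested(['machaque', 'o'], toy_freq))    # False
```

A split such as "machaque" + "o" is filtered out by this check because both pieces are attested, which mirrors why such cases were discarded from the list.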

adno commented 1 month ago

And the words from the first list (ending in -oso) are valid words from SPALEX as well. In their case it's clear that there is no clitic. In fact, the tokenizer in most cases splits the final "o" as a CCONJ. I don't know much about Spanish orthography, but I thought that conjunctions are always separated by spaces.

adno commented 1 month ago

I understand that this is a model, and it's only as good as its training data. At the same time, tweets, comments, subtitles, various user inputs etc. do not always end with punctuation, and NLP tools still should be able to process them correctly. So I appreciate that you are looking into how it could be fixed!

AngledLuffa commented 1 month ago

Alright, sounds good. Are all the oso words adjectives? Also, are you able to split up the second list of words by their UPOS - noun, adj, etc?

I find that building a tokenizer with only half of the words isn't sufficient - the others all wind up being incorrectly chopped up. So, I built one with all of the oso words. It doesn't seem to hurt the performance of everything else. That's now the dev branch default model, so hopefully you can see an improvement there.

Let me know about the other words, or perhaps I can automate it in some way (but easier for me if you already have the answer).

For the record, the pos & depparse are separate from the tokenizer, hence treating o as a CCONJ once it's been incorrectly split off.

AngledLuffa commented 1 month ago

I went through some of the words myself, found in the process that there was no free version of Spanish WordNet that I could find, and gave up on giving them all set tags. I pushed the newest version of the tokenizer with those words added as non-tokenized segments, so hopefully that works better for you.

adno commented 4 weeks ago

Sorry for taking time to reply. According to the SPALEX paper mentioned earlier, all the words are extracted either from EsPal or BuscaPalabras. The latter seems to be defunct, but the former is still available and has a page for searching for word POS, lemma etc. I found all words from both of the lists in EsPal.

Are all the oso words adjectives?

Yes, all of them are adjectives. Some can also be interpreted as nouns (see below), but considering them adjectives when there is no other context probably makes more sense.

Below I list all of the words with the POS (comma-separated) found in EsPal (not UD POS). I'm omitting cases where the token can also be an inflected form of another word (i.e. in all the cases below the words are in their lemma form):

word    POS
abaniqueo   NOUN
abundoso    ADJECTIVE
aceitoso    ADJECTIVE,NOUN
acetoso ADJECTIVE
achacoso    ADJECTIVE
afectuoso   ADJECTIVE
afrentoso   ADJECTIVE,NOUN
airoso  ADJECTIVE,NOUN
albanesa    NOUN
alevoso ADJECTIVE,NOUN
algodonoso  ADJECTIVE
algoso  ADJECTIVE,NOUN
amargoso    ADJECTIVE
amistoso    ADJECTIVE,NOUN
amoroso ADJECTIVE,NOUN
ampuloso    ADJECTIVE
amígdala    NOUN
anchuroso   ADJECTIVE,NOUN
andrajoso   ADJECTIVE
anginoso    ADJECTIVE
anguloso    ADJECTIVE
anheloso    ADJECTIVE
animoso ADJECTIVE,NOUN
anseriforme ADJECTIVE,NOUN
aparatoso   ADJECTIVE,NOUN
apestoso    ADJECTIVE,NOUN
apetitoso   ADJECTIVE
arandela    NOUN
arcilloso   ADJECTIVE
ardoroso    ADJECTIVE,NOUN
arenoso ADJECTIVE,NOUN
argentoso   ADJECTIVE
arterioso   ADJECTIVE
asombroso   ADJECTIVE,NOUN
asqueroso   ADJECTIVE,NOUN
astrágalo   NOUN
avaricioso  ADJECTIVE
azaroso ADJECTIVE,NOUN
añoso   ADJECTIVE,NOUN
baboso  ADJECTIVE,NOUN
barroso ADJECTIVE,NOUN
barítono    ADJECTIVE,NOUN
belicoso    ADJECTIVE,NOUN
bituminoso  ADJECTIVE,NOUN
bochornoso  ADJECTIVE,NOUN
bondadoso   ADJECTIVE,NOUN
brumoso ADJECTIVE,NOUN
bulboso ADJECTIVE
bullicioso  ADJECTIVE,NOUN
caballeroso ADJECTIVE
cabelludo   ADJECTIVE,NOUN
calamitoso  ADJECTIVE
calimoso    ADJECTIVE
calinoso    ADJECTIVE,NOUN
calloso ADJECTIVE,NOUN
calmoso ADJECTIVE
caluroso    ADJECTIVE,NOUN
campanilleo NOUN
canceroso   ADJECTIVE
candoroso   ADJECTIVE
canoso  ADJECTIVE,NOUN
caprichoso  ADJECTIVE,NOUN
carambolo   NOUN
carbonilo   NOUN
carbonoso   ADJECTIVE
cariñoso    ADJECTIVE,NOUN
carnoso ADJECTIVE,NOUN
cascajoso   ADJECTIVE
caudaloso   ADJECTIVE
cauteloso   ADJECTIVE,NOUN
cavernoso   ADJECTIVE
cavernícola ADJECTIVE,NOUN
caviloso    ADJECTIVE,NOUN
cenagoso    ADJECTIVE
cerdoso ADJECTIVE
cernícalo   NOUN
chirimbolo  NOUN
chismoso    ADJECTIVE,NOUN
chistoso    ADJECTIVE,NOUN
clamoroso   ADJECTIVE,NOUN
clavicémbalo    NOUN
clientela   NOUN
cochambroso ADJECTIVE,NOUN
codicilo    NOUN
comatoso    ADJECTIVE,NOUN
compendioso ADJECTIVE
corchoso    ADJECTIVE
cordiforme  ADJECTIVE,NOUN
cremoso ADJECTIVE
crocodilo   NOUN
cruciforme  ADJECTIVE,NOUN
crótalo NOUN
cuarzoso    ADJECTIVE
cuidadoso   ADJECTIVE,NOUN
culebreo    NOUN
cuneiforme  ADJECTIVE,NOUN
címbalo NOUN
dadivoso    ADJECTIVE,NOUN
dañoso  ADJECTIVE,NOUN
decimocuarto    ADJECTIVE,NOUN
deleitoso   ADJECTIVE,NOUN
desdeñosa   NOUN
desdeñoso   ADJECTIVE,NOUN
desheredado ADJECTIVE,NOUN
desventajoso    ADJECTIVE
diablesa    NOUN
dichoso ADJECTIVE,NOUN
dificultoso ADJECTIVE
disgustoso  ADJECTIVE
dispendioso ADJECTIVE
diástole    NOUN
doloroso    ADJECTIVE,NOUN
doloso  ADJECTIVE,NOUN
donoso  ADJECTIVE,NOUN
dudoso  ADJECTIVE,NOUN
dédalo  NOUN
edematoso   ADJECTIVE
embarazoso  ADJECTIVE
embrolloso  ADJECTIVE
empachoso   ADJECTIVE
empalagoso  ADJECTIVE,NOUN
endodermo   NOUN
endovenoso  ADJECTIVE
enfadoso    ADJECTIVE
engañoso    ADJECTIVE,NOUN
engorroso   ADJECTIVE
enjundioso  ADJECTIVE
enojoso ADJECTIVE
enredoso    ADJECTIVE
escabroso   ADJECTIVE
escandaloso ADJECTIVE,NOUN
escatimoso  ADJECTIVE
escrupuloso ADJECTIVE,NOUN
escándalo   NOUN
espantoso   ADJECTIVE,NOUN
espirituoso ADJECTIVE
esplendoroso    ADJECTIVE,NOUN
esponjoso   ADJECTIVE,NOUN
espumoso    ADJECTIVE,NOUN
esquistoso  ADJECTIVE
estorboso   ADJECTIVE
estraperlo  NOUN
estrepitoso ADJECTIVE,NOUN
estropajoso ADJECTIVE
estruendoso ADJECTIVE
estudioso   ADJECTIVE,NOUN
excrementoso    ADJECTIVE
exitoso ADJECTIVE,NOUN
extremoso   ADJECTIVE
fabuloso    ADJECTIVE,NOUN
fachendoso  ADJECTIVE,NOUN
fachoso ADJECTIVE,NOUN
facineroso  ADJECTIVE,NOUN
falciforme  ADJECTIVE
fastidioso  ADJECTIVE,NOUN
fatigoso    ADJECTIVE
ferruginoso ADJECTIVE
fervoroso   ADJECTIVE,NOUN
fibroso ADJECTIVE
filamentoso ADJECTIVE
filisteo    ADJECTIVE,NOUN
flanqueo    NOUN
flatoso ADJECTIVE
forzoso ADJECTIVE,NOUN
fragoroso   ADJECTIVE
fructuoso   ADJECTIVE,NOUN
fungoso ADJECTIVE
gajoso  ADJECTIVE
galactosa   NOUN
ganchoso    ADJECTIVE,NOUN
gangrenoso  ADJECTIVE
ganoso  ADJECTIVE,NOUN
garboso ADJECTIVE,NOUN
gargajoso   ADJECTIVE
gaseoso ADJECTIVE,NOUN
gastoso ADJECTIVE
gelatinoso  ADJECTIVE
generoso    ADJECTIVE,NOUN
giboso  ADJECTIVE,NOUN
globuloso   ADJECTIVE
glorioso    ADJECTIVE,NOUN
glutinoso   ADJECTIVE
gomoso  ADJECTIVE,NOUN
gorgojoso   ADJECTIVE
gorjeo  NOUN
gozoso  ADJECTIVE,NOUN
gramoso ADJECTIVE
granoso ADJECTIVE
granuloso   ADJECTIVE
grasoso ADJECTIVE,NOUN
gravoso ADJECTIVE
gredoso ADJECTIVE
grumoso ADJECTIVE
guardoso    ADJECTIVE
gusanoso    ADJECTIVE
gustoso ADJECTIVE,NOUN
habilidoso  ADJECTIVE
hacendoso   ADJECTIVE
harinoso    ADJECTIVE,NOUN
hebreo  ADJECTIVE,NOUN
herboso ADJECTIVE,NOUN
hermoso ADJECTIVE,NOUN
herrumbroso ADJECTIVE,NOUN
hilachoso   ADJECTIVE
hipogloso   ADJECTIVE,NOUN
hiposo  ADJECTIVE
hipérbola   NOUN
hipérbole   NOUN
hojoso  ADJECTIVE
honroso ADJECTIVE,NOUN
hormigoso   ADJECTIVE
horroroso   ADJECTIVE,NOUN
hortícola   ADJECTIVE,NOUN
hoyoso  ADJECTIVE
huesoso ADJECTIVE
humoso  ADJECTIVE,NOUN
ignominioso ADJECTIVE
imperioso   ADJECTIVE
impetuoso   ADJECTIVE,NOUN
incestuoso  ADJECTIVE,NOUN
indecoroso  ADJECTIVE
industrioso ADJECTIVE
infructuoso ADJECTIVE
ingenioso   ADJECTIVE,NOUN
injurioso   ADJECTIVE
insidioso   ADJECTIVE,NOUN
intravenoso ADJECTIVE
irrespetuoso    ADJECTIVE
jabonoso    ADJECTIVE
jubiloso    ADJECTIVE,NOUN
jugoso  ADJECTIVE,NOUN
juncoso ADJECTIVE
júbilo  NOUN
laborioso   ADJECTIVE,NOUN
lacrimoso   ADJECTIVE
ladrilloso  ADJECTIVE
lagrimoso   ADJECTIVE
lamentoso   ADJECTIVE
lamoso  ADJECTIVE,NOUN
lastimoso   ADJECTIVE
latoso  ADJECTIVE
lechoso ADJECTIVE,NOUN
leguminosa  NOUN
leguminoso  ADJECTIVE
letargoso   ADJECTIVE
leñoso  ADJECTIVE
libelo  NOUN
libidinoso  ADJECTIVE,NOUN
licoroso    ADJECTIVE
ligamentoso ADJECTIVE
limoso  ADJECTIVE
lloriqueo   NOUN
lloroso ADJECTIVE,NOUN
loboso  ADJECTIVE
lodoso  ADJECTIVE,NOUN
luctuoso    ADJECTIVE,NOUN
lujoso  ADJECTIVE,NOUN
lujurioso   ADJECTIVE,NOUN
luminoso    ADJECTIVE,NOUN
lustroso    ADJECTIVE
lutoso  ADJECTIVE
madreperla  NOUN
majestuoso  ADJECTIVE,NOUN
mamoso  ADJECTIVE
maniqueo    ADJECTIVE,NOUN
maravilloso ADJECTIVE,NOUN
marchoso    ADJECTIVE
mareoso ADJECTIVE
marisqueo   NOUN
marmoroso   ADJECTIVE
matapalo    NOUN
mayonesa    NOUN
mañoso  ADJECTIVE,NOUN
medanoso    ADJECTIVE,NOUN
medroso ADJECTIVE,NOUN
meduloso    ADJECTIVE
melindroso  ADJECTIVE,NOUN
melodioso   ADJECTIVE,NOUN
membranoso  ADJECTIVE
memorioso   ADJECTIVE,NOUN
menesteroso ADJECTIVE,NOUN
mentiroso   ADJECTIVE,NOUN
mercachifle NOUN
meticuloso  ADJECTIVE,NOUN
miedoso ADJECTIVE,NOUN
milagroso   ADJECTIVE,NOUN
mimoso  ADJECTIVE,NOUN
misterioso  ADJECTIVE,NOUN
modoso  ADJECTIVE
molestoso   ADJECTIVE
montañoso   ADJECTIVE,NOUN
montoso ADJECTIVE,NOUN
montuoso    ADJECTIVE,NOUN
moqueo  NOUN
morboso ADJECTIVE,NOUN
mostachoso  ADJECTIVE
mucilaginoso    ADJECTIVE
mucoso  ADJECTIVE
muermoso    ADJECTIVE
murciégalo  NOUN
musculoso   ADJECTIVE,NOUN
musgoso ADJECTIVE
nabateo ADJECTIVE,NOUN
nauseoso    ADJECTIVE
neblinoso   ADJECTIVE,NOUN
nebuloso    ADJECTIVE,NOUN
novedoso    ADJECTIVE,NOUN
nubloso ADJECTIVE,NOUN
nuboso  ADJECTIVE,NOUN
nudoso  ADJECTIVE
numeroso    ADJECTIVE,NOUN
numinoso    ADJECTIVE,NOUN
níscalo NOUN
ojeroso ADJECTIVE
ojoso   ADJECTIVE
oleoso  ADJECTIVE,NOUN
olivoso ADJECTIVE
oloroso ADJECTIVE,NOUN
ominoso ADJECTIVE
oneroso ADJECTIVE,NOUN
orgulloso   ADJECTIVE,NOUN
oropéndola  NOUN
ostentoso   ADJECTIVE
pabilo  NOUN
panoso  ADJECTIVE,NOUN
pantanoso   ADJECTIVE,NOUN
pasmoso ADJECTIVE
pastoso ADJECTIVE
patoso  ADJECTIVE,NOUN
pavoroso    ADJECTIVE,NOUN
pecaminoso  ADJECTIVE,NOUN
pegajoso    ADJECTIVE,NOUN
penumbroso  ADJECTIVE
perezoso    ADJECTIVE,ADVERB,NOUN
pesaroso    ADJECTIVE
piadoso ADJECTIVE,NOUN
picajoso    ADJECTIVE,NOUN
pingajoso   ADJECTIVE
piojoso ADJECTIVE,NOUN
piritoso    ADJECTIVE
pizarroso   ADJECTIVE,NOUN
plomoso ADJECTIVE
plumoso ADJECTIVE
pluriempleo NOUN
poderoso    ADJECTIVE,NOUN
politiqueo  NOUN
polvoroso   ADJECTIVE
ponzoñoso   ADJECTIVE
populoso    ADJECTIVE,NOUN
portentoso  ADJECTIVE,NOUN
presuntuoso ADJECTIVE,NOUN
primoroso   ADJECTIVE,NOUN
provechoso  ADJECTIVE,NOUN
pruriginoso ADJECTIVE
pudoroso    ADJECTIVE,NOUN
pulgoso ADJECTIVE,NOUN
pundonoroso ADJECTIVE,NOUN
puntilloso  ADJECTIVE
puntoso ADJECTIVE,NOUN
pétalo  NOUN
quejicoso   ADJECTIVE
quejoso ADJECTIVE
quejumbroso ADJECTIVE
quiloso ADJECTIVE
quisquilloso    ADJECTIVE,NOUN
raboso  ADJECTIVE,NOUN
racimoso    ADJECTIVE
ramoso  ADJECTIVE
rayoso  ADJECTIVE
receloso    ADJECTIVE,NOUN
rentoso ADJECTIVE
resbaloso   ADJECTIVE
resinoso    ADJECTIVE,NOUN
retranqueo  NOUN
revoltoso   ADJECTIVE,NOUN
rigoroso    ADJECTIVE
riguroso    ADJECTIVE,NOUN
rizoso  ADJECTIVE,NOUN
rocalloso   ADJECTIVE
roñoso  ADJECTIVE
rugoso  ADJECTIVE,NOUN
ruidoso ADJECTIVE,NOUN
ruinoso ADJECTIVE
rumboso ADJECTIVE,NOUN
rumoroso    ADJECTIVE,NOUN
róbalo  NOUN
sabroso ADJECTIVE,NOUN
saleroso    ADJECTIVE
salitroso   ADJECTIVE,NOUN
sarmentoso  ADJECTIVE
sarnoso ADJECTIVE,NOUN
sedoso  ADJECTIVE,NOUN
seroso  ADJECTIVE
silboso ADJECTIVE
silencioso  ADJECTIVE,NOUN
silvoso ADJECTIVE
sobajeo NOUN
sombroso    ADJECTIVE
soporoso    ADJECTIVE
sospechoso  ADJECTIVE,NOUN
sudanesa    NOUN
sudoroso    ADJECTIVE
sulfuroso   ADJECTIVE
suntuoso    ADJECTIVE
sábalo  NOUN
sándalo NOUN
sépalo  NOUN
tabacoso    ADJECTIVE
tachoso ADJECTIVE
tagalo  ADJECTIVE,NOUN
talentoso   ADJECTIVE,NOUN
tedioso ADJECTIVE,NOUN
tembloroso  ADJECTIVE,NOUN
temeroso    ADJECTIVE,NOUN
tempestuoso ADJECTIVE,NOUN
tendinoso   ADJECTIVE
tenebroso   ADJECTIVE,NOUN
tijereteo   NOUN
tiñoso  ADJECTIVE,NOUN
todopoderoso    ADJECTIVE,NOUN
tormentoso  ADJECTIVE,NOUN
torrentoso  ADJECTIVE
tortuoso    ADJECTIVE,NOUN
trabajoso   ADJECTIVE
tremoso ADJECTIVE
tropezoso   ADJECTIVE
tuberculoso ADJECTIVE,NOUN
tuberoso    ADJECTIVE,NOUN
tubuloso    ADJECTIVE
tumultuoso  ADJECTIVE,NOUN
tántalo NOUN
ulceroso    ADJECTIVE
undoso  ADJECTIVE,NOUN
untuoso ADJECTIVE
vagaroso    ADJECTIVE
valeroso    ADJECTIVE,NOUN
vanidoso    ADJECTIVE,NOUN
vaporoso    ADJECTIVE
varicoso    ADJECTIVE,NOUN
varioloso   ADJECTIVE,NOUN
vasculoso   ADJECTIVE
veleidoso   ADJECTIVE,NOUN
velloso ADJECTIVE,NOUN
venenoso    ADJECTIVE,NOUN
ventajoso   ADJECTIVE,NOUN
ventoso ADJECTIVE,NOUN
venturoso   ADJECTIVE,NOUN
verboso ADJECTIVE
verdinoso   ADJECTIVE
verdoso ADJECTIVE,NOUN
vergonzoso  ADJECTIVE,NOUN
vermiforme  ADJECTIVE
verrugoso   ADJECTIVE,NOUN
vertiginoso ADJECTIVE,NOUN
vesiculoso  ADJECTIVE
victorioso  ADJECTIVE,NOUN
vigoroso    ADJECTIVE,NOUN
vinagroso   ADJECTIVE
vinoso  ADJECTIVE,NOUN
violoncelo  NOUN
vistoso ADJECTIVE,NOUN
vituperioso ADJECTIVE
voluminoso  ADJECTIVE,NOUN
voluntarioso    ADJECTIVE,NOUN
voluptuoso  ADJECTIVE,NOUN
yesoso  ADJECTIVE
zumoso  ADJECTIVE
zócalo  NOUN
óvalo   NOUN
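Since the table above uses EsPal labels, not UD tags, a small conversion step may be useful when a single UPOS per word is needed. The sketch below is an assumption, not part of the original exchange: the mapping covers only the three labels that appear in the table, unknown labels fall back to `X`, and preferring `ADJ` for words tagged both ADJECTIVE and NOUN extends the suggestion made above for the -oso words to the whole table.

```python
# Assumed mapping from the EsPal labels seen in the table to UD UPOS tags.
ESPAL_TO_UPOS = {"ADJECTIVE": "ADJ", "NOUN": "NOUN", "ADVERB": "ADV"}

def parse_pos_rows(text):
    """Parse 'word<whitespace>POS[,POS...]' rows into {word: [UPOS, ...]}."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines and the 'word POS' header row.
        if len(parts) != 2 or parts == ["word", "POS"]:
            continue
        word, tags = parts
        table[word] = [ESPAL_TO_UPOS.get(t, "X") for t in tags.split(",")]
    return table

def default_upos(tags):
    """Pick one tag per word, preferring ADJ when it is among the candidates."""
    return "ADJ" if "ADJ" in tags else tags[0]

rows = "word    POS\nabaniqueo   NOUN\naceitoso    ADJECTIVE,NOUN\n"
for word, tags in parse_pos_rows(rows).items():
    print(word, default_upos(tags))
```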

Additionally the following tokens can be interpreted as forms of different lemmas:

If you need more information, you can use the EsPal page directly.

Thank you for your work!

AngledLuffa commented 4 weeks ago

Thank you for doing that. It will save us the time of tracking that down ourselves.

Have you had a chance to look at the new tokenizer? Hopefully it performs much better on the words you listed without sacrificing quality elsewhere.