Can confirm this is a problem. Thank you for reporting.
The core of this problem is that the tokenizer expects a document to end in sentence-final punctuation, and it will produce sentence-final punctuation even when that makes no sense.
I tried to add a mechanism where the tokenizer would sometimes drop the final punctuation at the end of a document. It seemed to help with certain clitic pronouns, but clearly there are cases for which this does not fix the problem.
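To illustrate the idea of sometimes dropping the final punctuation, here is a minimal sketch that works on raw training text. It is not Stanza's actual implementation; the function name `maybe_drop_final_punct`, the `drop_prob` parameter, and the punctuation set are assumptions made for the example.

```python
import random

# Assumed set of sentence-final punctuation marks (illustrative only).
SENTENCE_FINAL = ".!?"


def maybe_drop_final_punct(text, drop_prob=0.2, rng=random):
    """Return `text` without its trailing sentence-final punctuation
    with probability `drop_prob`; otherwise return it unchanged."""
    stripped = text.rstrip()
    if stripped and stripped[-1] in SENTENCE_FINAL and rng.random() < drop_prob:
        # Remove the punctuation mark and any whitespace that preceded it.
        return stripped[:-1].rstrip()
    return text


# Hypothetical training documents, augmented before tokenizer training.
documents = ["Dámelo.", "¿Qué es un carambolo?", "abundoso"]
augmented = [maybe_drop_final_punct(doc, drop_prob=0.5) for doc in documents]
```

With such augmentation the model sees some documents that end without punctuation, which is the situation single-word inputs create.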
One thing we can do to ameliorate this is to add the words in question as fake "sentences" specifically to the tokenizer training data. Can you confirm that the words which have endings that resemble a clitic (-me, -lo, -la, -le) are not verbs? anseriforme, for example, seems to be a duck, carambolo is a starfruit, but I'm far from a Spanish expert.
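A rough sketch of what adding words as fake "sentences" could look like, assuming the extra training data is written in CoNLL-U format with one single-token sentence per word. The helper names and the `fake` sentence-id prefix are hypothetical, and the annotation columns are left as `_`; this is not necessarily how the training data was actually built.

```python
def word_to_conllu(word, sent_id):
    """Render one word as a single-token CoNLL-U sentence."""
    # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    token_line = "\t".join(["1", word, "_", "_", "_", "_", "0", "root", "_", "_"])
    lines = [f"# sent_id = {sent_id}", f"# text = {word}", token_line]
    # Sentences in CoNLL-U are separated by a blank line.
    return "\n".join(lines) + "\n\n"


def words_to_conllu(words, prefix="fake"):
    """Concatenate one fake sentence per word."""
    return "".join(
        word_to_conllu(word, f"{prefix}-{i}") for i, word in enumerate(words, start=1)
    )


print(words_to_conllu(["anseriforme", "carambolo"]))
```

Because each word forms its own sentence with no trailing punctuation, the tokenizer gets explicit evidence that these strings are single tokens.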
All the words (the whole words before segmentation) are valid words from a linguistic database, although some may be rare: as you have said, anseriforme is an order of birds (actually Latin), carambolo is a star fruit etc.
So the important question is whether the words after segmentation would be valid words or not. I do not know for sure (I'm also not a Spanish expert and I haven't used a dictionary), but the probability of them being valid words is extremely low:
And the words from the first list (ending in -oso) are valid words from SPALEX as well. In their case it's clear that there is no clitic. In fact, the tokenizer in most cases splits the final "o" as a CCONJ. I don't know much about Spanish orthography, but I thought that conjunctions are always separated by spaces.
I understand that this is a model, and it's only as good as its training data. At the same time, tweets, comments, subtitles, various user inputs etc. do not always end with punctuation, and NLP tools still should be able to process them correctly. So I appreciate that you are looking into how it could be fixed!
Alright, sounds good. Are all the oso words adjectives? Also, are you able to split up the second list of words by their UPOS (noun, adj, etc.)?
I find that building a tokenizer with only half of the words isn't sufficient: the others all wind up being incorrectly chopped up. So, I built one with all of the oso words. It doesn't seem to hurt the performance of everything else. That's now the dev branch default model, so hopefully you can see an improvement there.
Let me know about the other words, or perhaps I can automate it in some way (but easier for me if you already have the answer).
For the record, the pos & depparse are separate from the tokenizer, hence treating o as a CCONJ once it's been incorrectly split off.
I went through some of the words myself, found in the process that there was no free version of Spanish WordNet that I could find, and gave up on giving them all set tags. I pushed the newest version of the tokenizer with those words added as non-tokenized segments, so hopefully that works better for you.
Sorry for taking time to reply. According to the SPALEX paper mentioned earlier, all the words are extracted either from EsPal or BuscaPalabras. The latter seems to be defunct, but the former is still available and has a page for searching for word POS, lemma etc. I found all words from both of the lists in EsPal.
Are all the oso words adjectives?
Yes, all of them are adjectives. Some can also be interpreted as nouns (see below), but considering them adjectives when there is no other context probably makes more sense.
Below I list all of the words with the POS (comma-separated) found in EsPal (not UD POS). I'm omitting cases where the token can also be an inflected form of another word (i.e. in all the cases below the words are in their lemma form):
word POS
abaniqueo NOUN
abundoso ADJECTIVE
aceitoso ADJECTIVE,NOUN
acetoso ADJECTIVE
achacoso ADJECTIVE
afectuoso ADJECTIVE
afrentoso ADJECTIVE,NOUN
airoso ADJECTIVE,NOUN
albanesa NOUN
alevoso ADJECTIVE,NOUN
algodonoso ADJECTIVE
algoso ADJECTIVE,NOUN
amargoso ADJECTIVE
amistoso ADJECTIVE,NOUN
amoroso ADJECTIVE,NOUN
ampuloso ADJECTIVE
amígdala NOUN
anchuroso ADJECTIVE,NOUN
andrajoso ADJECTIVE
anginoso ADJECTIVE
anguloso ADJECTIVE
anheloso ADJECTIVE
animoso ADJECTIVE,NOUN
anseriforme ADJECTIVE,NOUN
aparatoso ADJECTIVE,NOUN
apestoso ADJECTIVE,NOUN
apetitoso ADJECTIVE
arandela NOUN
arcilloso ADJECTIVE
ardoroso ADJECTIVE,NOUN
arenoso ADJECTIVE,NOUN
argentoso ADJECTIVE
arterioso ADJECTIVE
asombroso ADJECTIVE,NOUN
asqueroso ADJECTIVE,NOUN
astrágalo NOUN
avaricioso ADJECTIVE
azaroso ADJECTIVE,NOUN
añoso ADJECTIVE,NOUN
baboso ADJECTIVE,NOUN
barroso ADJECTIVE,NOUN
barítono ADJECTIVE,NOUN
belicoso ADJECTIVE,NOUN
bituminoso ADJECTIVE,NOUN
bochornoso ADJECTIVE,NOUN
bondadoso ADJECTIVE,NOUN
brumoso ADJECTIVE,NOUN
bulboso ADJECTIVE
bullicioso ADJECTIVE,NOUN
caballeroso ADJECTIVE
cabelludo ADJECTIVE,NOUN
calamitoso ADJECTIVE
calimoso ADJECTIVE
calinoso ADJECTIVE,NOUN
calloso ADJECTIVE,NOUN
calmoso ADJECTIVE
caluroso ADJECTIVE,NOUN
campanilleo NOUN
canceroso ADJECTIVE
candoroso ADJECTIVE
canoso ADJECTIVE,NOUN
caprichoso ADJECTIVE,NOUN
carambolo NOUN
carbonilo NOUN
carbonoso ADJECTIVE
cariñoso ADJECTIVE,NOUN
carnoso ADJECTIVE,NOUN
cascajoso ADJECTIVE
caudaloso ADJECTIVE
cauteloso ADJECTIVE,NOUN
cavernoso ADJECTIVE
cavernícola ADJECTIVE,NOUN
caviloso ADJECTIVE,NOUN
cenagoso ADJECTIVE
cerdoso ADJECTIVE
cernícalo NOUN
chirimbolo NOUN
chismoso ADJECTIVE,NOUN
chistoso ADJECTIVE,NOUN
clamoroso ADJECTIVE,NOUN
clavicémbalo NOUN
clientela NOUN
cochambroso ADJECTIVE,NOUN
codicilo NOUN
comatoso ADJECTIVE,NOUN
compendioso ADJECTIVE
corchoso ADJECTIVE
cordiforme ADJECTIVE,NOUN
cremoso ADJECTIVE
crocodilo NOUN
cruciforme ADJECTIVE,NOUN
crótalo NOUN
cuarzoso ADJECTIVE
cuidadoso ADJECTIVE,NOUN
culebreo NOUN
cuneiforme ADJECTIVE,NOUN
címbalo NOUN
dadivoso ADJECTIVE,NOUN
dañoso ADJECTIVE,NOUN
decimocuarto ADJECTIVE,NOUN
deleitoso ADJECTIVE,NOUN
desdeñosa NOUN
desdeñoso ADJECTIVE,NOUN
desheredado ADJECTIVE,NOUN
desventajoso ADJECTIVE
diablesa NOUN
dichoso ADJECTIVE,NOUN
dificultoso ADJECTIVE
disgustoso ADJECTIVE
dispendioso ADJECTIVE
diástole NOUN
doloroso ADJECTIVE,NOUN
doloso ADJECTIVE,NOUN
donoso ADJECTIVE,NOUN
dudoso ADJECTIVE,NOUN
dédalo NOUN
edematoso ADJECTIVE
embarazoso ADJECTIVE
embrolloso ADJECTIVE
empachoso ADJECTIVE
empalagoso ADJECTIVE,NOUN
endodermo NOUN
endovenoso ADJECTIVE
enfadoso ADJECTIVE
engañoso ADJECTIVE,NOUN
engorroso ADJECTIVE
enjundioso ADJECTIVE
enojoso ADJECTIVE
enredoso ADJECTIVE
escabroso ADJECTIVE
escandaloso ADJECTIVE,NOUN
escatimoso ADJECTIVE
escrupuloso ADJECTIVE,NOUN
escándalo NOUN
espantoso ADJECTIVE,NOUN
espirituoso ADJECTIVE
esplendoroso ADJECTIVE,NOUN
esponjoso ADJECTIVE,NOUN
espumoso ADJECTIVE,NOUN
esquistoso ADJECTIVE
estorboso ADJECTIVE
estraperlo NOUN
estrepitoso ADJECTIVE,NOUN
estropajoso ADJECTIVE
estruendoso ADJECTIVE
estudioso ADJECTIVE,NOUN
excrementoso ADJECTIVE
exitoso ADJECTIVE,NOUN
extremoso ADJECTIVE
fabuloso ADJECTIVE,NOUN
fachendoso ADJECTIVE,NOUN
fachoso ADJECTIVE,NOUN
facineroso ADJECTIVE,NOUN
falciforme ADJECTIVE
fastidioso ADJECTIVE,NOUN
fatigoso ADJECTIVE
ferruginoso ADJECTIVE
fervoroso ADJECTIVE,NOUN
fibroso ADJECTIVE
filamentoso ADJECTIVE
filisteo ADJECTIVE,NOUN
flanqueo NOUN
flatoso ADJECTIVE
forzoso ADJECTIVE,NOUN
fragoroso ADJECTIVE
fructuoso ADJECTIVE,NOUN
fungoso ADJECTIVE
gajoso ADJECTIVE
galactosa NOUN
ganchoso ADJECTIVE,NOUN
gangrenoso ADJECTIVE
ganoso ADJECTIVE,NOUN
garboso ADJECTIVE,NOUN
gargajoso ADJECTIVE
gaseoso ADJECTIVE,NOUN
gastoso ADJECTIVE
gelatinoso ADJECTIVE
generoso ADJECTIVE,NOUN
giboso ADJECTIVE,NOUN
globuloso ADJECTIVE
glorioso ADJECTIVE,NOUN
glutinoso ADJECTIVE
gomoso ADJECTIVE,NOUN
gorgojoso ADJECTIVE
gorjeo NOUN
gozoso ADJECTIVE,NOUN
gramoso ADJECTIVE
granoso ADJECTIVE
granuloso ADJECTIVE
grasoso ADJECTIVE,NOUN
gravoso ADJECTIVE
gredoso ADJECTIVE
grumoso ADJECTIVE
guardoso ADJECTIVE
gusanoso ADJECTIVE
gustoso ADJECTIVE,NOUN
habilidoso ADJECTIVE
hacendoso ADJECTIVE
harinoso ADJECTIVE,NOUN
hebreo ADJECTIVE,NOUN
herboso ADJECTIVE,NOUN
hermoso ADJECTIVE,NOUN
herrumbroso ADJECTIVE,NOUN
hilachoso ADJECTIVE
hipogloso ADJECTIVE,NOUN
hiposo ADJECTIVE
hipérbola NOUN
hipérbole NOUN
hojoso ADJECTIVE
honroso ADJECTIVE,NOUN
hormigoso ADJECTIVE
horroroso ADJECTIVE,NOUN
hortícola ADJECTIVE,NOUN
hoyoso ADJECTIVE
huesoso ADJECTIVE
humoso ADJECTIVE,NOUN
ignominioso ADJECTIVE
imperioso ADJECTIVE
impetuoso ADJECTIVE,NOUN
incestuoso ADJECTIVE,NOUN
indecoroso ADJECTIVE
industrioso ADJECTIVE
infructuoso ADJECTIVE
ingenioso ADJECTIVE,NOUN
injurioso ADJECTIVE
insidioso ADJECTIVE,NOUN
intravenoso ADJECTIVE
irrespetuoso ADJECTIVE
jabonoso ADJECTIVE
jubiloso ADJECTIVE,NOUN
jugoso ADJECTIVE,NOUN
juncoso ADJECTIVE
júbilo NOUN
laborioso ADJECTIVE,NOUN
lacrimoso ADJECTIVE
ladrilloso ADJECTIVE
lagrimoso ADJECTIVE
lamentoso ADJECTIVE
lamoso ADJECTIVE,NOUN
lastimoso ADJECTIVE
latoso ADJECTIVE
lechoso ADJECTIVE,NOUN
leguminosa NOUN
leguminoso ADJECTIVE
letargoso ADJECTIVE
leñoso ADJECTIVE
libelo NOUN
libidinoso ADJECTIVE,NOUN
licoroso ADJECTIVE
ligamentoso ADJECTIVE
limoso ADJECTIVE
lloriqueo NOUN
lloroso ADJECTIVE,NOUN
loboso ADJECTIVE
lodoso ADJECTIVE,NOUN
luctuoso ADJECTIVE,NOUN
lujoso ADJECTIVE,NOUN
lujurioso ADJECTIVE,NOUN
luminoso ADJECTIVE,NOUN
lustroso ADJECTIVE
lutoso ADJECTIVE
madreperla NOUN
majestuoso ADJECTIVE,NOUN
mamoso ADJECTIVE
maniqueo ADJECTIVE,NOUN
maravilloso ADJECTIVE,NOUN
marchoso ADJECTIVE
mareoso ADJECTIVE
marisqueo NOUN
marmoroso ADJECTIVE
matapalo NOUN
mayonesa NOUN
mañoso ADJECTIVE,NOUN
medanoso ADJECTIVE,NOUN
medroso ADJECTIVE,NOUN
meduloso ADJECTIVE
melindroso ADJECTIVE,NOUN
melodioso ADJECTIVE,NOUN
membranoso ADJECTIVE
memorioso ADJECTIVE,NOUN
menesteroso ADJECTIVE,NOUN
mentiroso ADJECTIVE,NOUN
mercachifle NOUN
meticuloso ADJECTIVE,NOUN
miedoso ADJECTIVE,NOUN
milagroso ADJECTIVE,NOUN
mimoso ADJECTIVE,NOUN
misterioso ADJECTIVE,NOUN
modoso ADJECTIVE
molestoso ADJECTIVE
montañoso ADJECTIVE,NOUN
montoso ADJECTIVE,NOUN
montuoso ADJECTIVE,NOUN
moqueo NOUN
morboso ADJECTIVE,NOUN
mostachoso ADJECTIVE
mucilaginoso ADJECTIVE
mucoso ADJECTIVE
muermoso ADJECTIVE
murciégalo NOUN
musculoso ADJECTIVE,NOUN
musgoso ADJECTIVE
nabateo ADJECTIVE,NOUN
nauseoso ADJECTIVE
neblinoso ADJECTIVE,NOUN
nebuloso ADJECTIVE,NOUN
novedoso ADJECTIVE,NOUN
nubloso ADJECTIVE,NOUN
nuboso ADJECTIVE,NOUN
nudoso ADJECTIVE
numeroso ADJECTIVE,NOUN
numinoso ADJECTIVE,NOUN
níscalo NOUN
ojeroso ADJECTIVE
ojoso ADJECTIVE
oleoso ADJECTIVE,NOUN
olivoso ADJECTIVE
oloroso ADJECTIVE,NOUN
ominoso ADJECTIVE
oneroso ADJECTIVE,NOUN
orgulloso ADJECTIVE,NOUN
oropéndola NOUN
ostentoso ADJECTIVE
pabilo NOUN
panoso ADJECTIVE,NOUN
pantanoso ADJECTIVE,NOUN
pasmoso ADJECTIVE
pastoso ADJECTIVE
patoso ADJECTIVE,NOUN
pavoroso ADJECTIVE,NOUN
pecaminoso ADJECTIVE,NOUN
pegajoso ADJECTIVE,NOUN
penumbroso ADJECTIVE
perezoso ADJECTIVE,ADVERB,NOUN
pesaroso ADJECTIVE
piadoso ADJECTIVE,NOUN
picajoso ADJECTIVE,NOUN
pingajoso ADJECTIVE
piojoso ADJECTIVE,NOUN
piritoso ADJECTIVE
pizarroso ADJECTIVE,NOUN
plomoso ADJECTIVE
plumoso ADJECTIVE
pluriempleo NOUN
poderoso ADJECTIVE,NOUN
politiqueo NOUN
polvoroso ADJECTIVE
ponzoñoso ADJECTIVE
populoso ADJECTIVE,NOUN
portentoso ADJECTIVE,NOUN
presuntuoso ADJECTIVE,NOUN
primoroso ADJECTIVE,NOUN
provechoso ADJECTIVE,NOUN
pruriginoso ADJECTIVE
pudoroso ADJECTIVE,NOUN
pulgoso ADJECTIVE,NOUN
pundonoroso ADJECTIVE,NOUN
puntilloso ADJECTIVE
puntoso ADJECTIVE,NOUN
pétalo NOUN
quejicoso ADJECTIVE
quejoso ADJECTIVE
quejumbroso ADJECTIVE
quiloso ADJECTIVE
quisquilloso ADJECTIVE,NOUN
raboso ADJECTIVE,NOUN
racimoso ADJECTIVE
ramoso ADJECTIVE
rayoso ADJECTIVE
receloso ADJECTIVE,NOUN
rentoso ADJECTIVE
resbaloso ADJECTIVE
resinoso ADJECTIVE,NOUN
retranqueo NOUN
revoltoso ADJECTIVE,NOUN
rigoroso ADJECTIVE
riguroso ADJECTIVE,NOUN
rizoso ADJECTIVE,NOUN
rocalloso ADJECTIVE
roñoso ADJECTIVE
rugoso ADJECTIVE,NOUN
ruidoso ADJECTIVE,NOUN
ruinoso ADJECTIVE
rumboso ADJECTIVE,NOUN
rumoroso ADJECTIVE,NOUN
róbalo NOUN
sabroso ADJECTIVE,NOUN
saleroso ADJECTIVE
salitroso ADJECTIVE,NOUN
sarmentoso ADJECTIVE
sarnoso ADJECTIVE,NOUN
sedoso ADJECTIVE,NOUN
seroso ADJECTIVE
silboso ADJECTIVE
silencioso ADJECTIVE,NOUN
silvoso ADJECTIVE
sobajeo NOUN
sombroso ADJECTIVE
soporoso ADJECTIVE
sospechoso ADJECTIVE,NOUN
sudanesa NOUN
sudoroso ADJECTIVE
sulfuroso ADJECTIVE
suntuoso ADJECTIVE
sábalo NOUN
sándalo NOUN
sépalo NOUN
tabacoso ADJECTIVE
tachoso ADJECTIVE
tagalo ADJECTIVE,NOUN
talentoso ADJECTIVE,NOUN
tedioso ADJECTIVE,NOUN
tembloroso ADJECTIVE,NOUN
temeroso ADJECTIVE,NOUN
tempestuoso ADJECTIVE,NOUN
tendinoso ADJECTIVE
tenebroso ADJECTIVE,NOUN
tijereteo NOUN
tiñoso ADJECTIVE,NOUN
todopoderoso ADJECTIVE,NOUN
tormentoso ADJECTIVE,NOUN
torrentoso ADJECTIVE
tortuoso ADJECTIVE,NOUN
trabajoso ADJECTIVE
tremoso ADJECTIVE
tropezoso ADJECTIVE
tuberculoso ADJECTIVE,NOUN
tuberoso ADJECTIVE,NOUN
tubuloso ADJECTIVE
tumultuoso ADJECTIVE,NOUN
tántalo NOUN
ulceroso ADJECTIVE
undoso ADJECTIVE,NOUN
untuoso ADJECTIVE
vagaroso ADJECTIVE
valeroso ADJECTIVE,NOUN
vanidoso ADJECTIVE,NOUN
vaporoso ADJECTIVE
varicoso ADJECTIVE,NOUN
varioloso ADJECTIVE,NOUN
vasculoso ADJECTIVE
veleidoso ADJECTIVE,NOUN
velloso ADJECTIVE,NOUN
venenoso ADJECTIVE,NOUN
ventajoso ADJECTIVE,NOUN
ventoso ADJECTIVE,NOUN
venturoso ADJECTIVE,NOUN
verboso ADJECTIVE
verdinoso ADJECTIVE
verdoso ADJECTIVE,NOUN
vergonzoso ADJECTIVE,NOUN
vermiforme ADJECTIVE
verrugoso ADJECTIVE,NOUN
vertiginoso ADJECTIVE,NOUN
vesiculoso ADJECTIVE
victorioso ADJECTIVE,NOUN
vigoroso ADJECTIVE,NOUN
vinagroso ADJECTIVE
vinoso ADJECTIVE,NOUN
violoncelo NOUN
vistoso ADJECTIVE,NOUN
vituperioso ADJECTIVE
voluminoso ADJECTIVE,NOUN
voluntarioso ADJECTIVE,NOUN
voluptuoso ADJECTIVE,NOUN
yesoso ADJECTIVE
zumoso ADJECTIVE
zócalo NOUN
óvalo NOUN
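For anyone who wants to split the list above by part of speech automatically, a small sketch follows. It assumes the table is available as plain text with one "word POS" pair per line; the function name `group_by_pos` is made up for the example, and the labels are kept as EsPal gives them (no mapping to UD UPOS is attempted).

```python
from collections import defaultdict


def group_by_pos(table_text):
    """Group words by their (comma-separated) EsPal POS labels."""
    groups = defaultdict(list)
    for line in table_text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines and the "word POS" header row.
        if len(parts) != 2 or parts == ["word", "POS"]:
            continue
        word, pos_field = parts
        groups[tuple(pos_field.split(","))].append(word)
    return dict(groups)


# A few rows copied from the list above.
sample = """word POS
abaniqueo NOUN
abundoso ADJECTIVE
aceitoso ADJECTIVE,NOUN
perezoso ADJECTIVE,ADVERB,NOUN
"""
groups = group_by_pos(sample)
for labels, words in groups.items():
    print(",".join(labels), words)
```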
Additionally, the following tokens can be interpreted as forms of different lemmas:
albanesa can be a form of an ADJECTIVE lemma albanés
desdeñosa <- ADJECTIVE desdeñoso
desheredado <- VERB desheredar
leguminosa <- ADJECTIVE leguminoso
sudanesa <- ADJECTIVE sudanés
If you need more information, you can use the EsPal page directly.
Thank you for your work!
Thank you for doing that. It will save us the time of tracking that down ourselves.
Have you had a chance to look at the new tokenizer? Hopefully it performs much better on the words you listed without sacrificing quality elsewhere.
Describe the bug
Single words tend to be over-segmented in Spanish, e.g. "abundoso" is split as "abundos" (noun) + "o" (conjunction) if it's the only input for the Spanish tokenize,mwt pipeline. What's particularly problematic is that the first token is typically a non-existent word.
To Reproduce
Run the following code:
Expected behavior
All are single words and should therefore be segmented into one token. As a result we should get this output: Counter({1: 394}).
Environment (please complete the following information):
OS: Linux and MacOS (reproducible on both)
Python version: python 3.11.9 (hb806964_0_cpython conda-forge)
Stanza version: current (dev)
torch: 2.3.1
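A check like the expected output above can be computed roughly as follows. This is a sketch, not the reporter's original script: `segment_counts` and `make_stanza_tokenizer` are hypothetical helper names, and counting `sent.words` (rather than tokens) is an assumption. The Stanza pipeline is only built when `make_stanza_tokenizer()` is called, so the counting logic can be tried with any tokenizer function.

```python
from collections import Counter


def segment_counts(words, tokenize):
    """Count how many segments the tokenizer produces for each input word."""
    return Counter(len(tokenize(word)) for word in words)


def make_stanza_tokenizer():
    """Build a tokenizer function backed by the Spanish tokenize,mwt pipeline."""
    import stanza  # third-party; imported lazily so the rest runs without it

    nlp = stanza.Pipeline(lang="es", processors="tokenize,mwt")

    def tokenize(text):
        doc = nlp(text)
        return [word.text for sent in doc.sentences for word in sent.words]

    return tokenize


# Stand-in tokenizer so the counting logic can be exercised without Stanza.
def whitespace_tokenize(text):
    return text.split()


print(segment_counts(["abundoso", "carambolo"], whitespace_tokenize))
```

Passing `make_stanza_tokenizer()` instead of `whitespace_tokenize` would run the same count against the real pipeline, where any key other than 1 indicates over-segmentation.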
Additional context
I'm doing an NLP experiment where I need to tokenize/lemmatize words without context. The data are from a psycholinguistic task where no context was provided. I've found that adding a period to the words works as a workaround for most of them, but I believe the tokenizer should work reasonably well for single words as well. (Exceptions from the words listed, where adding a period doesn't help, are 'estruendoso', 'fachendoso', 'hacendoso'.)
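The period workaround mentioned above can be wrapped in a small helper, sketched here under the assumption that the tokenizer is available as a function returning a list of token strings. The names `tokenize_with_period` and `toy_tokenize` are hypothetical, and `toy_tokenize` merely imitates the reported behavior for demonstration.

```python
def tokenize_with_period(word, tokenize, punct="."):
    """Tokenize `word` with a period appended, then drop that period again."""
    tokens = tokenize(word + punct)
    if tokens and tokens[-1] == punct:
        tokens = tokens[:-1]
    return tokens


# Toy stand-in that mimics the reported behavior: without final punctuation
# it splits off the last character; with a final period it keeps the word whole.
def toy_tokenize(text):
    if text.endswith("."):
        return [text[:-1], "."]
    return [text[:-1], text[-1]]


print(toy_tokenize("abundoso"))
print(tokenize_with_period("abundoso", toy_tokenize))
```

As noted above, this does not help for every word, so it is a stopgap rather than a fix.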
This issue affects other single words too, but with adjectives ending in "-oso" it seems very prominent and consistent. Other susceptible words often end with -lo (címbalo, crocodilo), -eo (machaqueo, maniqueo), -la (garla, hortícola), -le (diástole), -me (cuneiforme, adarme), -sa (mayonesa, galactosa). Again, the first resulting token is typically a non-existent word (though there are some exceptions, e.g. "machaque" + "o"). Here is a longer list of examples, where the resulting tokens seem to contain a non-word (or at least a very rare word), sorted by the last two characters: