Open sassbalint opened 2 years ago
An idea for a workaround, though probably not a perfect one.
Change line 136 of emCompound.py
    wd.compound = '#'.join(split_at(wd.lemma, boundaries))
to the following:
    try:
        wd.compound = '#'.join(split_at(wd.lemma, boundaries))
    except IndexError:
        wd.compound = wd.lemma
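To try the fallback outside the module, here is a minimal standalone sketch. The `split_at` defined here is only a hypothetical stand-in (the real one in emCompound.py may behave differently), written so that it raises IndexError on an out-of-range boundary. `compound_or_lemma` is likewise an illustrative name, not part of the module.

```python
def split_at(text, boundaries):
    """Hypothetical stand-in: cut `text` before each index in `boundaries`.

    Raises IndexError for a boundary outside the string, to imitate the
    failure that the workaround is meant to catch.
    """
    parts, start = [], 0
    for boundary in sorted(boundaries):
        if boundary <= 0 or boundary >= len(text):
            raise IndexError(f"boundary {boundary} out of range for {text!r}")
        parts.append(text[start:boundary])
        start = boundary
    parts.append(text[start:])
    return parts


def compound_or_lemma(lemma, boundaries):
    """Return the '#'-joined compound, or the bare lemma if splitting fails."""
    try:
        return '#'.join(split_at(lemma, boundaries))
    except IndexError:
        # Workaround: leave the lemma unsegmented instead of crashing.
        return lemma


print(compound_or_lemma("torlósugár", [5]))
print(compound_or_lemma("tánc", [10]))
```

The fallback keeps the pipeline running but hides the cause of the bad boundaries, so it does not replace a proper fix.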
Setting the technical side (the bug) aside for the moment, what exactly do we want the output to be in these cases?
compound should be either ENSZ#meg#bízott, ENSZ-#meg#bízott, or ENSZ#-meg#bízott
.compound and see whether the remaining components are themselves compounds. For example, in tánc- és illemtanár, the token tánc- should be compound='tánc', and in torlósugár- és hiperszonikus hajtóművek, torlósugár- should be compound='torló#sugár'. (It would be nice to be able to reconstruct the whole compound in these cases, i.e. tánctanár and torlósugár-hajtómű. This sounds like an interesting small project whose results could be relevant for coordination and ellipsis researchers.) Anyway, since we can't tell by looking at anas that -megerősödve isn't really a word, I think it should also be tagged as meg#erősödik.
/ (similarly to those that contain a +) can be ignored by emCompound, as they don't seem to be valid compounds. In other words, I wouldn't want to assign the compound value viselked#ki#néz or anything like it to this spurious token.
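The preprocessing discussed above could be sketched roughly as follows. This is only an illustration under two assumptions: that dangling hyphens are stripped before the compound analysis, and that the tokens to skip are those containing / or +. The function name `prepare_for_compound_analysis` is hypothetical and does not exist in emCompound.

```python
def prepare_for_compound_analysis(token):
    """Return the form to hand to compound analysis, or None to skip the token.

    Assumptions (not confirmed by the module itself):
    - tokens containing '/' or '+' are not valid compounds and are skipped;
    - a leading or trailing hyphen (as in 'tánc-' or '-megerősödve') is
      dropped, while hyphens inside the token are left untouched.
    """
    if '/' in token or '+' in token:
        return None
    stripped = token.strip('-')
    # A token consisting only of hyphens leaves nothing to analyse.
    return stripped or None


for tok in ["tánc-", "torlósugár-", "-megerősödve", "ENSZ-megbízott", "a/b"]:
    print(tok, "->", prepare_for_compound_analysis(tok))
```

How an internal hyphen such as the one in ENSZ-megbízott should appear in the compound value is left open here, as it is in the discussion.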
Phenomenon:
Some additional examples:
Problem:
The problem is that we have constructions in which the first part has no preverb while the second part does.
Gergő, please look at this issue.