problem with preverb in the second part of a compound

sassbalint commented 2 years ago

Phenomenon:

$ echo "úgy viselkedik/kinéz ahogy" | curl -F "file=@-" http://EMTSV_SERVER/tok/morph/pos | python3 emCompound/emCompound
...
  File "../emCompound/emCompound/emCompound.py", line 136, in process_sentence
    wd.compound = '#'.join(split_at(wd.lemma, boundaries))
  File "../emCompound/emCompound/emCompound.py", line 298, in split_at
    raise IndexError(f"index beyond bounds: {indices[-1]} in {in_list}")
IndexError: In "no filename for stream" at 5: index beyond bounds: 23 in viselked/kinéz

Some additional examples:

input text	error
"úgy viselkedik/kinéz ahogy"	index beyond bounds: 23 in viselked/kinéz
"az EU- és az ENSZ-megbízott is"	index beyond bounds: 14 in ENSZ-megbíz
"töltődtem.Tehát -megerősödve kicsit- hétfőn úgy"	index beyond bounds: 12 in -megerősödik
"szerint tudnak dolgozni/kivitelezni."	index beyond bounds: 19 in dolgoz/kivitelezik

Problem:

The problem is that in these constructions the first part has no preverb, while the second part does.

input text
"úgy viselkedik/kinéz ahogy"
"az EU- és az ENSZ-megbízott is"
"töltődtem.Tehát -megerősödve kicsit- hétfőn úgy"
"szerint tudnak dolgozni/kivitelezni."

Gergő, please look at this issue.

sassbalint commented 2 years ago

An idea for a workaround, though probably not a perfect one.

Change line 136 of emCompound.py

    wd.compound = '#'.join(split_at(wd.lemma, boundaries))

to the following:

    try:
        wd.compound = '#'.join(split_at(wd.lemma, boundaries))
    except IndexError:
        wd.compound = wd.lemma
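To make the effect of this fallback concrete, here is a minimal self-contained sketch. Note that `split_at` below is only a hypothetical stand-in written for illustration (the real function in emCompound.py may behave differently), and `compound_or_lemma` is an invented helper name; the only thing taken from the traceback is that an out-of-range boundary raises `IndexError`.

```python
def split_at(in_list, indices):
    """Hypothetical stand-in for emCompound's split_at (an assumption,
    not the real implementation): cut in_list at the given ascending
    indices and raise IndexError if the last index is out of range."""
    if indices and indices[-1] > len(in_list):
        raise IndexError(f"index beyond bounds: {indices[-1]} in {in_list}")
    parts = []
    start = 0
    for idx in indices:
        parts.append(in_list[start:idx])
        start = idx
    parts.append(in_list[start:])
    return parts


def compound_or_lemma(lemma, boundaries):
    """Invented helper: apply the proposed workaround, i.e. fall back
    to the unsplit lemma when the boundaries do not fit it."""
    try:
        return '#'.join(split_at(lemma, boundaries))
    except IndexError:
        return lemma
```

With this stand-in, `compound_or_lemma("viselked/kinéz", [23])` simply returns the lemma unchanged instead of crashing, while an in-range boundary is still split as before. The downside is that the fallback silently hides the real cause, so the affected words get no compound analysis at all.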

gpetho commented 2 years ago

If we ignore the technical side (the bug) for the moment, what exactly do we want the output to be in these cases?

I would say ENSZ-megbízott (and hyphenated compounds in general) should be treated as a compound, so I think compound should be either ENSZ#meg#bízott or ENSZ-#meg#bízott or ENSZ#-meg#bízott.
-megerősödve is a spelling mistake, so it isn't really relevant for us. More generally, tokens that start or end with a hyphen are typically coordinated compound components rather than full compounds, so I would say the correct way to deal with them is to leave out the starting or final hyphen from compound and see whether the remaining components are themselves compounds. So for example in tánc- és illemtanár, the token tánc- should be compound='tánc', and in torlósugár- és hiperszonikus hajtóművek, torlósugár- should be compound='torló#sugár'. (It would be nice to be able to reconstruct the whole compound in these cases, i.e. tánctanár and torlósugár-hajtómű. This sounds like an interesting small project the results of which could be relevant for coordination and ellipsis researchers.) Anyway, since we can't tell by looking at anas that -megerősödve isn't really a word, I think it should also be tagged as meg#erősödik.
I don't know if viselkedik/kinéz should even be a single token, and it certainly isn't a compound in the normal sense as a whole. So I think all tokens that contain a / (similarly to those that contain a +) can be ignored by emCompound, as they don't seem to be valid compounds. In other words, I wouldn't want to assign the compound value of viselked#ki#néz or anything like this to this spurious token.
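The pre-filtering described above could be sketched roughly as follows. This is only an illustration of the proposal, not existing emCompound code: the function name `compound_candidate` is hypothetical, and it assumes the check runs on the plain string before boundary detection.

```python
def compound_candidate(token):
    """Hypothetical helper (not part of emCompound): return the string
    that should be analysed for compound boundaries, or None if the
    token should be skipped entirely."""
    # Tokens containing '/' or '+' are not treated as valid compounds.
    if '/' in token or '+' in token:
        return None
    # Drop leading and trailing hyphens (coordinated compound parts
    # such as "tánc-"); hyphens inside the token are kept.
    stripped = token.strip('-')
    # A token consisting only of hyphens leaves nothing to analyse.
    return stripped or None
```

So `tánc-` would be analysed as `tánc`, `-megerősödve` as `megerősödve`, `ENSZ-megbízott` would be passed on unchanged, and `viselkedik/kinéz` would be skipped. Note that `str.strip('-')` removes any number of edge hyphens, which is slightly more permissive than dropping exactly one.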

ril-lexknowrep / emCompound

problem with preverb in the second part of a compound #4