ril-lexknowrep / emCompound

An emtsv module to annotate compound boundaries
GNU Lesser General Public License v3.0

problem with preverb in the second part of a compound #4

Open sassbalint opened 2 years ago

sassbalint commented 2 years ago

Phenomenon:

$ echo "úgy viselkedik/kinéz ahogy" | curl -F "file=@-" http://EMTSV_SERVER/tok/morph/pos | python3 emCompound/emCompound
...
  File "../emCompound/emCompound/emCompound.py", line 136, in process_sentence
    wd.compound = '#'.join(split_at(wd.lemma, boundaries))
  File "../emCompound/emCompound/emCompound.py", line 298, in split_at
    raise IndexError(f"index beyond bounds: {indices[-1]} in {in_list}")
IndexError: In "no filename for stream" at 5: index beyond bounds: 23 in viselked/kinéz

Some additional examples:

| input text | error |
| --- | --- |
| "úgy viselkedik/kinéz ahogy" | index beyond bounds: 23 in viselked/kinéz |
| "az EU- és az ENSZ-megbízott is" | index beyond bounds: 14 in ENSZ-megbíz |
| "töltődtem.Tehát -megerősödve kicsit- hétfőn úgy" | index beyond bounds: 12 in -megerősödik |
| "szerint tudnak dolgozni/kivitelezni." | index beyond bounds: 19 in dolgoz/kivitelezik |

Problem:

The problem is that in these constructions the first part has no preverb while the second part does.

input text
"úgy viselkedik/kinéz ahogy"
"az EU- és az ENSZ-megbízott is"
"töltődtem.Tehát -megerősödve kicsit- hétfőn úgy"
"szerint tudnak dolgozni/kivitelezni."

Gergő, please look at this issue.

sassbalint commented 2 years ago

An idea for a workaround, though probably not a perfect one.

Change line 136 of emCompound.py

    wd.compound = '#'.join(split_at(wd.lemma, boundaries))

to the following:

    try:
        wd.compound = '#'.join(split_at(wd.lemma, boundaries))
    except IndexError:
        wd.compound = wd.lemma
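
To try the fallback outside the pipeline, here is a minimal, self-contained sketch. The `split_at` below is only a hypothetical stand-in (the real function in emCompound.py may behave differently), and `compound_or_lemma` is a helper name made up for this illustration. It assumes `boundaries` is a sorted list of character offsets.

```python
def split_at(text, indices):
    """Hypothetical stand-in for emCompound's split_at.

    Cuts `text` at the given sorted character offsets and raises
    IndexError when the last offset lies beyond the end of the string.
    """
    if indices and indices[-1] > len(text):
        raise IndexError(f"index beyond bounds: {indices[-1]} in {text}")
    parts = []
    start = 0
    for i in indices:
        parts.append(text[start:i])
        start = i
    parts.append(text[start:])
    return parts


def compound_or_lemma(lemma, boundaries):
    """Join the parts with '#'; fall back to the bare lemma on IndexError."""
    try:
        return '#'.join(split_at(lemma, boundaries))
    except IndexError:
        # Workaround: leave the word unsplit instead of crashing.
        return lemma


if __name__ == "__main__":
    print(compound_or_lemma("ablaküveg", [5]))        # boundary inside the lemma
    print(compound_or_lemma("viselked/kinéz", [23]))  # boundary beyond the lemma
```

Note that with this fallback the affected words simply get no compound annotation, so the underlying offset mismatch stays hidden.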

gpetho commented 2 years ago

Setting the technical side (the bug) aside for the moment, what exactly do we want the output to be in these cases?