Open juditacs opened 7 years ago
Option 2. has false negatives, because assimilation occurs in the suffix :(
virág+val -> virággal
I counted how many times Hungarian words exhibit low vowel lengthening, the instrumental case and lemma change (when the lemma is not a substring of the word). The methods are available here: https://github.com/juditacs/morph-segmentation/blob/master/morph_seg/preprocessing/create_sandhi_corpus.py#L43
Here are the results (word types):
phenomena | count | ratio |
---|---|---|
low_vowel_lengthening | 20709 | 0.05436331991904173 |
instrumental | 17169 | 0.04507044471920554 |
lemma_change | 40977 | 0.10756896809708692 |
all word types | 380937 |
We discussed two variations for the Hungarian input:
á --> a1
házban ---> házBAN
kerékben ---> kerékBAN
Create a sandhi corpus from morphologically analyzed Hungarian text.
I have two ideas, please let me know what you think. @e9t @kornai @DavidNemeskey