juditacs / morph-segmentation

Experimenting with supervised morphological segmentation
MIT License

Create a sandhi (assimilation) corpus #20

Open juditacs opened 7 years ago

juditacs commented 7 years ago

Create a sandhi corpus from morphologically analyzed Hungarian text.

I have two ideas. Please let me know what you think. @e9t @kornai @DavidNemeskey

  1. Take a few inflection rules that cause assimilation, such as the instrumental case, and extract the words with those inflections.
  2. Find words where the lemma is not a substring of the inflected word. I'm checking this option right now; it might introduce many false positives.
juditacs commented 7 years ago

Option 2 has false negatives, because the assimilation can occur in the suffix while the lemma stays intact as a substring :(

virág+val  ->   virággal
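
To make the false negative concrete, here is a minimal sketch of the substring check from option 2. The function name `lemma_changed` is hypothetical and is not taken from `create_sandhi_corpus.py`. The `ló`/`lovak` pair is an extra illustration that is not from this thread.

```python
def lemma_changed(lemma: str, word: str) -> bool:
    """Option 2 heuristic: flag a word if its lemma is not a substring of it."""
    return lemma not in word


# (lemma, inflected form) pairs
examples = [
    ("virág", "virággal"),  # instrumental: v -> g happens in the suffix
    ("ló", "lovak"),        # stem alternation: the lemma itself changes
]

for lemma, word in examples:
    print(lemma, word, lemma_changed(lemma, word))
```

The check flags `lovak` but misses `virággal`, because `virág` survives unchanged inside the inflected form even though the suffix has assimilated.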
juditacs commented 7 years ago

I counted how many Hungarian word types exhibit low vowel lengthening, the instrumental case, and lemma change (the lemma is not a substring of the word). The methods are available here: https://github.com/juditacs/morph-segmentation/blob/master/morph_seg/preprocessing/create_sandhi_corpus.py#L43

Here are the results (word types):

| phenomenon | count | ratio |
| --- | --- | --- |
| low_vowel_lengthening | 20709 | 0.05436331991904173 |
| instrumental | 17169 | 0.04507044471920554 |
| lemma_change | 40977 | 0.10756896809708692 |
| all word types | 380937 | |
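
The ratios are simply each count divided by the total number of word types. A small sketch that reproduces them from the figures above (the variable names are illustrative only):

```python
# Counts of word types per phenomenon, copied from the results above.
counts = {
    "low_vowel_lengthening": 20709,
    "instrumental": 17169,
    "lemma_change": 40977,
}
total_word_types = 380937

# Ratio = word types showing the phenomenon / all word types.
ratios = {name: count / total_word_types for name, count in counts.items()}

for name, ratio in ratios.items():
    print(f"{name}\t{counts[name]}\t{ratio:.4f}")
```

Whether the three categories overlap is not stated here, so the ratios should not be summed into a single coverage figure.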
juditacs commented 7 years ago

We discussed two variations for the Hungarian input:

  1. Replace accented vowels with the unaccented vowel plus a number (Prószéky code). For example, á --> a1.
  2. Create an abstract morpheme corpus. Allomorphs should be merged into a single abstract morpheme. For example:
házban ---> házBAN
kerékben ---> kerékBAN
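
A rough sketch of both variations follows. Only á --> a1 and the ban/ben --> BAN merge come from this discussion. The rest of the vowel table follows the commonly cited Prószéky convention and is an assumption here, as are the function names, the `+` boundary marker, and the handling of lowercase letters only.

```python
import unicodedata

# Assumed Prószéky-style table; only "á" -> "a1" is confirmed in this thread.
PROSZEKY = {
    "á": "a1", "é": "e1", "í": "i1", "ó": "o1", "ö": "o2",
    "ő": "o3", "ú": "u1", "ü": "u2", "ű": "u3",
}

# Hypothetical allomorph table: surface suffix -> abstract morpheme.
ABSTRACT = {"ban": "BAN", "ben": "BAN"}


def to_proszeky(text: str) -> str:
    """Variation 1: replace accented vowels with vowel + digit."""
    # NFC makes sure each accented vowel is a single code point.
    text = unicodedata.normalize("NFC", text)
    return "".join(PROSZEKY.get(ch, ch) for ch in text)


def to_abstract(segmented: str, sep: str = "+") -> str:
    """Variation 2: keep the stem, map each suffix to its abstract morpheme.

    Expects an already segmented word such as "ház+ban".
    """
    stem, *suffixes = segmented.split(sep)
    return stem + "".join(ABSTRACT.get(s, s) for s in suffixes)


print(to_proszeky("házban"))     # ha1zban
print(to_abstract("ház+ban"))    # házBAN
print(to_abstract("kerék+ben"))  # kerékBAN
```

Note that the second variation needs segmented input, since the allomorph boundary cannot be recovered from the surface form alone.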