Create a sandhi (assimilation) corpus

juditacs / morph-segmentation

Experimenting with supervised morphological segmentation

MIT License


Open juditacs opened 7 years ago

juditacs commented 7 years ago

Create a sandhi corpus from morphologically analyzed Hungarian text.

I have two ideas; please let me know what you think. @e9t @kornai @DavidNemeskey

1. Take a few inflection rules that cause assimilation, such as the instrumental case, and extract words with those inflections.
2. Find words where the lemma is not a substring of the inflected word. I'm checking this option right now; it might introduce many false positives.
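A minimal sketch of the second idea, assuming the analyzed text has already been reduced to (lemma, word) pairs. The function name `lemma_changed` and the sample pairs are hypothetical and are not taken from the repository.

```python
def lemma_changed(lemma: str, word: str) -> bool:
    """Return True when the lemma does not appear verbatim in the inflected word."""
    return lemma.lower() not in word.lower()


# Illustrative (lemma, inflected word) pairs, not corpus data.
pairs = [
    ("bokor", "bokrot"),  # stem vowel dropped: flagged
    ("alma", "almát"),    # stem-final vowel lengthened: flagged
    ("ház", "házban"),    # lemma intact: not flagged
]

# Keep only the pairs where the surface form no longer contains the lemma.
candidates = [(lemma, word) for lemma, word in pairs if lemma_changed(lemma, word)]
print(candidates)
```

The check is a plain substring test, so it knows nothing about which rule changed the stem; every flagged pair would still need to be inspected or filtered by the analysis tags.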

juditacs commented 7 years ago

Option 2 has false negatives, because the assimilation can occur in the suffix and leave the lemma intact :(

virág + val -> virággal
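A rough sketch of how such suffix-side cases could be caught for the instrumental case, assuming (lemma, word) pairs as input. The name `instrumental_assimilated` is hypothetical, and the check covers single-letter consonants only; stems ending in digraphs such as "sz" would need extra handling.

```python
VOWELS = set("aáeéiíoóöőuúüű")


def instrumental_assimilated(lemma: str, word: str) -> bool:
    """Return True if word looks like lemma + -val/-vel where the suffix-initial
    v has assimilated to the lemma-final consonant (single letters only)."""
    if not lemma or lemma[-1].lower() in VOWELS:
        # After a vowel the suffix keeps its v, so there is no assimilation.
        return False
    last = lemma[-1]
    return word in (lemma + last + "al", lemma + last + "el")


print(instrumental_assimilated("virág", "virággal"))
print(instrumental_assimilated("autó", "autóval"))
```

Here the lemma is a substring of the word, so the substring test from Option 2 would miss the pair, while the doubled stem-final consonant gives it away.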

juditacs commented 7 years ago

I counted how many Hungarian words exhibit low vowel lengthening, the instrumental case, and lemma change (where the lemma is not a substring of the word). The methods are available here: https://github.com/juditacs/morph-segmentation/blob/master/morph_seg/preprocessing/create_sandhi_corpus.py#L43

Here are the results (word types):

juditacs commented 7 years ago

We discussed two variations for the Hungarian input:

1. Replace accented vowels with the non-accented vowel plus a number (Proszeky code). For example: á --> a1
2. Create an abstract morpheme corpus. Allomorphs should be merged into a single abstract morpheme. For example:

házban --> házBAN
kerékben --> kerékBAN
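Both variations can be sketched in a few lines. Only the á --> a1 pair comes from the discussion above; the rest of the `PROSZEKY` table is my assumption about the code, and the `ABSTRACT` allomorph table and both function names are hypothetical. The second function also assumes the word is already segmented into stem and suffix.

```python
# Assumed vowel table (lowercase only); only "á" -> "a1" is confirmed above.
PROSZEKY = {
    "á": "a1", "é": "e1", "í": "i1", "ó": "o1", "ú": "u1",
    "ö": "o2", "ü": "u2", "ő": "o3", "ű": "u3",
}


def to_proszeky(text: str) -> str:
    """Replace each accented vowel with its base vowel plus a digit."""
    return "".join(PROSZEKY.get(ch, ch) for ch in text)


# Hypothetical allomorph table: surface suffix -> abstract morpheme.
ABSTRACT = {"ban": "BAN", "ben": "BAN", "val": "VAL", "vel": "VAL"}


def abstract_form(stem: str, suffix: str) -> str:
    """Join a stem with the abstract label of its suffix (unknown suffixes pass through)."""
    return stem + ABSTRACT.get(suffix, suffix)


print(to_proszeky("házban"))
print(abstract_form("ház", "ban"))
print(abstract_form("kerék", "ben"))
```

Using uppercase labels for abstract morphemes keeps them visually distinct from the stem, matching the házBAN example.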