google-research / turkish-morphology

A two-level morphological analyzer for Turkish.
Apache License 2.0

Derivational Affixes Problem #8

Closed Mukhammadsaid19 closed 2 years ago

Mukhammadsaid19 commented 2 years ago

Hello, Dear Developers!

I can't understand why some derivational affixes are universally appended to certain categories, whether or not the resulting complex word makes sense.

Example: berber + lAn + mAk = berberlenmek.

What is your motivation to allow it?

Should a morphological analyzer be concerned with the underlying semantics, or is it the job of another application to decide whether a word makes sense? Isn't generalizing such affixes the same as allowing words like "untake" or "unpay" in English?

ozturel commented 2 years ago

Thanks for your question!

I think the Turkish example you give is a bit more nuanced than, e.g. English "untake" and "unhappy".

Our motivation is to include in the affix inventory any syntactically productive suffix that also alters the semantics of the word it attaches to. That said, when the context in which a word appears is absent, it is sometimes difficult to decide whether affixation of a certain derivational morpheme (e.g. "lAn" in your example) yields a semantically sound Turkish word or not.

For instance, one might also argue that duvar + lAn + mAk = duvarlanmak does not make sense (just like your 'berberlenmek' example). Yet, with a simple Google search, we can find creative usage of the word in context.

Now, let's consider how this seemingly unrestricted analysis capacity helps us in practice. Say we are annotating morphology and PoS over a corpus of Turkish text that varies in genre (as in [1]). We would like to be able to generate analyses for creative uses of inflected word forms; since these affixes are productive, such creative uses occur in text from the web, literature, and news genres. Once we have the set of analyses for such words, a separate tagger can be trained on top, supervised by the morphology annotations. If a derivation does not occur in the corpora the tagger is trained on, its hypotheses will be expected to be pruned at prediction time.
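The overgenerate-then-prune pipeline described above can be sketched in a few lines. This is a toy illustration only, not the repo's actual API: the stems, analysis tags, and the corpus set are all invented, and real vowel harmony is faked with a two-form match.

```python
# Toy sketch of a permissive analyzer that over-generates derivations,
# paired with a corpus-driven pruning step. All names/tags are invented.

STEMS = {"berber": "NN", "duvar": "NN"}

def analyses(word):
    """Return every analysis the permissive grammar licenses for `word`."""
    results = []
    for stem, pos in STEMS.items():
        if word == stem:
            results.append(f"{stem}+{pos}")
        # Productive derivation: -lAn attaches to any noun, yields a verb.
        # Crude front/back vowel-harmony match, for illustration only.
        if word in (stem + "lanmak", stem + "lenmek"):
            results.append(f"{stem}+{pos}+lAn+VB+mAk+Inf")
    return results

# Hypotheses attested in the (tiny, invented) annotated training corpus.
seen_in_corpus = {"duvar+NN+lAn+VB+mAk+Inf"}

def prune(hypotheses):
    """Keep corpus-attested analyses; fall back to all if none attested."""
    kept = [h for h in hypotheses if h in seen_in_corpus]
    return kept or hypotheses
```

Here `analyses("berberlenmek")` still yields an analysis, and it is the downstream `prune` step (standing in for a trained tagger) that decides whether to keep it.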

I hope this helps!

[1] https://aclanthology.org/2020.lrec-1.634.pdf

Mukhammadsaid19 commented 2 years ago

Oh, I think I'm getting this. So building a big skeleton of the language and pruning parts of it when needed is a useful feature of a morphological analyzer. In Uzbek we also have -la, -lash, -lan, -lashtir, and -lantir (no vowel harmony) as productive derivational affixes. In this fashion, each productive affix should be defined as a rule in the FST... Cool. Thanks for the answer!
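The "each productive affix becomes a rule" idea can be sketched with a plain rule table standing in for FST arcs. This is a hypothetical illustration, not from either analyzer: since Uzbek has no vowel harmony here, each affix has a single surface form, and the example stem and tags are assumptions.

```python
# Each productive derivational affix becomes one "rule" (one arc in a
# would-be FST). Affix functions are omitted; tags are invented.

DERIVATIONS = ["la", "lash", "lan", "lashtir", "lantir"]  # noun -> verb

def analyze(word, noun_stems):
    """Return analyses of the form stem+NN(+affix+VB) for exact splits."""
    out = []
    for stem in noun_stems:
        if word == stem:
            out.append(f"{stem}+NN")
        for affix in DERIVATIONS:
            if word == stem + affix:
                out.append(f"{stem}+NN+{affix}+VB")
    return out
```

In a real implementation each entry would be compiled into a transducer arc (e.g. with OpenFst/Pynini) rather than matched by string concatenation, but the inventory-as-rules structure is the same.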

P.S. As for a spelling-correction application, should the analyzer be pretrained on valid dictionary entries before being used in practice? Is that achieved through weights on the edges, or with some HMM?

ozturel commented 2 years ago

I don't think there is a definitive answer on how to build spelling correctors or morphological disambiguators; it depends on how you would like to model your tagger. I'm leaving some references from the literature below as examples.

Morphological disambiguation:

Spelling correction:

Mukhammadsaid19 commented 2 years ago

Wow! That's so good! Thank you for such an elaborate response! We really appreciate it.

P.S. The reason I was asking is that a couple of friends and I are developing a morphological analyzer for Uzbek. Our language is severely underdeveloped in NLP terms compared to others (we don't even have an FSM spellchecker). In an attempt to fix this, we decided to build an analyzer, using TRmorph by Ç. Çöltekin [1], a Kazakh analyzer [2], and your analyzer as reference models. Hopefully we will soon be able to publish an FST-based morphological analyzer and make it open source. Your answers really helped us. Thank you.

[1] http://www.lrec-conf.org/proceedings/lrec2010/pdf/109_Paper.pdf
[2] https://aclanthology.org/W14-2806.pdf

ozturel commented 2 years ago

I hope it was useful! Closing this issue for now. Thanks!