timarkh / uniparser-morph

Rule-based, linguist-friendly (and rather slow) morphological analysis
MIT License

Morphologically complex clitics and clitic behavior in general #2

Closed: fmatter closed this issue 2 years ago

fmatter commented 2 years ago

While there is a lot of ongoing discussion about what a "clitic" is, in most approaches it is defined along the lines of "grammatically independent, but phonologically bound". E.g. when using the notion of p-word and g-word, a clitic is its own g-word, but not its own p-word.

This implies that a clitic belongs to a lexeme of its own, in turn meaning that clitics can potentially have inflectional morphology of their own (or have morphologically complex stems). And we do find instances of this in "the wild"; the following Tiriyó example shows two multi-morpheme enclitics in a row: the morphologically complex stem hkao 'in water', followed by the inflected form nai of the stem a(i) 'to be':

paru=hka-o=n-ai     i-pata
P.=AQU-LOC=3Sa-COP  3-village
'He lives on the Paru river.' (Meira 1999: 393)

In uniparser, I can model the postposition and the copula as lexemes like this (ignoring their phonological dependence):

-lexeme
 lex: a
 stem: .a.//.ai.
 paradigm: cop
 trans_en: be
 gloss: COP

-lexeme
 lex: hkao
 stem: hk&ao.
 gloss: AQU&LOC
 trans_en: in water
 paradigm: zero
 gramm: aqu,loc

-paradigm: cop
 -flex: n.
  gramm: 3
  gloss: 3Sa

-paradigm: zero
 -flex: .
  gramm:

where the copula has a paradigm containing n-, and the postposition is morphologically complex (&). This yields the expected hk-ao 'AQU-LOC' and n-ai '3Sa-COP'. The current implementation of clitics, however, allows neither paradigms nor complex stems; it only accepts single-morpheme clitics like the following:
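To make the & convention concrete, here is a minimal Python sketch. It is not uniparser code: pair_morphs is a hypothetical helper, and it assumes that stem and gloss contain the same number of &-separated parts and that leading or trailing dots only mark affix slots. Stem variants written with // are not handled.

```python
def pair_morphs(stem: str, gloss: str) -> list[tuple[str, str]]:
    """Pair each &-separated stem part with the matching gloss part."""
    # Assumption: dots at the edges only mark affix slots, so drop them for display.
    stem_parts = [part.strip(".") for part in stem.split("&")]
    gloss_parts = gloss.split("&")
    if len(stem_parts) != len(gloss_parts):
        raise ValueError(f"{stem!r} and {gloss!r} have different numbers of parts")
    return list(zip(stem_parts, gloss_parts))


pairs = pair_morphs("hk&ao.", "AQU&LOC")
print("-".join(morph for morph, _ in pairs))  # hk-ao
print("-".join(gl for _, gl in pairs))        # AQU-LOC
```

The length check is the useful part: a stem with two parts and a gloss with three would be reported instead of silently misaligned.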

-clitic
 lex: po
 stem: po
 type: en
 gramm: loc
 gloss: LOC
 trans_en: LOC

I would love to be able to (in addition to the simple clitic definition above) also write things like:

-clitic
 lex: a
 stem: .a.//.ai.
 type: en
 paradigm: cop
 trans_en: be
 gloss: COP

-clitic
 lex: hkao
 stem: hk&ao
 type: en
 gloss: AQU&LOC
 trans_en: in water
 gramm: aqu,loc

Of course, most clitics are morphologically simple; for these cases the current functionality where no . has to be added to the stem and no (zero) paradigm has to be defined is perfectly suitable.

Judging from the uniparser input format for clitics and lexemes, as well as from the existence of placeholders for paradigms in the Clitic class (and much shared or duplicated code), clitics are already treated more like lexemes than like inflection. This raises the question of whether it would make sense to have the Clitic class inherit from Lexeme. I don't understand enough of the inner workings of uniparser to figure this out; it's just a thought, and I strongly suspect it's not that simple.

Maybe implementing only the second use case with & would be simpler? Judging from a discussion I started on the lingtyp mailing list, languages with morphologically complex clitics only have a few of them (e.g., Tiriyó does not have countless cliticized forms of the copula, and has only a restricted number of morphologically complex postpositions).

timarkh commented 2 years ago

Thanks for your useful suggestions! What I have implemented for now is support for & in clitics described inside clitics.txt, plus a custom separator for affixes (so you can have an "affix" separated by =). The details on the latter are here.

Regarding class inheritance: you are right, it would not be that simple. I mean, just making Clitic a subclass of Lexeme would be easy, but the hard part would be describing that multiple-lexemes-in-one-token relationship in Wordform instances. I did something like that earlier with "subwords"; however, this solution looks rather ugly.

fmatter commented 2 years ago

Thank you! I've tried to implement the above Tiriyó example as follows:

clitics:

-clitic
 lex: a
 stem: n&a&i
 paradigm: cop
 trans_en: s/he is
 type: en
 gloss: 3&COP&UNCERT

lexemes:

-lexeme
 lex: paru
 stem: paru.
 gloss: P.
 trans_en: paru river
 paradigm: n
 gramm: N

paradigms:

-paradigm: cop
 -flex: n.
  gramm: 3
  gloss: 3Sa

-paradigm: n
 -flex: .hk|ao
  gramm: aqu,loc
  gloss: AQU|LOC
  sep: =
  trans_en: in water
 -flex: .
  gramm:

Output:

<Wordform object>
paruhkaonai
paru+a; N,aqu,loc
paru=hk-ao=nai
P.=AQU-LOC=3-COP-UNCERT
trans_en    s/he is
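
In this output the clitic nai is left unsegmented while its gloss has three parts. A rough way to spot such mismatches is sketched below in Python. It works on plain strings copied from the output, not on uniparser's Wordform objects; check_alignment is a hypothetical helper, and it assumes that = separates clitic groups and - separates morphemes in both lines.

```python
def check_alignment(segmentation: str, gloss: str) -> list[tuple[str, int, int]]:
    """Return (group, n_morphs, n_glosses) for each =-group whose counts differ."""
    seg_groups = segmentation.split("=")
    gloss_groups = gloss.split("=")
    if len(seg_groups) != len(gloss_groups):
        raise ValueError("segmentation and gloss have different numbers of =-groups")
    mismatches = []
    for seg, gl in zip(seg_groups, gloss_groups):
        n_morphs = len(seg.split("-"))
        n_glosses = len(gl.split("-"))
        if n_morphs != n_glosses:
            mismatches.append((seg, n_morphs, n_glosses))
    return mismatches


print(check_alignment("paru=hk-ao=nai", "P.=AQU-LOC=3-COP-UNCERT"))
# [('nai', 1, 3)]
```

An empty list means every group has as many morphs as glosses.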

timarkh commented 2 years ago

Thank you, I fixed both issues. Still no paradigms available for clitics though (so your paradigm: cop has no effect). P.S. Sorry, I just realized the link to the sep: = description was wrong, but obviously you've already found the right place.

fmatter commented 2 years ago

I just noticed the superfluous paradigm today. Thanks for the fix -- I see that IDs now work, too!

fmatter commented 2 years ago

I think my clitic needs are covered, thanks a bunch.