timarkh / uniparser-morph

Rule-based, linguist-friendly (and rather slow) morphological analysis
MIT License

How to handle productive POS-changing derivation? #6

Closed (fmatter closed this issue 1 year ago)

fmatter commented 2 years ago

Cariban languages have a lot of productive, word class changing derivational morphology. I understand that I can model it in paradigms.txt and treat it like inflectional morphology. However, I would like the grammatical tags and the lemmata to be updated accordingly.

I can see that there is a derivations.txt file for productive derivations, and a lex_rules.txt file allowing modification of lemmata. However, neither of these files is documented, and the examples I found 'in the wild' did not make things clearer for me. How does this Albanian nominalizer work? And these lex rules all seem to only add things to entries.

Here's what I would like to know how to do for Yawarana (it's fine if it's not possible, too!): the verb yarika 'to laugh at' has the grammatical tag V:

-lexeme
 lex: yarika
 stem: .yarika.
 paradigm: verb_nmlz
 paradigm: verb
 trans_en: laugh at
 gloss: laugh.at
 gramm: V

It can occur with inflectional (or, non-POS-changing) morphology, and has grammatical tags added appropriately:

-paradigm: verb
 -flex: u.<.>
  gramm: 1
  gloss: 1
 paradigm: verb_suff

-paradigm: verb_suff
 -flex: .se
  gloss: PST
  gramm: pst

Result:

<Wordform object>
uyarikase
yarika; 1,V,pst
u-yarika-se
1-laugh.at-PST
trans_en    laugh at

When it occurs with the nominalizer that I've modelled in paradigms.txt for now:

-paradigm: verb_nmlz
 -flex: .topo
  gramm: N
  gloss: CIRC.NMLZ

...then the N tag is added:

<Wordform object>
yarikatopo
yarika; N,V
yarika-topo
laugh.at-CIRC.NMLZ
trans_en    laugh at

yarikatopo means 'something to laugh at' and is a noun. How can I reflect this in the gramm field (and in lex and possibly trans_en, too)?

fmatter commented 2 years ago

Ok, so I've found the deriv-link: X functionality :)

timarkh commented 2 years ago

Sorry again for making you wait! One of the reasons is that this derivation thing is actually something half-baked, and it took me some time to remember how it works and why it does what it does.

First, a remark about lex_rules: they work just fine (unlike the derivations), but you're right, they can only add values to analyses at a post-processing stage. I actually do use them with some productive derivations, but mainly just to add a separate, non-compositional translation to a combination of a stem and an affix. E.g. in Udmurt, 'write' + passive can mean 'sign' or 'subscribe' (alongside regular 'be written'), so if you have translations in your analyzer, you would want to have that somewhere. In some cases, I add a 'secondary lemma', so that e.g. a word glossed as write-PASS-PST1.SG would have write-INF as its lemma, but also write-PASS-INF as a secondary lemma (which is a separate field). Then, if your search platform allows it, you could search for just write-INF and find all instances, including the passive ones, or for write-PASS-INF and find only the passive ones. That makes sense if you want to disregard productive derivations in your lemmatization, while the descriptive tradition for your language always includes the derivations in the lemma (so many corpus users would expect them to be there).
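The secondary-lemma search described above can be pictured with a small Python sketch. It does not use uniparser-morph itself; the token dictionaries and the field name `lemma2` are assumptions made up for illustration, and the search platform is imagined as a plain list filter.

```python
# Hypothetical analysed tokens: "lemma2" stands in for the separate
# secondary-lemma field (the name is an assumption, not uniparser's).
tokens = [
    {"gloss": "write-PST1.SG", "lemma": "write-INF", "lemma2": None},
    {"gloss": "write-PASS-PST1.SG", "lemma": "write-INF",
     "lemma2": "write-PASS-INF"},
]


def find_tokens(tokens, lemma):
    """Return tokens whose primary or secondary lemma equals `lemma`."""
    return [t for t in tokens if lemma in (t["lemma"], t.get("lemma2"))]


all_write = find_tokens(tokens, "write-INF")          # both tokens
passive_only = find_tokens(tokens, "write-PASS-INF")  # only the passive one
```

Searching for the plain lemma finds every instance, while searching for the derived lemma narrows the result to the passive forms.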

Now, the conceptual objective of derivations.txt is to make new lexical items out of existing ones. You can think of each derivation as a tool that, applied to lexemes in the lexemes.txt dictionary, produces new lexemes that are automatically added to the dictionary. Of course, in reality that would take too much space in a derivation-rich language, so they are treated in a similar way to morphemes, but that's the idea. So if the derivations worked fine, they would probably be your tool of choice for the task you described above. The problem is, they do not always work as expected, so you have to check every time. The most serious limitation right now is that you cannot apply multiple derivations to a lexeme.
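As a rough picture of that idea (a derivation as a tool that turns one dictionary entry into another), here is a minimal Python sketch. It is not how uniparser-morph is implemented; the `Lexeme` and `Derivation` classes and the `derive` function are hypothetical, and the sketch only handles a plain suffix on a stem of the form `.stem.`.

```python
from dataclasses import dataclass, field


@dataclass
class Lexeme:
    lex: str
    stem: str
    gramm: str
    paradigms: list = field(default_factory=list)
    extra: dict = field(default_factory=dict)  # e.g. trans_en, gloss


@dataclass
class Derivation:
    suffix: str
    gramm: str
    paradigm: str


def derive(source: Lexeme, deriv: Derivation) -> Lexeme:
    """Build a new lexeme; nothing in `extra` is inherited from the source."""
    new_stem = source.stem.rstrip(".") + deriv.suffix + "."
    return Lexeme(
        lex=source.lex + deriv.suffix,
        stem=new_stem,
        gramm=deriv.gramm,
        paradigms=[deriv.paradigm],
    )


verb = Lexeme("yarika", ".yarika.", "V", ["verb"], {"trans_en": "laugh at"})
noun = derive(verb, Derivation("topo", "N", "noun"))
print(noun.lex, noun.stem, noun.gramm)
```

The derived entry gets its own lemma and word class instead of stacking `N` on top of `V`, which is the behaviour asked for in the original question.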

Given that, there are some problems with your pull requests:

  1. The stems are indeed supposed to be separated by the pipe sign. The reason is that the format of the stem field in derivations follows the rules for lexemes, not morphemes. (I'll say it once more: it was really stupid of me to assign different interpretations to the pipe sign in different files. But it is what it is now.) The lexeme produced by the derivation might belong to a grammatical class that has multiple stem allomorphs. E.g. in this Albanian nominalizing derivation, you take a verb that has multiple stems, and you make a noun that has two stems. You use the verb's 3rd stem to produce the new lemma, the same stem to produce the nominalization's one stem and the verb's 0th stem to produce the nominalization's other stem. You could also have free variants of each of the new stems separated by // inside the stem field. If you replace | with //, you can only produce lexemes that have only one stem allomorph. The question is then, how do you split the derivation into morphemes if you want it to be properly glossed? I guess we should use the & here if we follow the lexemes.txt format. But I'm almost sure it's not going to work as is, so I would need to implement that first.
  2. Regarding what you would like to be carried over to the new lexeme. My idea was that by default you want to get rid of the values stored in the source lexeme because now you have a new one. E.g. its translation would certainly be different now. However, I see a point in leaving something like an id. But maybe it would be a good idea to say it explicitly in a derivation that you want to save some data from the source lexemes.
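The two points above can be sketched in Python. This is only an illustration of the conventions as described here, not the parser's actual code: `parse_stem_field` and `carry_over` are hypothetical names, the stems are placeholders, and the `keep` argument models the proposed (not existing) way of explicitly saving data from the source lexeme.

```python
def parse_stem_field(value: str) -> list[list[str]]:
    """Split a lexeme-style stem field.

    "|" separates stem allomorphs; "//" separates free variants
    of one allomorph.
    """
    return [
        [variant.strip() for variant in slot.split("//")]
        for slot in value.split("|")
    ]


def carry_over(source: dict, keep: tuple = ()) -> dict:
    """Copy only the explicitly requested fields from the source lexeme."""
    return {name: source[name] for name in keep if name in source}


# Two stem allomorphs, the second with two free variants.
two_stems = parse_stem_field("aaa.|bbb.//ccc.")
# With "//" only, there is a single allomorph slot.
one_stem = parse_stem_field("aaa.//bbb.")

source_fields = {"id": "42", "trans_en": "laugh at"}
kept = carry_over(source_fields, keep=("id",))
```

By default `carry_over` returns nothing, matching the idea that the derived lexeme starts clean unless the derivation says otherwise.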

Could you maybe attach some excerpt from your grammar as a MWE and outline your ideas about how that should work? Or we could talk over Zoom sometime next week.

fmatter commented 2 years ago

Sorry for the long wait, too! So I finally got around to putting together a MWE for my productive derivation issue. Basically, you can adverbialize verbs with -se 'PTCP', and you can nominalize adverbs with -mï 'AGT.NMLZ'. That works, including changing the POS. However, these two productive processes can also be combined, so -se-mï will turn a verb into an agentive noun. How do I model that? The MWE shows what I've achieved so far.

uniparser_mwe.zip

fmatter commented 1 year ago

Any news on this? Otherwise, I think I'll just model a single suffix -semï and then add some postprocessing in my parser.

timarkh commented 1 year ago

I've been fixing some derivations-related stuff for the past couple of weeks, and it seems your example will work now, both ways. There was a bug that prevented some sequences of derivational affixes from attaching to the stem in the correct order, and another bug that made it impossible to split a derivation into several affixes.

The first bug was actually a by-product of a conceptual problem with derivations vs. inflections. With inflections, I expect the user to know in advance where exactly some morphology may appear inside or around the stem, including whether stems can have prefixes, suffixes or both. These places are designated by dots in the stem and the paradigms. However, my initial idea with derivations was that, in many languages, you have exclusively suffixing inflectional morphology, but some derivations may be prefixing. So you probably wouldn't want to have a leading dot in all of your stems and corresponding dots in all of your inflectional affixes, only to be able to accommodate a couple of productive prefixing derivations. It means that, in a language like that, you would have a stem XXX. with inflectional suffixes .a, .b, .c and derivational prefixes such as p. (with a trailing dot). In that case, you would need to apply different combination rules to XXX. and p., because if you treat p. as an inflection, they wouldn't combine. So basically my algorithm was "When combining a derivation with a stem, start with the derivation and ignore the absence of a leading dot in the stem." One of the bugs that followed was that sometimes, when two suffixing derivations in a row attached to a stem, one of them would turn into a prefix. I fixed that particular problem, but I'm afraid it's still a bit of a mess in other respects. Maybe I'll just get rid of this "simplification" and require that the derivations are treated like inflectional affixes in terms of leading/trailing dots.
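The difference between the two combination rules can be shown with a toy Python sketch. It is a deliberate simplification, not the parser's algorithm: both functions are hypothetical, and they only look at a single dot at the edge of the stem and affix, ignoring everything else the grammar format allows.

```python
from typing import Optional


def combine_inflection(stem: str, affix: str) -> Optional[str]:
    """Join only where stem and affix both have a dot at the junction."""
    if affix.startswith(".") and stem.endswith("."):   # suffix slot
        return stem[:-1] + affix[1:]
    if affix.endswith(".") and stem.startswith("."):   # prefix slot
        return affix[:-1] + stem[1:]
    return None                                        # no matching slot


def combine_derivation(stem: str, affix: str) -> Optional[str]:
    """Like inflection, but a prefix may attach without a leading stem dot."""
    result = combine_inflection(stem, affix)
    if result is None and affix.endswith("."):
        return affix[:-1] + stem   # keep the stem's trailing dot for suffixes
    return result


as_inflection = combine_inflection("XXX.", "p.")   # no leading dot in stem
as_derivation = combine_derivation("XXX.", "p.")   # prefix attaches anyway
inflected = combine_inflection(as_derivation, ".a")
```

Treated as an inflection, `p.` has nowhere to attach on `XXX.`; treated as a derivation, it attaches and the result can still take the inflectional suffix `.a`.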

There are a couple of corrections you have to make in your files though:

Hope that helps.

fmatter commented 1 year ago

Oh, now even

-deriv-type: V-se
 lex: [.]se
 stem: [.]se<.>
 paradigm: adv
 gramm: adv,ptcp
 gloss: PTCP
 id: septcp

-deriv-type: A-mi
 lex: [.]+mï
 stem: [.]mï<.>
 paradigm: noun
 gramm: n
 gloss: NMLZ
 id: minmlz

works, without having to specify [.]se&mï.

this is great, thanks!
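For readers following the thread, the stacking that now works (verb to adverb with -se, adverb to noun with -mï) can be pictured with a short Python sketch. It does not call uniparser-morph; `Deriv` and `apply_chain` are hypothetical, the stem `xxx` is a placeholder rather than a real Yawarana form, and checking the source word class before attaching is an assumption about how the chain is constrained.

```python
from typing import NamedTuple


class Deriv(NamedTuple):
    source: str  # word class the affix attaches to (assumed constraint)
    suffix: str
    gramm: str   # tags of the derived lexeme
    gloss: str


SE = Deriv("V", "se", "adv,ptcp", "PTCP")
MI = Deriv("adv", "mï", "n", "NMLZ")


def apply_chain(stem, gramm, derivs):
    """Apply derivations in order; each one replaces the current tags."""
    form, glosses = stem, []
    for d in derivs:
        if d.source not in gramm.split(","):
            raise ValueError(f"-{d.suffix} cannot attach to {gramm}")
        form += d.suffix
        gramm = d.gramm
        glosses.append(d.gloss)
    return form, gramm, glosses


form, gramm, glosses = apply_chain("xxx", "V", [SE, MI])
print(form, gramm, "-".join(glosses))
```

Each step replaces the word class instead of adding to it, so the end result carries only the noun tag even though it started from a verb.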