Ok, so I've found the `deriv-link: X` functionality :)
Sorry again for making you wait! One of the reasons is that this derivation thing is actually something half-baked, and it took me some time to remember how it works and why it does what it does.
First, a remark about `lex_rules`: they work just fine (unlike the derivations), but you're right, they can only add values to analyses at a post-processing stage. I actually do use them with some productive derivations, but mainly just to add a separate, non-compositional translation to a combination of a stem and an affix. E.g. in Udmurt, 'write' + passive can mean 'sign' or 'subscribe' (alongside regular 'be written'), so if you have translations in your analyzer, you would want to have that somewhere. In some cases, I add a 'secondary lemma', so that e.g. a word glossed as `write-PASS-PST1.SG` would have `write-INF` as its lemma, but also `write-PASS-INF` as a secondary lemma (which is a separate field). Then, if your search platform allows it, you could search for just `write-INF` and find all instances, including the passive ones, or for `write-PASS-INF` and find only the passive ones. That makes sense if you want to disregard productive derivations in your lemmatization, while the descriptive tradition for your language always includes the derivations in the lemma (so many corpus users would expect them to be there).
Now, the conceptual objective of `derivations.txt` is to make new lexical items out of existing ones. You can think of each derivation as a tool that, applied to lexemes in the `lexemes.txt` dictionary, produces new lexemes that are automatically added to the dictionary. Of course, in reality that would take too much space in a derivation-rich language, so they are treated in a similar way to morphemes, but that's the idea. So if the derivations worked fine, they would probably be your tool of choice for the task you described above. The problem is, they do not always work as expected, so you have to check every time. The most serious limitation right now is that you cannot apply multiple derivations to a lexeme.
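For example, here is a sketch of such a derivation in `derivations.txt`, written much like a lexeme entry (it is essentially the participial derivation that comes up later in this thread, minus the `id`):

```
-deriv-type: V-se
 lex: [.]se
 stem: [.]se<.>
 paradigm: adv
 gramm: adv,ptcp
 gloss: PTCP
```

Applied to a verb lexeme, say one with `lex: yarika` and stem `yarika.`, this behaves as if a new adverb lexeme `yarikase` with `gramm: adv,ptcp` and the gloss PTCP had been added to `lexemes.txt`. In practice the derivation is treated more like an affix instead of materializing every such lexeme, and it is attached to the base lexeme's paradigm via a `deriv-link` (shown further down).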
Given that, there are some problems with your pull requests:
The `stem` field in derivations follows the rules for lexemes, not morphemes. (I'll say it once more: it was really stupid of me to assign different interpretations to the pipe sign in different files. But it is what it is now.) The lexeme produced by the derivation might belong to a grammatical class that has multiple stem allomorphs. E.g. in this Albanian nominalizing derivation, you take a verb that has multiple stems and make a noun that has two stems: the verb's 3rd stem produces the new lemma, the same stem produces one of the nominalization's stems, and the verb's 0th stem produces the other. You could also have free variants of each of the new stems separated by `//` inside the `stem` field. If you replace `|` with `//`, you can only produce lexemes that have a single stem allomorph (see the sketch below). The question then is how you split the derivation into morphemes if you want it to be properly glossed. I guess we should use `&` here if we follow the `lexemes.txt` format, but I'm almost sure it's not going to work as is, so I would need to implement that first.

Then there is `id`. But maybe it would be a good idea to say it explicitly in a derivation that you want to save some data from the source lexemes.

Could you maybe attach some excerpt from your grammar as a MWE and outline your ideas about how that should work? Or we could talk over Zoom sometime next week.
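To illustrate the `|` vs. `//` distinction mentioned above, with made-up English stems (purely for illustration): `|` separates stem allomorphs that the paradigm picks between, while `//` separates free variants of one and the same stem.

```
stem: wiv.|wife.
stem: colour.//color.
```

The first line gives a lexeme two allomorphs (different inflections can select different ones); the second gives it a single stem with two interchangeable spellings. In a derivation's `stem` field, only the `|` version would let the derived lexeme have more than one allomorph.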
Sorry for the long wait, too! So I finally got around to putting together a MWE for my productive derivation issue. Basically, you can adverbialize verbs with -se 'PTCP', and you can nominalize adverbs with -mï 'AGT.NMLZ'. That works, including changing the POS. However, these two productive processes can also be combined, so -se-mï will turn a verb into an agentive noun. How do I model that? The MWE shows what I've achieved so far.
Any news on this? Otherwise, I think I'll just model a single suffix -semï and then add some postprocessing in my parser.
I've been fixing some derivations-related stuff for the past couple of weeks, and it seems your example will work now, both ways. There was a bug that prevented some sequences of derivational affixes from attaching to the stem in the correct order, and another bug that made it impossible to split a derivation into several affixes.
The first bug was actually a by-product of a conceptual problem with derivations vs. inflections. With inflections, I expect the user to know in advance where exactly some morphology may appear inside or around the stem, including whether stems can have prefixes, suffixes or both. These places are designated by dots in the stem and in the paradigms.

However, my initial idea with derivations was that, in many languages, you have exclusively suffixing inflectional morphology, but some derivations may be prefixing. So you probably wouldn't want to have a leading dot in all of your stems and corresponding dots in all of your inflectional affixes, only to be able to accommodate a couple of productive prefixing derivations. It means that, in a language like that, you would have a stem `XXX.` with inflectional suffixes `.a`, `.b`, `.c` and derivational prefixes like `p.`. In that case, you would need to apply different combination rules to `XXX.` and `p.`, because if you treated `p.` as an inflection, they wouldn't combine. So basically my algorithm was: "When combining a derivation with a stem, start with the derivation and ignore the absence of a leading dot in the stem." One of the bugs that followed was that sometimes, when two suffixing derivations in a row attached to a stem, one of them would turn into a prefix. I fixed that particular problem, but I'm afraid it's still a bit of a mess in other respects. Maybe I'll just get rid of this "simplification" and require that derivations be treated like inflectional affixes in terms of leading/trailing dots.
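A minimal sketch of that dot convention, with invented names: the stem's only attachment point is its trailing dot, and each inflectional affix in the paradigm starts with a dot that docks onto it. In `lexemes.txt`:

```
-lexeme
 lex: XXX
 stem: XXX.
 gramm: v
 paradigm: xxx-infl
```

And in `paradigms.txt`:

```
-paradigm: xxx-infl
 -flex: .a
  gramm: a
 -flex: .b
  gramm: b
```

A prefixing derivation like `p.` then has to attach on the left even though `XXX.` has no leading dot, which is exactly the special-casing described above.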
There are a couple of corrections you have to make in your files though:
If you choose the second strategy (lump all derivational affixes together), you would have to use the conventions for `lexemes.txt`. There, `&` is used to split affixes and glosses, while `|` only separates stem allomorphs. (A derived lexeme can have multiple stem allomorphs, just as a normal one.) So something like this:
```
-deriv-type: V-semi
 lex: [.]+semï
 stem: [.]se&mï.//[.]che&mï.
 paradigm: noun
 gramm: n
 gloss: PTCP&NMLZ
 id: septcp,minmlz
```
You don't need the `<.>` segments in the affixes if you only have links to derivations, but not to other inflectional paradigms. So:
```
-paradigm: v_nmlz
 -flex: .
  gramm:
  deriv-link: V-se
```
Hope that helps.
Oh, now even
```
-deriv-type: V-se
 lex: [.]se
 stem: [.]se<.>
 paradigm: adv
 gramm: adv,ptcp
 gloss: PTCP
 id: septcp

-deriv-type: A-mi
 lex: [.]+mï
 stem: [.]mï<.>
 paradigm: noun
 gramm: n
 gloss: NMLZ
 id: minmlz
```
works, without having to specify `[.]se&mï.`
this is great, thanks!
Cariban languages have a lot of productive, word class changing derivational morphology. I understand that I can model it in `paradigms.txt` and treat it like inflectional morphology. However, I would like the grammatical tags and the lemmata to be updated accordingly.

I can see that there is a `derivations.txt` file for productive derivations, and a `lex_rules.txt` file allowing modification of lemmata. However, neither of these files is documented, and the examples I found 'in the wild' did not make things clearer for me. How does this Albanian nominalizer work? And these lex rules all seem to only add things to entries.

Here's what I would like to know how to do for Yawarana (it's fine if it's not possible, too!): the verb `yarika` 'to laugh at' has the grammatical tag `V`. It can occur with inflectional (or non-POS-changing) morphology, and the corresponding grammatical tags are added appropriately. When it occurs with what I've for now modelled in `paradigms.txt`, then the `N` tag is added: yarikatopo means 'something to laugh at' and is a noun. How can I reflect this in the `gramm` field (and in `lex` and possibly `trans_en`, too)?
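For concreteness, a minimal sketch of the kind of setup described above; these entries are an illustration rather than the original files (the stem shape, the paradigm name and the NMLZ gloss for -topo are assumptions). In `lexemes.txt`:

```
-lexeme
 lex: yarika
 stem: yarika.
 gramm: v
 gloss: laugh_at
 trans_en: laugh at
 paradigm: v-infl
```

And in `paradigms.txt`:

```
-paradigm: v-infl
 -flex: .topo
  gramm: n
  gloss: NMLZ
```

With a purely inflectional treatment like this, `n` is merely added on top of the verb's tags and the lemma and translation stay those of `yarika`; making yarikatopo a noun lexeme in its own right, with its own lemma and `trans_en`, is what the derivation machinery discussed earlier in this thread is meant for.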