stem_model intro: indeclineables

funderburkjim commented 6 years ago

A particular program (stem_model.py) is used to interpret the inflection information for nouns, adjectives and indeclineables that is derived from certain meta-information present in the revised MW digitization.

This inflection information is present in the lexnorm-all2.txt file, whose format is described in #2.

Before describing the interpretation that stem_model does, it may be useful to look at a few examples of the lexnorm input.

funderburkjim commented 6 years ago

lexnorm-all2 samples

L	key1	key2	lexnorm
2	akAra	a-kAra	m
5	a	a	LEXID-pron,STEM-idam
7	a	a	m
8	afRin	a-fRin	m:f:n
10	aMSa	aMSa	m
20	aMSakaraRa	aMSa-karaRa	n
21	aMSakalpanA	aMSa-kalpanA	f
39	aMSaka	aMSaka	m:f#ikA:n

The last field 'lexnorm' contains a normalization of the information marked within a <lex> tag of the digitization. Here are extracts from the digitization corresponding to some entries of the table.

L	body	`<info lex=>`
2	`<s>a—kAra</s>` ¦ `<lex>m.</lex>` the letter or sound `<s>a</s>`.	`<info lex="m"/>`
5	`<s>a</s>` `<hom>`4`</hom>` ¦ the base of some pronouns and `<ab>pronom.</ab>` forms, in `<s>asya</s>`, `<s>atra</s>`, &c.	`<info lexcat="LEXID=pron,STEM=idam"/>`
8	`<s>a-fRin</s>` ¦ `<lex>mfn.</lex>` free from debt, `<ls>L.</ls>`	`<info lex="m:f:n"/>`
39	`<hom>`1.`</hom>` `<s>aMSaka</s>` ¦ `<lex>mf(<s>ikA</s>)n.</lex>` (`<ab>ifc.</ab>`) forming part.	`<info lex="m:f#ikA:n"/>`

Note:

The content of the info field in the digitization corresponds to the lexnorm field of the table.
This info field is usually just a standardization of what appears within the <lex> tag.
- In such cases, the standardization was accomplished by applying lots of regular expression goodness - a tedious but fairly straightforward task.
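The kind of regular-expression standardization mentioned in the note can be illustrated with a minimal sketch. This is not the code that was actually used on the digitization; `normalize_lex` is a hypothetical helper that covers only the `<lex>` patterns visible in the samples of this issue (m., mfn., mf(`<s>ikA</s>`)n., ind.) and rejects anything else rather than guessing.

```python
import re

# One component: a gender letter (or 'ind'), optionally followed by a
# parenthesized feminine stem such as (<s>ikA</s>).
# Assumption: only the patterns shown in the samples above are handled.
_TOKEN = r"(ind|[mfn])(?:\(<s>([^<]+)</s>\))?"
_WHOLE = re.compile("(?:" + _TOKEN + ")+")
_ONE = re.compile(_TOKEN)


def normalize_lex(lex_text):
    """Turn the text inside a <lex> tag into a lexnorm-style string."""
    text = lex_text.strip().rstrip(".")
    if not _WHOLE.fullmatch(text):
        raise ValueError("unhandled <lex> content: %r" % lex_text)
    parts = []
    for match in _ONE.finditer(text):
        gender, stem = match.group(1), match.group(2)
        # A feminine stem variant is attached to its gender with '#'.
        parts.append(gender if stem is None else gender + "#" + stem)
    return ":".join(parts)


print(normalize_lex("mf(<s>ikA</s>)n."))
```

The real task involves many more patterns, which is presumably why it took a lot of regular expressions.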

In the pronoun example, the lexcat field was originally generated by Chandrashekar and Peter. Note that there are several entries elsewhere in the dictionary identified as pronouns with stem idam:

4749	ayam	ayam	LEXID=pron,STEM=idam
28537	i	i	LEXID=pron,STEM=idam
28801	idam	idam	LEXID=pron,STEM=idam
29337	iyam	iyam	LEXID=pron,STEM=idam
40112	ena	ena	LEXID=pron,STEM=idam,etad
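For readers who want to take such a field apart, here is a small sketch. `parse_lexcat` is a hypothetical name, and the treatment of a bare value such as `etad` (as a further value of the preceding key, here STEM) is my assumption about the format, not something stated by the data.

```python
def parse_lexcat(field):
    """Split e.g. 'LEXID=pron,STEM=idam,etad' into a dict of value lists.

    Assumption: a comma-separated part without '=' continues the
    previous key, so 'etad' is read as a second STEM value.
    """
    result = {}
    current_key = None
    for part in field.split(","):
        if "=" in part:
            current_key, value = part.split("=", 1)
            result.setdefault(current_key, []).append(value)
        elif current_key is not None:
            result[current_key].append(part)
        else:
            raise ValueError("value without a key: %r" % part)
    return result


print(parse_lexcat("LEXID=pron,STEM=idam,etad"))
```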

funderburkjim commented 6 years ago

stem_model overview

The stem_model program aims to assign appropriate (stem,model) pairs for each record in lexnorm-all2.

It does this in several submodules, each of which deals with a restricted subset of the records. Each submodule scans all the records of lexnorm-all2, and for each record:

Skips the record if it has already been parsed by a previous submodule.
Uses key2 and the lexnorm field to decide whether it can handle the record.
If it can handle the record, generates a (stem,model) pair for each component of the lexnorm field.
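The scanning scheme just described can be sketched as follows. This is only an illustration of the control flow, not the actual stem_model.py; the names `Record`, `parse_record` and `run_submodules` are hypothetical, and a submodule is assumed to be a function that returns a list of (model, stem) pairs, or None when it cannot handle the record.

```python
from collections import namedtuple

# The four tab-separated fields of a lexnorm-all2 line.
Record = namedtuple("Record", "L key1 key2 lexnorm")


def parse_record(line):
    """Parse one tab-separated line of lexnorm-all2."""
    L, key1, key2, lexnorm = line.rstrip("\n").split("\t")
    return Record(L, key1, key2, lexnorm)


def run_submodules(records, submodules):
    """Apply each submodule in turn to the records not yet handled."""
    handled = {}  # L -> list of (model, stem) pairs
    for submodule in submodules:
        for rec in records:
            if rec.L in handled:
                continue  # already parsed by a previous submodule
            pairs = submodule(rec)
            if pairs is not None:
                handled[rec.L] = pairs
    return handled
```

Because earlier submodules claim records first, the order of the `submodules` list matters in this sketch.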

This process will become clearer as we proceed through the submodules. But let's start with the very simplest submodule.

funderburkjim commented 6 years ago

'pure' indeclineables

One of the simplest lexnorm fields is the one whose only component is 'ind'. For example,

70283 ca ca ind

The 'model_ind' module of stem_model identifies just such cases. It outputs all such cases to a file named 'ind.txt'. The line corresponding to 'ca' in ind.txt is

ind ca 70283,ca

This stem model file has 3 fields in each line:

model
stem (usually 'key2', but not always -- later examples will show how 'key2' may be altered in a stem-model file)
the L,key1 pairs with this model and stem. In our example, 70283,ca
- sometimes, a given model-stem will be inferred in more than 1 record of lexnorm-all2; in such cases there will be additional L,key1 pairs (separated by colons) in this third field.
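As a rough sketch of how lines in this three-field format could be produced for the 'pure' indeclineables: the function name `model_ind_lines` is hypothetical, and the field separator is an assumption (shown here as a parameter defaulting to a tab), since the example line above does not make it unambiguous.

```python
from collections import OrderedDict


def model_ind_lines(records, sep="\t"):
    """Build stem-model lines for records whose lexnorm is exactly 'ind'.

    `records` is an iterable of (L, key1, key2, lexnorm) tuples.
    Assumption: the stem is key2 unchanged, as in the 'ca' example.
    """
    groups = OrderedDict()  # (model, stem) -> list of 'L,key1' strings
    for L, key1, key2, lexnorm in records:
        if lexnorm != "ind":
            continue  # left for some other submodule
        groups.setdefault(("ind", key2), []).append("%s,%s" % (L, key1))
    return [
        sep.join([model, stem, ":".join(refs)])
        for (model, stem), refs in groups.items()
    ]


sample = [
    ("70283", "ca", "ca", "ind"),
    ("98580", "dvibarhAs", "dvi-barhAs", "n:ind"),
]
print(model_ind_lines(sample, sep=" "))
```

When two records share the same model and stem, their L,key1 pairs end up in the same third field, joined by a colon.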

ind.txt

This ind.txt file has the stems with model 'ind'.

Additional entries in ind.txt

When we later apply different submodules, there will be additional entries in ind.txt. For example, 98580 dvibarhAs dvi-barhAs n:ind.

MW text: dvi—barhās n. and ind., doubly close or thick or strong

dvi-barhAs can be an indeclineable, according to MW. But it can also be declined as a neuter noun. Since it has these two forms, the 'model_ind' submodule of stem_model skips it. Another submodule, not yet written, will have to handle it. When it does, then we will have another entry in ind.txt: ind dvi-barhAs 98580,dvibarhAs
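The distinction between a 'pure' indeclineable and a record like this one comes down to the components of the lexnorm field. A minimal sketch, with hypothetical function names and assuming that components are separated by ':' as in the samples:

```python
def lexnorm_components(lexnorm):
    """Split a lexnorm field such as 'n:ind' into its components."""
    return lexnorm.split(":")


def is_pure_ind(lexnorm):
    """True only when 'ind' is the sole component (what model_ind accepts)."""
    return lexnorm_components(lexnorm) == ["ind"]


def has_ind_component(lexnorm):
    """True when 'ind' appears among possibly several components."""
    return "ind" in lexnorm_components(lexnorm)


print(is_pure_ind("n:ind"), has_ind_component("n:ind"))
```

A later submodule for mixed records could use a test like `has_ind_component`, but how it will really work is not yet decided.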

funderburkjim commented 6 years ago

Nothing to decline

Since we thus far have just filtered out (some of) the indeclineables, and since there is no declension of indeclineables, there is nothing more to do here, at least for now.

funderburkjim commented 6 years ago

Why the stem-model approach?

This is the approach which seems most reasonable based upon my study of recent grammars.

The Panini approach, based upon my limited understanding of it (primarily gleaned from Scharf's gshell program), has different emphases.

The next stem_model submodule (model_m_a) and the associated declension module will start to flesh out the stem_model approach.

funderburkjim commented 6 years ago

outputs/nominals/ind.txt

The indeclineables are put into a format similar to that of normal declineables by the decline_file program:

# in inflect directory
python decline_file.py ../inputs/nominals/ind.txt ../outputs/nominals/ind.txt

This is for anticipated convenience when later creating databases. A typical output is

model key2     key1
ind      a-kAle  akAle
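Since an indeclineable has no inflected forms, the conversion amounts to rearranging fields. The following is only a guess at the idea, not the decline_file program itself: `ind_output_rows` is a hypothetical function, the tab separator is an assumption, and I assume one output row per L,key1 pair. The L value "0" in the usage line is a placeholder, not the real record number.

```python
def ind_output_rows(stem_model_line, sep="\t"):
    """Turn one 'model stem L,key1[:L,key1...]' line into output rows.

    Each row has the fields model, key2 (the stem) and key1.
    """
    model, stem, refs = stem_model_line.rstrip("\n").split(sep)
    rows = []
    for ref in refs.split(":"):
        L, key1 = ref.split(",", 1)  # L is not part of the output row
        rows.append(sep.join([model, stem, key1]))
    return rows


# "0" is a placeholder record number used only for illustration.
print(ind_output_rows("ind\ta-kAle\t0,akAle"))
```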



The decline_file program and the general output format will be described in the next issue, for the
m_a model.

sanskrit-lexicon / MWinflect