Open funderburkjim opened 6 years ago
L | key1 | key2 | lexnorm |
---|---|---|---|
2 | akAra | a-kAra | m |
5 | a | a | LEXID=pron,STEM=idam |
7 | a | a | m |
8 | afRin | a-fRin | m:f:n |
10 | aMSa | aMSa | m |
20 | aMSakaraRa | aMSa-karaRa | n |
21 | aMSakalpanA | aMSa-kalpanA | f |
39 | aMSaka | aMSaka | m:f#ikA:n |
The last field 'lexnorm' contains a normalization of the information marked within a <lex> tag of the digitization. Here are extracts from the digitization corresponding to some entries of the table.
L | body | <info lex=> |
---|---|---|
2 | <s>a—kAra</s> ¦ <lex>m.</lex> the letter or sound <s>a</s> . | <info lex="m"/> |
5 | <s>a</s> <hom> 4</hom> ¦ the base of some pronouns and <ab>pronom.</ab> forms, in <s>asya</s> , <s>atra</s> , &c. | <info lexcat="LEXID=pron,STEM=idam"/> |
8 | <s>a-fRin</s> ¦ <lex>mfn.</lex> free from debt, <ls>L.</ls> | <info lex="m:f:n"/> |
39 | <hom> 1.</hom> <s>aMSaka</s> ¦ <lex>mf(<s>ikA</s>)n.</lex> (<ab>ifc.</ab> ) forming part. | <info lex="m:f#ikA:n"/> |
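To make the lexnorm notation concrete, here is a minimal sketch of how such a value might be split into its components. It is not code from stem_model.py; the function name `parse_lexnorm` is hypothetical, and the reading of `:` as a component separator and `#` as introducing the form given in parentheses (as in `mf(ikA)n.` becoming `m:f#ikA:n`) is inferred only from the examples above.

```python
def parse_lexnorm(lexnorm):
    """Split a lexnorm value into (code, extra) pairs.

    Assumptions (inferred from the sample rows, not from stem_model.py):
      - components are separated by ':'
      - within a component, '#' introduces an extra form, e.g. 'f#ikA'
      - values starting with 'LEXID' are special and returned unsplit
    """
    if lexnorm.startswith("LEXID"):
        return [(lexnorm, None)]
    parts = []
    for component in lexnorm.split(":"):
        code, _, extra = component.partition("#")
        parts.append((code, extra or None))
    return parts


print(parse_lexnorm("m:f#ikA:n"))
```

Keeping the special `LEXID` values unsplit is just a cautious choice for this sketch, since their internal structure is not described here.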
Note:
<lex>
tag.
idam
4749 ayam ayam LEXID=pron,STEM=idam
28537 i i LEXID=pron,STEM=idam
28801 idam idam LEXID=pron,STEM=idam
29337 iyam iyam LEXID=pron,STEM=idam
40112 ena ena LEXID=pron,STEM=idam,etad
The stem_model program aims to assign appropriate (stem,model) pairs for each record in lexnorm-all2.
It does this in several submodules, each of which deals with a restricted subset of the records. Each submodule scans all the records of lexnorm-all2, and for each record
This process will become clearer as we proceed through the submodules. But let's start with the very simplest submodule.
One of the simplest lexnorm fields is the one which has only the one component 'ind'. For example,
70283 ca ca ind
The 'model_ind' module of stem_model identifies just such cases. It outputs all such cases to a file named 'ind.txt'. The line corresponding to 'ca' in ind.txt is
ind ca 70283,ca
This stem model file has 3 fields in each line:
70283,ca
This ind.txt file has the stems with model 'ind'.
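The selection done by 'model_ind' can be sketched roughly as follows. This is an illustration, not the actual stem_model.py code: the function name `model_ind`, the whitespace splitting of input records, and the tab separator in the output are all assumptions. The output layout (model, key2, then L and key1 joined by a comma) follows the 'ca' example line.

```python
def model_ind(records):
    """Yield a stem-model line for each record whose lexnorm is exactly 'ind'.

    Each input record is assumed to have four whitespace-separated fields:
    L, key1, key2, lexnorm (as in lexnorm-all2).
    """
    for record in records:
        fields = record.split()
        if len(fields) != 4:
            continue  # not a well-formed record for this sketch
        L, key1, key2, lexnorm = fields
        if lexnorm != "ind":
            continue  # mixed cases such as 'n:ind' are left to other submodules
        # Output: model, stem (key2), and the reference 'L,key1'
        yield "\t".join(["ind", key2, f"{L},{key1}"])


sample = [
    "70283 ca ca ind",
    "98580 dvibarhAs dvi-barhAs n:ind",
    "2 akAra a-kAra m",
]
lines = list(model_ind(sample))
print(lines)
```

Only the 'ca' record is selected from this sample; the other two records are skipped because their lexnorm is not exactly 'ind'.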
When we later apply different submodules, there will be additional entries in ind.txt.
For example, 98580 dvibarhAs dvi-barhAs n:ind.
MW text: dvi—barhās n. and ind., doubly close or thick or strong
According to MW, dvi-barhAs can be an indeclinable, but it can also be declined as a neuter noun.
Since it has these two forms, the 'model_ind' submodule of stem_model skips it. Another submodule, not yet written, will have to handle it. When it does, we will have another entry in ind.txt:
ind dvi-barhAs 98580,dvibarhAs
Since we have thus far just filtered out some of the indeclinables, and since indeclinables have no declension, there is nothing more to do here, at least for now.
This is the approach which seems most reasonable based upon my study of recent grammars.
Based upon my limited understanding (primarily gleaned from Scharf's gshell program), the Panini approach has different emphases.
The next stem_model submodule (model_m_a), and the associated declension module, will start to flesh out the stem_model approach.
The indeclinables are put into a format similar to that of normal declinables by the following program:
# in inflect directory
python decline_file.py ../inputs/nominals/ind.txt ../outputs/nominals/ind.txt
This is for anticipated convenience when later creating databases. A typical output is
model key2 key1
ind a-kAle akAle
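As a rough illustration of this reformatting, the sketch below turns one stem-model line into an output row with the fields model, key2, key1. It is not the actual decline_file.py; the function name `ind_output_row` and the tab-separated layout are assumptions, and only the simple case of a single `L,key1` reference per line is handled.

```python
def ind_output_row(stem_model_line):
    """Convert 'model<TAB>key2<TAB>L,key1' into 'model<TAB>key2<TAB>key1'.

    Assumes exactly three whitespace-separated fields and a single
    'L,key1' reference; the L number is dropped from the output.
    """
    model, key2, ref = stem_model_line.split()
    _L, key1 = ref.split(",", 1)
    return "\t".join([model, key2, key1])


print(ind_output_row("ind\tca\t70283,ca"))
```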
The decline_file program and the general output format will be described in the next issue, for the m_a model.
A particular program (stem_model.py) is used to interpret the inflection information for nouns, adjectives and indeclinables that is derived from certain meta-information present in the revised MW digitization.
This inflection information is present in the lexnorm-all2.txt file, whose format is described in #2.
Before describing the interpretation that stem_model does, it may be useful to look at a few examples of the lexnorm input.