Open funderburkjim opened 6 years ago
In #4, we mentioned that there were 49344 records in the m_a.txt stem_model input file; these were cases where the lexnorm information was exactly 'm'.
The current model_mfn_a stem_model filter now also handles cases where the lexnorm contains an 'm' as well as (a) 'f' or (b) 'n' or (c) both 'f' and 'n'. For any such lexnorm (provided the citation ends in 'a'), we are modifying m_a.txt. The modification could be of two types:
aMsya appears only once in lexnorm-all2.txt, and with lexnorm = 'm:f:n'.
Thus, it did not appear in m_a.txt from the m_a filter described in #4.
So, it appears as an additional stem-model entry in m_a.txt.
We are putting cases like
m_a aMsya 106,aMsya
into the m_a file.
There are over 5420 of these. The first one, alphabetically, is akaca. In lexnorm-all2 there are two records for akaca:
138 akaca a-kaca m:f:n
140 akaca a-kaca m
The L=140 instance, with simple 'm' lexnorm, generates a line of m_a.txt:
m_a a-kaca 140,akaca
So the L=138 instance, with lexnorm 'm:f:n', has the same stem and model (for masculine) as the L=140 instance; we just augment the last field in the m_a.txt file. Effectively this changes the above line of m_a.txt for akaca to:
m_a a-kaca 138,akaca:140,akaca
By similar reasoning, we can understand that the current filter adds a new entry to f_a1.txt:
f_A a-kacA 138,akaca
and to n_a.txt:
n_a a-kaca 138,akaca
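The augmentation described above can be sketched in Python. This is only an illustration, not the actual stem_model.py code: the names `add_case` and `format_line` are hypothetical, the fields are joined with a single space as they appear in this thread, and L numbers are assumed to be plain integers so that references can be kept in L order.

```python
from collections import defaultdict

def add_case(table, model, stem, lnum, citation):
    """Record one stem_model case (hypothetical helper).

    If the (model, stem) pair already exists, the new "L,citation"
    reference is merged into it instead of creating a second line.
    """
    refs = table[(model, stem)]
    ref = (lnum, citation)
    if ref not in refs:
        refs.append(ref)
        refs.sort()  # keep references ordered by L number (assumed integer)

def format_line(model, stem, refs):
    """Render one line in the 'model stem L,citation[:L,citation...]' shape."""
    joined = ":".join(f"{lnum},{citation}" for lnum, citation in refs)
    return f"{model} {stem} {joined}"

table = defaultdict(list)
# L=140 has lexnorm 'm', so it contributes only a masculine case.
add_case(table, "m_a", "a-kaca", 140, "akaca")
# L=138 has lexnorm 'm:f:n', so it contributes one case per gender.
add_case(table, "m_a", "a-kaca", 138, "akaca")
add_case(table, "f_A", "a-kacA", 138, "akaca")
add_case(table, "n_a", "a-kaca", 138, "akaca")

lines = [format_line(model, stem, refs) for (model, stem), refs in table.items()]
for line in lines:
    print(line)
```

Keying the table on the (model, stem) pair is what makes the L=138 masculine case fold into the existing L=140 line rather than produce a duplicate.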
The number of stem_model cases now in
All in all, we are starting with 197176 cases in lexnorm-all2. The number remaining to be parsed is 32442. So in percentage terms, we've handled about 83% of the cases.
We already know how to decline m_a, n_a, and f_A models.
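For reference, the coverage figure can be checked from the two counts quoted above. The variable names are illustrative only.

```python
total_cases = 197176   # cases in lexnorm-all2
remaining = 32442      # cases not yet assigned to a stem_model

handled = total_cases - remaining
percent_handled = 100 * handled / total_cases

print(f"{handled} of {total_cases} cases handled ({percent_handled:.1f}%)")
```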
By similar reasoning, we can understand that the current filter adds a new entry to f_a1.txt:
Yes, that's easy to understand as such.
Because
138 akaca a-kaca m:f:n
140 akaca a-kaca m
does not make sense for your generative purpose; it contains repetitive data.
So in percentage terms, we've handled about 83% of the cases.
Adore stats. Actually not only adore. I use your stats at my Sanskrit classes as well. Really. So do not stop.
Our identification of declension models is based on the lexnorm-all2 file which presents MW citations and normalized declension information.
We have already considered
There are many MW headwords ending in 'a' not included in the first two categories.
the lexnorms considered here.
The stem_model.py program does the filter described here in the submodule named 'model_mfn_a'.
This filter selects those MW headwords ending in 'a' whose lexnorm information is one of
lexnorm parsing
For definiteness, consider the large 'm:f:n' category. Each of these represents three stem_model cases. Consider aMsya 'm:f:n' (belonging to the shoulder):
The current coding represents
aMsya m:f:n
by a line in each of 3 inflection input files:
m_a aMsya 106,aMsya
f_A aMsyA 106,aMsya
n_a aMsya 106,aMsya
Note the interpretation for the feminine. The lexnorm 'm:f:n' corresponds to the text form 'mfn.' We assume that the stems for the masculine and neuter are just the citation form, aMsya. We also assume that the stem for the feminine form is aMsyA, obtained from the citation form by replacing the final a by A.
In this case, the stem interpretations are surely right; they are so obvious that they hardly need to be mentioned.
Later, we'll encounter cases where the lexnorm interpretations are less obvious, although such cases are rare.
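The stem interpretation for the obvious cases can be written down as a small Python sketch. It is not the code of the model_mfn_a submodule: the function name `expand_mfn_a` is hypothetical, the output fields are joined with a single space as shown in this thread, and only the genders 'm', 'f' and 'n' on a citation ending in 'a' are handled.

```python
def expand_mfn_a(lnum, citation, lexnorm):
    """Expand one lexnorm record into stem_model lines (illustrative only).

    Masculine and neuter keep the citation form as the stem; the feminine
    stem replaces the final 'a' of the citation with 'A'.
    """
    if not citation.endswith("a"):
        raise ValueError(f"citation does not end in 'a': {citation}")
    lines = []
    for gender in lexnorm.split(":"):
        if gender == "m":
            model, stem = "m_a", citation
        elif gender == "f":
            model, stem = "f_A", citation[:-1] + "A"
        elif gender == "n":
            model, stem = "n_a", citation
        else:
            raise ValueError(f"lexnorm part not handled in this sketch: {gender}")
        lines.append(f"{model} {stem} {lnum},{citation}")
    return lines

result = expand_mfn_a(106, "aMsya", "m:f:n")
for line in result:
    print(line)
```

The less obvious lexnorm interpretations would need their own rules, which this sketch deliberately leaves out.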