citations ending in 'a', continue part 1

funderburkjim commented 6 years ago

Our identification of declension models is based on the lexnorm-all2 file which presents MW citations and normalized declension information.

We have already considered

MW headwords ending in 'a' whose lexnorm information is exactly 'm' (#4)
MW headwords ending in 'a' whose lexnorm information is exactly 'n' (#7)
MW headwords ending in 'A' whose lexnorm information is exactly 'f' or 'f#A' (#8).

There are many MW headwords ending in 'a' not included in the first two categories.

the lexnorms considered here.

The stem_model.py program does the filter described here in the submodule named 'model_mfn_a'.

This filter selects those MW headwords ending 'a' whose lexnorm information is one of

'f:n' 40 of these in lexnorm-all2
'm:f' 66 of these
'm:n' 1923 of these
'm:f:n' 36587 of these
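
As a rough illustration of the selection just described, here is a minimal sketch. It is not the actual model_mfn_a code from stem_model.py; the names MFN_A_LEXNORMS and selects_mfn_a are hypothetical, and the headword is assumed to be an SLP1 string, so a final 'a' is distinct from a final 'A'.

```python
# Hypothetical sketch of the selection rule; not the real stem_model.py code.
# The four lexnorm values handled by this filter.
MFN_A_LEXNORMS = {'f:n', 'm:f', 'm:n', 'm:f:n'}

def selects_mfn_a(key1, lexnorm):
    """Return True when the headword ends in short 'a' (SLP1) and the
    lexnorm is one of the four gender combinations listed above."""
    return key1.endswith('a') and lexnorm in MFN_A_LEXNORMS
```

With this sketch, selects_mfn_a('aMsya', 'm:f:n') is true, while a record whose lexnorm is exactly 'm' is left to the filter of #4.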

lexnorm parsing

For definiteness, consider the large 'm:f:n' category. Each of these represents three stem_model cases. Consider 'aMsya' with lexnorm 'm:f:n' (belonging to the shoulder):

m_a e.g. stem = aMsya, model = m_a
f_A e.g. stem = aMsyA, model = f_A
n_a e.g. stem = aMsya, model = n_a

The current coding represents aMsya m:f:n by a line in 3 inflection input files:

m_a.txt : m_a aMsya 106,aMsya
f_a1.txt: f_A aMsyA 106,aMsya
n_a.txt : n_a aMsya 106,aMsya

Note the interpretation for the feminine. The lexnorm 'm:f:n' corresponds to the text form 'mfn.' We assume that the stem for both the masculine and the neuter is just the citation form, aMsya. We also assume that the stem for the feminine is aMsyA, obtained from the citation form by replacing the final a by A.

In this case, the stem interpretations are surely right; they are so obvious that they hardly need to be mentioned.

Later, we'll encounter cases where the lexnorm interpretations are less obvious, although such cases are rare.
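
The interpretation above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in stem_model.py: the function name expand_mfn_a is hypothetical, and the stem is taken from the citation form exactly as in the aMsya example.

```python
# Hypothetical sketch: expand one lexnorm record into its stem_model cases.
def expand_mfn_a(L, key1, lexnorm):
    """Return a list of (model, stem, ref) tuples, one per gender in lexnorm.

    Masculine and neuter keep the citation form as the stem;
    the feminine replaces the final 'a' by 'A' (SLP1)."""
    cases = []
    ref = '%s,%s' % (L, key1)
    for gender in lexnorm.split(':'):
        if gender == 'f':
            model, stem = 'f_A', key1[:-1] + 'A'
        else:
            model, stem = gender + '_a', key1
        cases.append((model, stem, ref))
    return cases
```

For the record with L=106, expand_mfn_a('106', 'aMsya', 'm:f:n') gives the three cases shown above: m_a with stem aMsya, f_A with stem aMsyA, and n_a with stem aMsya.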

funderburkjim commented 6 years ago

revisions to the inflection input files

In #4, we mentioned that there were 49344 records in the m_a.txt stem_model input file; these were cases where the lexnorm information was exactly 'm'.

The current model_mfn_a stem_model filter now also handles cases where the lexnorm contains an 'm' together with (a) 'f', (b) 'n', or (c) both 'f' and 'n'. For any such lexnorm (provided the citation ends in 'a'), we are modifying m_a.txt. The modification can be of two types:

new stem

aMsya appears only once in lexnorm-all2.txt, with lexnorm = 'm:f:n'. Thus, it did not appear in m_a.txt from the m_a filter described in #4, and it now appears as an additional stem-model entry in m_a.txt. We are putting cases like m_a aMsya 106,aMsya into the m_a file.

additional instance of a stem already present in the m_a filter.

There are over 5420 of these. The first one, alphabetically, is akaca. In lexnorm-all2 there are two records for akaca:

138 akaca   a-kaca  m:f:n
140 akaca   a-kaca  m

The L=140 instance, with the simple lexnorm 'm', generates a line of m_a.txt:

m_a a-kaca  140,akaca

So the L=138 instance, with lexnorm 'm:f:n' has the same stem and model (for masculine) as the L=140 instance; we just augment the last field in the m_a.txt file; effectively this changes the above line of m_a.txt for akaca to:

m_a a-kaca  138,akaca:140,akaca

By similar reasoning, we can understand that the current filter adds a new entry to f_a1.txt:

f_A a-kacA  138,akaca

and to n_a.txt:

n_a a-kaca  138,akaca
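
The augmentation described above amounts to grouping instances by (model, stem) and joining their references with ':'. A minimal sketch follows, assuming tab-separated fields in the input files and references kept in the order they are encountered; merge_instances is a hypothetical name, not a function from stem_model.py.

```python
# Hypothetical sketch of merging instances that share a model and stem.
def merge_instances(records):
    """records: iterable of (model, stem, ref) tuples.

    Return one output line per distinct (model, stem), with all refs
    for that pair joined by ':' in the order they were seen."""
    merged = {}
    for model, stem, ref in records:
        merged.setdefault((model, stem), []).append(ref)
    # Field separator is assumed to be a tab.
    return ['\t'.join([model, stem, ':'.join(refs)])
            for (model, stem), refs in merged.items()]
```

Feeding in the four akaca cases (three from L=138 and one from L=140) yields a single m_a line carrying 138,akaca:140,akaca, plus one line each for f_A and n_a.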

funderburkjim commented 6 years ago

counting the number of stem_model cases

The number of stem_model cases now in

m_a.txt is 82733 (compare to 49344 #4 )
n_a.txt is 65721 (compare to 31093 #7)
f_a1.txt is 51575 (compare to 17265 #8)

All in all, we are starting with 197176 cases in lexnorm-all2. The number remaining to be parsed is 32442. So in percentage terms, we've handled about 83% of the cases.
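
The percentage can be checked directly from the two counts given above; the variable names here are only illustrative.

```python
# Arithmetic check using the counts quoted in this comment.
total = 197176      # cases in lexnorm-all2
remaining = 32442   # cases not yet parsed
handled = total - remaining
percent_handled = 100.0 * handled / total
print(handled, round(percent_handled, 1))
```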

no change to declension algorithms by the work described in this issue.

We already know how to decline m_a, n_a, and f_A models.

gasyoun commented 6 years ago

By similar reasoning, we can understand that the current filter adds a new entry to f_a1.txt:

Yes, that's easy to understand as such.

Because

138 akaca   a-kaca  m:f:n
140 akaca   a-kaca  m

does not make sense for your generative purpose; it contains repetitive data.

So in percentage terms, we've handled about 83% of the cases.

Adore stats. Actually not only adore. I use your stats in my Sanskrit classes as well. Really. So do not stop.
