sanskrit-lexicon / hwnorm1

Headword normalization for Cologne dictionaries

A viable 'ending am -> a' normalization rule #12

Open funderburkjim opened 6 years ago

funderburkjim commented 6 years ago

In a recent discussion of the Sanskrit spelling normalization rules, one missing normalization rule was noticed.

The example involved the spelling 'vanam' of the AP (Apte 1957) dictionary. In most dictionaries, this is spelled 'vana', and in AP90 and SKD it is spelled 'vanaM'.

Since one purpose of normalization spelling rules is to limit the impact of minor differences in spelling conventions among the different dictionaries, it would be desirable to have a normalization rule which would yield the normalized form 'vana' for 'vanam'.

The most obvious candidate rule is ending am -> a. But this obvious rule would generate awkward, undesirable normalizations of the spelling of many roots (with or without prefixes) ending in am. There are many of these: gam, upagam, nam, yam, etc.

The vanam instance is a case of a neuter noun, where the vana normalization is definitely desirable.

Another category of am words are indeclinable forms related to nominal forms (I'm not sure of the correct grammatical way to think of these -- is it accusative singular interpreted as adverb?). Probably it is acceptable to drop the final 'm' in these also, for the purpose of normalization.

So the main category to exclude from a final am->a rule is the category of verbs.

The main purpose of this issue is to develop an enhancement to the normalization algorithm that will drop the final m on words ending in am, but only where it is desirable to do so.
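
As a minimal sketch of the kind of rule being sought, assuming the verb forms to protect are known in advance. The names `VERB_FORMS` and `normalize_final_am` are hypothetical and are not part of the existing hwnorm1 code; the sample set only holds the roots mentioned above.

```python
# Hypothetical illustration only; not the hwnorm1 implementation.
# A tiny sample of roots and prefixed roots that must keep their final 'm'.
VERB_FORMS = {"gam", "upagam", "nam", "yam"}

def normalize_final_am(word, verb_forms=VERB_FORMS):
    """Drop the final 'm' of a word ending in 'am', unless it is a known verb form."""
    if word.endswith("am") and word not in verb_forms:
        return word[:-1]
    return word

for w in ["vanam", "gam", "upagam", "vana"]:
    print(w, "->", normalize_final_am(w))
```

The whole difficulty is then moved into deciding what belongs in the exclusion set.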

In the current hwnorm1c.txt normalization summary, there are 11259 cases where the normalized form ends in am.

As a supplement to thinking about this enhancement, here is a list of those 11259 cases.

gasyoun commented 6 years ago

supplement to thinking

Supplementary thinking :+1:

funderburkjim commented 6 years ago

whitelisting gam, etc. is not the whole answer

The problem is prefixed verbs. We would need to whitelist all the prefixed verbs as well, e.g. vinam.

But what about avanam? There is a verb avanam = ava+nam, and there is also an adjective or noun avana related somehow to root av (present participle?).

And maybe there is even a derivation of avanam as a-vanam (non-forest?).
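
The avanam case suggests that a whitelist alone cannot settle every word. A rough sketch of how such collisions might be flagged for manual review, assuming we have a set of verb forms and a set of headwords found in other dictionaries. Both sets below are tiny made-up samples, and `classify_am_word` is a hypothetical helper.

```python
def classify_am_word(word, verb_forms, other_headwords):
    """Classify a word for the proposed 'am -> a' rule.

    'not-am'    : the word does not end in 'am'
    'drop'      : not a known verb form, so the final 'm' can be dropped
    'keep'      : a known verb form whose stripped spelling is not a headword elsewhere
    'ambiguous' : a known verb form whose stripped spelling is also a headword elsewhere
    """
    if not word.endswith("am"):
        return "not-am"
    if word not in verb_forms:
        return "drop"
    if word[:-1] in other_headwords:
        return "ambiguous"
    return "keep"

verb_forms = {"gam", "nam", "vinam", "avanam"}  # sample only
other_headwords = {"vana", "avana"}             # sample only

for w in ["vanam", "vinam", "avanam"]:
    print(w, classify_am_word(w, verb_forms, other_headwords))
```

With these samples, avanam is the only word reported as ambiguous, which is the situation described above.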

gasyoun commented 6 years ago

The problem is prefixed verbs. We would need to whitelist all the prefixed verbs as well

Here are all prefixed verbs from MW ending in -am.

acchāgam
atikram
atigam
atinam
atiprayam
atyatikram
atyākram
atyādham
atyutkram
adhikram
adhigam
adhinam
adhiyam
adhivikram
adhyākram
adhyāgam
anukam
anukram
anugam
anudham
anunam
anunikram
anuniśam
anuparāgam
anuparikram
anuprayam
anuyam
anuram
anuvikram
anuśam
anusaṃkram
anūtkram
antargam
antaryam
anvapakram
anvavakram
anvākram
anvāgam
anvācam
apakram
apagam
apadham
apanam
apigam
apidham
abhikam
abhikram
abhikṣam
abhigam
abhidham
abhinam
abhinikram
abhinirgam
abhiniśam
abhiniṣkram
abhiprakram
abhipraṇam
abhiram
abhivam
abhiśam
abhisaṃgam
abhisaṃdham
abhisaṃnam
abhisamāgam
abhisamāyam
abhyatikram
abhyapakram
abhyam
abhyavanam
abhyāgam
abhyāyam
abhyutkram
abhyupagam
am
araṃgam
araṃgam
avakram
avagam
avadham
avanam
avāgam
ākram
āgam
ācam
ātam
ādam
ānam
āprayam
āyam
āram
utkram
uttam
udāyam
udgam
uddam
udbhram
udyam
udram
udvam
unnam
upakram
upagam
upanam
upanigam
upaniśam
upaniṣkram
upaprayam
upabhram
upayam
uparam
upaśam
upaśram
upasaṃyam
upasaṃkram
upasaṃgam
upākram
upāgam
upāram
upāvanam
upāvaram
upotkram
upodyam
kam
kram
klam
kṣam
khaṇḍaśogam
gam
cam
cham
jam
jham
ḍam
tam
dam
dṛkpathamgam
dram
dvidhāgam
dham
nam
nikam
nikram
nigam
nitam
niyam
niram
nirākram
nirāyam
nirgam
nirṇam
nirdham
nirvam
nirvikram
niśam
niṣkram
nyāgam
parākram
parāgam
parāvam
parikram
pariklam
parigam
pariṇam
paritam
paribhram
pariyam
pariram
pariśram
paryāgam
prakram
pragam
praṇam
pratam
pratikram
pratigam
pratinam
pratinyāgam
pratiprayam
pratiyam
pratiram
prativiram
pratiśam
pratisaṃkram
pratyavagam
pratyākram
pratyāgam
pratyudgam
pratyudyam
pratyupakram
pratyupagam
prabhram
prayam
praram
praviśam
praśam
prodyam
pronnam
bhram
yam
ram
lam
vam
vikram
viklam
vigam
vidham
vinam
viniyam
vinirgam
vinirvam
viniśam
viniṣkram
viparikram
vipariṇam
vipragam
vibhram
viyam
viram
viśram
vyatikram
vyatigam
vyapakram
vyapagam
vyavagam
vyānam
vyāyam
vyutkram
vyuparam
vyupaśam
vyupāram
śam
śram
saṃyam
saṃram
saṃśam
saṃkram
saṃkṣam
saṃgam
saṃtam
saṃdham
saṃnam
saṃnigam
saṃniyam
saṃnirgam
saṃniśam
sam
stam
samatikram
samadhigam
samanukram
samanugam
samanuniśam
samabhikram
samabhigam
samabhyatikram
samabhyāgam
samabhyudgam
samam
samavagam
samākram
samāgam
samācam
samāyam
samutkram
samudāgam
samudgam
samudyam
samunnam
samupakram
samupagam
samupaśam
samupāgam
samparikram
samprakram
sampraṇam
samprayam
sambhram
stam
syam

funderburkjim commented 6 years ago

This is also a useful list. Will think how to use.

gasyoun commented 6 years ago

This is also a useful list.

Easy to make one. After your markup.

funderburkjim commented 6 years ago

Alternate solution #1: Change the meta-line k1 field in AP for words like 'vanam'

ap.txt has been converted to meta-line format. This permits us to consider another solution to the 'vanam' problem.

Namely, we can change the k1 field of the meta-line from 'vanam' to 'vana':

old:
<L>28407<pc>1386-1<k1>vanam<k2>vanam
proposed new:
<L>28407<pc>1386-1<k1>vana<k2>vanam
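
A small sketch of how such a k1 change could be scripted, assuming the k1 field is always immediately followed by the k2 field, as in the line above. The function name `set_k1` is hypothetical.

```python
import re

# Matches the k1 field up to the start of the k2 field.
K1_FIELD = re.compile(r"<k1>([^<]*)<k2>")

def set_k1(metaline, new_k1):
    """Return the meta-line with its k1 value replaced; k2 and other fields are untouched."""
    return K1_FIELD.sub(lambda m: "<k1>" + new_k1 + "<k2>", metaline, count=1)

old = "<L>28407<pc>1386-1<k1>vanam<k2>vanam"
print(set_k1(old, "vana"))
```

Only the meta-line is rewritten, so the digitized text of the entry is not affected.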

If this change were made, then it would have several consequences:

At the moment, I can't think of any downside to making this change in ap.txt.
This changes the metadata to facilitate access, but does not change the digitization of the text of ap.

We could use various techniques to discover other words in AP like 'vanam' and change their 'k1' similarly.

gasyoun commented 6 years ago

At the moment, I can't think of any downside to making this change in ap.txt.

Agree. @SergeA ?

funderburkjim commented 6 years ago

Alternate solution #2: add an 'alternate' headword

We have thus far used the 'alternate headword' technique to allow searching for alternate spellings given by the author. For example, in ap90, we have agatIka as an alternate headword for agatika:

agatīka [p= 0009-b] : (agatīka is an alternate of agatika.) a. 1 Helpless, without any resort or resource; bālamena- magatimādāya Dk. 9; daṃḍastvagatikā ga- tiḥ Y. 1. 346 the last resource or shift; agatīkā gatirhyeṣā pāpā rājopase- vinām . Mb. [L=185.01]


In our list of extra headwords for ap90, this appears with a 'type' of alt:

<L>185.01<pc>0009-b<k1>agatIka<k2>agatIka<type>alt<LP>185<k1P>agatika<ln1>2702<ln2>2709

make another type called norm.

We could adapt this technique to allow searching for alternate spellings given by us. For instance, we could add vana as an alternate headword of type norm.

Current entry of aphw.txt for vanam:
<L>28407<pc>1386-1<k1>vanam<k2>vanam<ln1>263501<ln2>263704
Proposed new entry of aphw_extra.txt for vana:
<L>28407.1<pc>1386-1<k1>vana<k2>vana<type>norm<LP>28407<k1P>vanam<ln1>263501<ln2>263704
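
A sketch of how the proposed extra record could be derived from the current one, assuming the field layout shown in the two lines above. The helper names `parse_hw` and `make_norm_record` are hypothetical, and the `.1` suffix is just the one used in the proposed entry (the ap90 example above uses `.01`), so it is left as a parameter.

```python
import re

def parse_hw(record):
    """Split a headword record such as '<L>28407<pc>1386-1<k1>...' into a dict of fields."""
    return dict(re.findall(r"<([A-Za-z0-9]+)>([^<]*)", record))

def make_norm_record(record, norm_spelling, suffix=".1"):
    """Build an extra headword record of type 'norm' pointing back to its parent record."""
    f = parse_hw(record)
    return ("<L>{L}{sfx}<pc>{pc}<k1>{n}<k2>{n}<type>norm"
            "<LP>{L}<k1P>{k1}<ln1>{ln1}<ln2>{ln2}").format(
        L=f["L"], sfx=suffix, pc=f["pc"], n=norm_spelling,
        k1=f["k1"], ln1=f["ln1"], ln2=f["ln2"])

current = "<L>28407<pc>1386-1<k1>vanam<k2>vanam<ln1>263501<ln2>263704"
print(make_norm_record(current, "vana"))
```

For the vanam record this reproduces the proposed aphw_extra.txt line shown above.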

Differences from solution 1

solution 2:
The entry would be available under TWO spellings: vanam and vana.
The displays would identify the 'vana' spelling as an alternate spelling, with a phrase such as (vana is a normalized spelling of vanam) [contrast to the agatIka comment above].
There would be TWO records in ap.xml.

solution 1:
The entry would be available under ONE spelling: vana.

gasyoun commented 6 years ago

We could adapt this technique to allow searching for alternate spellings given by us. For instance, we could add vana as an alternate headword of type norm.

Indeed, why not.

There would be TWO records in ap.xml.

Not much fun for calculations, but...

SergeA commented 6 years ago

entry would be available under TWO spellings: vanam and vana

Sounds good. As the original AP90 spelling is vanam, it would be preferable to keep it searchable.

funderburkjim commented 6 years ago

Not much fun for calculations, but...

and

As the original AP90 spelling is vanam, it would be preferable to keep it searchable.

We have two opinions here. Maybe each of you could expand your thinking. We are moving towards thinking about how to deal with accessing multiple dictionaries in a useful way, and we need to develop some principles to guide our efforts in these uncharted waters. Trying to formulate principles underlying these two views on how to handle vanam/vana might be helpful.

drdhaval2785 commented 3 years ago

I vote for solution 2. We need to have the ability to search for both vana and vanam, even if it comes at the cost of inflating the xml or sqlite file. It would not matter much, because the xml files are generated programmatically anyway and not manually maintained.