Removed another headword normalization from skd

funderburkjim commented 9 years ago

The headwords for nouns in SKD appear in the text in their nominative singular forms. So, rAmaH in SKD corresponds to 'rAma' in MW. Similarly also for neuter nouns (vanam in SKD = vana in MW). In an attempt to make word lookup in SKD comparable to MW, I had 'normalized' the SKD headwords by removing final visarga and 'm' on SKD headwords. This 'normalization' is now rescinded.

drdhaval2785 commented 9 years ago

If you remove the final visarga, how will you find a word like 'vapuH', which doesn't owe its visarga to the nominative singular case ending? There are two cases:

1. Visarga belongs to the case ending.
2. Visarga is derived from consonants like 's', 'r' at the end of the word.

Therefore mass deletion of visargas may prove counterproductive.

gasyoun commented 9 years ago

"counterproductive" is too harsh. Case 1 is 99%, I guess. But to spoil the 1% hurts for sure. Jim, could you please make a list of the words changed or to be changed? What we could really do is ask our friends from India to make the visarga stop list, so we do not shoot ourselves in the foot.

Shalu411 commented 9 years ago

Most Sanskrit people are used to Apte and MW. One simple solution is to take any one standard and apply it. I use Apte, and he seems logical and uniform throughout. I don't know anything about MW, so I can't say. For the time being, the visarga issue could be "solved" this way, before we can find lexicography experts (in Sanskrit) from India. Making a list is always a good idea. And let's mention it in the preface. Will this do?

drdhaval2785 commented 9 years ago

Same exercise here as was explained in issue 1

1. Prepare a list of words ending in 'M', 'H' in MW.
2. Prepare a list of words ending in 'r', 's' in MW.
3. Convert list 2 as follows: change the last 'r' to 'H' and the last 's' to 'H'. (This is because most of the words ending in 'r' or 's' can be converted to 'H' by grammatical rules of Sanskrit. There is inconsistency in most of the dictionaries whether they use 's'/'r' or 'H'. Example: पुनर्‌ - पुनः । वपुस्‌ - वपुः).
4. Merge lists 1 and 3.
5. Prepare a list of words ending in 'M', 'H' in SKD.

Then:

A. Keep the words in list 5 which are also seen in list 4 intact.
B. For the rest of the words, remove the 'M' / 'H' at the end.

This way we will be able to achieve 99.95% success in what we want to achieve
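
The list-based procedure above could be sketched in Python roughly as follows. This is only an illustration, not code from the Cologne repository: the function names `build_stop_list` and `strip_case_endings` and the toy word lists are hypothetical, and the sketch assumes that both headword lists are already in the same transliteration.

```python
def build_stop_list(mw_headwords):
    """Lists 1-4: forms whose final M/H belongs to the word, not the case ending."""
    # List 1: MW headwords that already end in 'M' or 'H'.
    list1 = {w for w in mw_headwords if w.endswith(("M", "H"))}
    # List 2: MW headwords ending in 'r' or 's'.
    list2 = {w for w in mw_headwords if w.endswith(("r", "s"))}
    # List 3: list 2 with the final 'r' / 's' rewritten as 'H'.
    list3 = {w[:-1] + "H" for w in list2}
    # List 4: the merged stop list.
    return list1 | list3


def strip_case_endings(skd_headwords, stop_list):
    """List 5 plus steps A and B: keep stop-listed words, strip final M/H from the rest."""
    result = {}
    for w in skd_headwords:
        if w.endswith(("M", "H")) and w not in stop_list:
            result[w] = w[:-1]   # step B: remove the final 'M' / 'H'
        else:
            result[w] = w        # step A (or a word not ending in M/H): leave intact
    return result


# Toy data for illustration only.
mw = ["rAma", "vana", "vapus", "punar"]
skd = ["rAmaH", "vanaM", "vapuH", "punaH"]
stop_list = build_stop_list(mw)
normalized = strip_case_endings(skd, stop_list)
```

With this toy data, 'vapuH' and 'punaH' stay intact because 'vapus' and 'punar' put them on the stop list, while 'rAmaH' and 'vanaM' lose their final letter.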

funderburkjim commented 9 years ago

To clarify the current status of visarga, et al. in headwords for SKD in the Cologne digitizations, it seems necessary to explain some background context. Some details of the comments are particular to SKD, but the general idea is relevant to all the digitizations at Cologne.

The 'digitization' is represented as a text file, skd.txt. The displays are based on skd.xml. You can download both of these from the downloads for skd.

The lines of skd.txt correspond to the lines of the underlying text. Here are the lines of skd.txt for two headwords; for later reference, I mention that the first 6 lines are lines 137266-137271 of skd.txt; and the last 4 lines are lines 137272-137275.

<HI>tanuH, tri, (tana + “bhRmRzIti |” uNAM 1 | 7 |
<>ityuH |) alpaH | viralaH | (yathA, manuH | 3 | 10 |
<>“tanulomakezadazanAM mRdbaGgImudbahet striyam ||”)
<>kRzaH | iti medinI | ne, 9 || (yathA, AryyA-
<>zaptazatyAm | 525 |
<>“vitarantI rasamantarmamArdrabhAvaM tanoSi tanugAtri ! ||”)
<HI>tanuH, [s] klI, (tanoti tanyate vA | tana +
<>“arttipRRvapiyajitanidhanitapibhyo nit |” uNAM
<>2 | 118 | iti usiH sa ca nit |) zarI-
<>ram | ityuNAdikoSaH ||

Here is a snippet of the corresponding scanned image:

You see that the digitization adds the markup <HI> at the two lines that are 'outdented' in the scan, and the markup <> for the 'indented' lines. For skd, that's essentially all the markup that is added. The rest of the digitization consists of a representation of the Devanagari text in Harvard-Kyoto transliteration.

Now, we know skd is a dictionary, and a dictionary is organized by headwords. So, how can a computer program group the lines of skd.txt into groups corresponding to headwords? Well, lines of skd starting with <HI> identify the headword, and (at least for these two examples) the headword consists of the letters up to the comma following the <HI>. So, 'tanuH' for both these examples. So, our little example comprises two headwords. A program constructs a list of these headwords and their corresponding lines, and puts this list into a file, skdhw0.txt. In our case, the two lines are

2-583:tanuH:137266,137271
2-583:tanuH:137272,137275

(the 2-583 is the page number of the scans.)
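
As a rough illustration of this grouping step, here is a minimal Python sketch. It is not the actual Cologne program: the name `group_headwords` is hypothetical, and because the page number is not present in the text lines themselves, '2-583' is simply hard-coded for the example.

```python
import re

# A headword line starts with <HI>; the headword runs up to the first comma.
HW_LINE = re.compile(r"^<HI>([^,]*)")


def group_headwords(lines, first_lnum=1):
    """Return (headword, first_line, last_line) for each <HI> entry."""
    entries = []
    for lnum, line in enumerate(lines, start=first_lnum):
        match = HW_LINE.match(line)
        if match:
            # A new entry begins on this line.
            entries.append([match.group(1), lnum, lnum])
        elif entries:
            # A continuation line extends the current entry.
            entries[-1][2] = lnum
    return [tuple(e) for e in entries]


# The ten lines quoted above (lines 137266-137275 of skd.txt).
sample = [
    "<HI>tanuH, tri, (tana + “bhRmRzIti |” uNAM 1 | 7 |",
    "<>ityuH |) alpaH | viralaH | (yathA, manuH | 3 | 10 |",
    "<>“tanulomakezadazanAM mRdbaGgImudbahet striyam ||”)",
    "<>kRzaH | iti medinI | ne, 9 || (yathA, AryyA-",
    "<>zaptazatyAm | 525 |",
    "<>“vitarantI rasamantarmamArdrabhAvaM tanoSi tanugAtri ! ||”)",
    "<HI>tanuH, [s] klI, (tanoti tanyate vA | tana +",
    "<>“arttipRRvapiyajitanidhanitapibhyo nit |” uNAM",
    "<>2 | 118 | iti usiH sa ca nit |) zarI-",
    "<>ram | ityuNAdikoSaH ||",
]

groups = group_headwords(sample, first_lnum=137266)
records = [f"2-583:{hw}:{start},{end}" for hw, start, end in groups]
```

For the sample, `records` holds the same two lines as the skdhw0.txt excerpt above.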

Bear with me, I know this is tedious. The file is named 'skdhw0' since there are going to be two additional variations of this basic headword list. There is no 'normalization' of any kind in skdhw0. Also, this file (like the others mentioned below) is included in the skdxml download.

The next version of this headword list, skdhw1.txt, is where the normalizations occur. All the normalizations are detailed in an ancillary file skdhw1_note.txt. In terms of Python code, here are the normalization rules currently in use (the bottom three are NOT currently used)

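 # NOTE: 'hw' is the headword from skdhw0. It's the part, X, in <HI>X, (between <HI> and the first comma).
 import re
 hw = re.sub(r'D2','D',hw)       # replace 'D2' with 'D'
 hw = re.sub(r'MM','M',hw)       # collapse a doubled 'MM' to a single 'M'
 hw = re.sub(r'M$','m',hw)       # final anusvara (M) becomes 'm'
 hw = re.sub(r' .*$','',hw)      # drop everything from the first space onward
 hw = re.sub(r'\(.*?\)','',hw)   # remove each parenthesized group, e.g. '(ve)'
 hw = re.sub(r"'",'',hw)         # remove apostrophes
 hw = re.sub(r';','',hw)         # remove semicolons
 hw = re.sub(r'\(','',hw)        # remove any remaining '('
 #hw = re.sub(r'[mHM]$','',hw)  # July 19 - removed this and next two normalizations
 #hw = re.sub(r'r(.)\1',r'r\1',hw)  # (disabled) collapse a doubled letter after 'r'
 #hw = re.sub(r'R(.)\1',r'R\1',hw)  # (disabled) collapse a doubled letter after 'R'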

There are about 8000 normalizations, in a total of about 40,000 identified headwords.
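
For illustration, the rules that are still active can be wrapped in a function and tried on headwords quoted in this thread. The function name `normalize_headword` is hypothetical; the substitutions and their order are those listed above.

```python
import re


def normalize_headword(hw):
    """Apply the currently active skdhw1 rules, in the order listed above."""
    hw = re.sub(r'D2', 'D', hw)
    hw = re.sub(r'MM', 'M', hw)
    hw = re.sub(r'M$', 'm', hw)        # final anusvara -> m
    hw = re.sub(r' .*$', '', hw)       # drop text after the first space
    hw = re.sub(r'\(.*?\)', '', hw)    # drop parenthesized variants
    hw = re.sub(r"'", '', hw)
    hw = re.sub(r';', '', hw)
    hw = re.sub(r'\(', '', hw)
    return hw


examples = ["aMzakaM", "aMzumatI strI", "kube(ve)raH", "tanuH"]
results = {hw: normalize_headword(hw) for hw in examples}
```

Since the 'remove final [mHM]' rule is disabled, 'tanuH' passes through unchanged.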

The most prevalent normalization seems to be due to the rule that changes a final anusvara (M) to m. Another one deals with cases like

1-001:  'aMzumatI strI' => 'aMzumatI'  :115,118

Here, there is no 'comma' following aMzumatI, but the 'strI' is not part of the headword.

So, the details of the construction of skdhw1 need to be examined. Ideally, a Sanskrit expert would look at all the cases, and see if any were mistaken.

Also, there are cases like

2-144:  'kube(ve)raH' => 'kuberaH'  :71062,71072

where the author no doubt meant that there were two acceptable spellings of a given word. Currently, no use is made of the 'kuveraH' variant.
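
If the variant were to be used, one possible approach is sketched below. This is purely a heuristic and an assumption on my part, not anything the current software does: it supposes that the parenthesized text replaces the same number of characters immediately before it, which works for 'kube(ve)raH' but may fail when the two spellings differ in length in transliteration. The function name `expand_variants` is hypothetical, and only the first parenthesized group is handled.

```python
import re


def expand_variants(hw):
    """Return [primary, variant] for a headword like 'kube(ve)raH'.

    Assumption: the parenthesized text replaces the same number of
    characters directly before the opening parenthesis.
    """
    match = re.search(r'\((.*?)\)', hw)
    if match is None:
        return [hw]
    alt = match.group(1)
    before, after = hw[:match.start()], hw[match.end():]
    primary = before + after
    # Swap the last len(alt) characters of 'before' for the alternative.
    keep = max(0, len(before) - len(alt))
    variant = before[:keep] + alt + after
    return [primary, variant]


spellings = expand_variants("kube(ve)raH")
```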

Now, the next form of the headwords is skdhw2. It is constructed from skdhw1 by converting the Harvard-Kyoto spellings to slp1 spellings. There are various reasons for doing this, which I can discuss separately if anyone is interested.
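
A minimal sketch of such a conversion is given below. The mapping table is the usual Harvard-Kyoto to SLP1 correspondence as I understand it, not a copy of the Cologne conversion code, and the name `hk_to_slp1` is hypothetical. The sketch uses longest-match tokenization, so sequences such as 'ai', 'au', 'kh' or 'lR' are always read as single HK units; letters that are the same in both schemes pass through unchanged.

```python
import re

# Only the HK units whose SLP1 spelling differs are listed.
HK_TO_SLP1 = {
    "lRR": "X", "RR": "F", "lR": "x", "R": "f",
    "ai": "E", "au": "O",
    "kh": "K", "gh": "G", "G": "N",
    "ch": "C", "jh": "J", "J": "Y",
    "Th": "W", "T": "w", "Dh": "Q", "D": "q", "N": "R",
    "th": "T", "dh": "D",
    "ph": "P", "bh": "B",
    "z": "S", "S": "z",
}

# Longest alternatives first, so 'RR' wins over 'R' and 'th' over 't' + 'h'.
_TOKEN = re.compile(
    "|".join(sorted(map(re.escape, HK_TO_SLP1), key=len, reverse=True))
)


def hk_to_slp1(text):
    """Convert in a single pass, so G->N and N->R cannot interfere."""
    return _TOKEN.sub(lambda m: HK_TO_SLP1[m.group(0)], text)


converted = [hk_to_slp1(w) for w in ("bhRmRzIti", "uNAM", "yathA", "tanoSi", "tanuH")]
```

A single-pass substitution matters here because several letters swap roles between the two schemes (for example HK 'z' is SLP1 'S' and HK 'S' is SLP1 'z'), so chained replacements would corrupt the text.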

skdhw2.txt is considered the final form of the headwords. The xml form of the digitization, skd.xml, is made from skdhw2.txt and skd.txt. Here is what our two examples look like in skd.xml:

<H1><h><key1>tanuH</key1><key2>tanuH</key2></h><body><HI/><s>tanuH, tri, (tana + “BfmfSIti .” uRAM 1 . 7 .</s><lb/><s>ityuH .) alpaH . viralaH . (yaTA, manuH . 3 . 10 .</s><lb/><s>“tanulomakeSadaSanAM mfdbaNgImudbahet striyam ..”)</s><lb/><s>kfSaH . iti medinI . ne, 9 .. (yaTA, AryyA-</s><lb/><s>SaptaSatyAm . 525 .</s><lb/><s>“vitarantI rasamantarmamArdraBAvaM tanozi tanugAtri ! ..”)</s></body><tail><L>14326</L><pc>2-583</pc></tail></H1>
<H1><h><key1>tanuH</key1><key2>tanuH</key2></h><body><HI/><s>tanuH, [s] klI, (tanoti tanyate vA . tana +</s><lb/><s>“arttipFvapiyajitaniDanitapiByo nit .” uRAM</s><lb/><s>2 . 118 . iti usiH sa ca nit .) SarI-</s><lb/><s>ram . ityuRAdikozaH ..</s></body><tail><L>14327</L><pc>2-583</pc></tail></H1>

In particular, the content of the key1 element is 'the' headword used in the displays. All the Sanskrit is changed into slp1 transliteration. Note that all the original text of the digitization is also present in the xml records, within the <body> element.
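
To show how these pieces can be read back out of a record, here is a small sketch using Python's standard `xml.etree.ElementTree`. The record is an abbreviated copy of the second example (only the first and last text lines of the body are kept), and the function name `summarize` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the second <H1> record shown above.
record = (
    "<H1><h><key1>tanuH</key1><key2>tanuH</key2></h>"
    "<body><HI/><s>tanuH, [s] klI, (tanoti tanyate vA . tana +</s><lb/>"
    "<s>ram . ityuRAdikozaH ..</s></body>"
    "<tail><L>14327</L><pc>2-583</pc></tail></H1>"
)


def summarize(record_xml):
    """Extract the headword, record number, page and body lines."""
    root = ET.fromstring(record_xml)
    return {
        "key1": root.findtext("h/key1"),    # headword used in the displays
        "L": root.findtext("tail/L"),       # record number
        "pc": root.findtext("tail/pc"),     # page of the scan
        "lines": [s.text for s in root.iter("s")],  # original text lines
    }


info = summarize(record)
```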

Note in particular that 'tanuH' is the "correct" headword spelling currently.

The above discussion does not aim to answer all the questions about how the displays SHOULD work; but rather to discuss in detail how the displays DO work currently. In particular, handling of the situation of differences in convention (Apte/SKD v. MW) in headword spelling may require another 'layer' in the software.

funderburkjim commented 9 years ago

Re: "how will you find a word like 'vapuH' "

That problem was one reason for excluding 'remove final H' from the normalization rules.

This case is interesting, when thinking about correspondences to MW, in that MW uses the spelling 'vapus'. There are also the homorganic nasal differences.

What needs to be done first, I think, is for a Sanskrit expert to review 'skdhw1_note.txt', and find errors in the currently remaining normalizations.

gasyoun commented 9 years ago

I ask for a list of headwords with (), like kube(ve)raH. Because the task has already been solved once on Vacaspatyam manually by Shalu. That is the only way - if we do some "reconstruction" or "modernization", this should be definitely a part of it.

funderburkjim commented 9 years ago

Regarding "list of headwords with (), like kube(ve)raH":

The file skdhw1_note.txt of today, July 23, 2014, is now in this skd repository; the link is https://github.com/sanskrit-lexicon/SKD/blob/master/skdhw1_note.txt. It has all the 'headword normalizations' for skd.

I presume you can readily filter this list down to the 350 or so cases matching the criterion you mention, but let me know otherwise.

You read the lines as:

1-001:  'aMzakaM' => 'aMzakam'  :76,76

On page 001 of volume 1, there is a headword originally spelled 'aMzakaM' whose spelling has been normalized to 'aMzakam'. The text for this headword comprises lines 76-76 of skd.txt (just 1 line in this example).
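
A sketch of how such lines might be parsed and filtered down to the parenthesized cases is given below. The regular expression is inferred from the three sample lines quoted in this thread, so it is an assumption about the file format rather than a specification; the names `parse_note_line` and `NOTE_LINE` are hypothetical.

```python
import re

# page:  'original' => 'normalized'  :first,last
NOTE_LINE = re.compile(r"^(\d+-\d+):\s+'(.*)' => '(.*)'\s+:(\d+),(\d+)$")


def parse_note_line(line):
    """Return a dict for one skdhw1_note.txt line, or None if it does not match."""
    match = NOTE_LINE.match(line.strip())
    if match is None:
        return None
    page, original, normalized, first, last = match.groups()
    return {
        "page": page,
        "original": original,
        "normalized": normalized,
        "first": int(first),
        "last": int(last),
    }


sample = [
    "1-001:  'aMzakaM' => 'aMzakam'  :76,76",
    "1-001:  'aMzumatI strI' => 'aMzumatI'  :115,118",
    "2-144:  'kube(ve)raH' => 'kuberaH'  :71062,71072",
]
parsed = [parse_note_line(line) for line in sample]

# Keep only headwords that contain a parenthesized variant.
with_parens = [p for p in parsed if p and "(" in p["original"]]
```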

I don't understand the reference "the task has already been solved once on Vacaspatyam manually by Shalu".

gasyoun commented 9 years ago

The list is easy enough to work with. Would love to know Dhaval's thoughts on it.

drdhaval2785 commented 9 years ago

I give this Sanskrit maxim in response:

"रलयोर्डलयोस्तद्वज्जययोर्वबयोरपि । शसयोर्मनयोश्चान्त्ये सविसर्गाविसर्गयोः ॥ सबिन्दुकाबिन्दुकयोः स्यादभेदेन कल्पनम्‌ ।" इति कुट्टनीमतस्य रसदीपिकाव्याख्या (१८तमः श्लोकः) । This means that in the Sanskrit pronunciation system, and for the yamaka / shlesha alankAras, the following are treated as equivalent (in HK protocol): r=l, D=l, j=y, v=b, z=s; m=n when at the end of a word; visarga = no visarga when at the end of a word; sabinduka = abinduka.

So, it is not wrong to have 'b' in place of usual 'v'. It is only odd.

sanskrit-lexicon / SKD

Removed another headword normalization from skd #3