vcp-skd1 comparison, part 2

funderburkjim commented 4 years ago

This continues the root-matching exercise discussed in #9.

The programs and reports are in the vcp_skd1 directory. To see html files as html in your browser, you will need to download the raw files and open the downloaded files in your browser.

vcp_ecs.txt: the current equivalence classes of root entries for VCP; vcp_ecs_deva.txt: Devanagari version
skd_ecs.txt: the current equivalence classes of root entries for SKD; skd_ecs_deva.txt: Devanagari version
vcp_skd_ec_map.txt: current mapping between vcp equivalence classes and skd equivalence classes; vcp_skd_ec_map_deva.txt: Devanagari version
vcp_skd_ec_verb2.html: current mapping between vcp equivalence classes and skd equivalence classes, with extracts from entries; vcp_skd_ec_verb2_deva.html: Devanagari version

funderburkjim commented 4 years ago

Why equivalence classes?

In the previous vcp-skd work, discussed in #9, we ran into a limitation. The example (ref) VCP उज्झ 8722 = SKD उद्झ 4713, SKD उज्झ 4431 could not be handled properly. Here SKD has two different headword spellings that should both correspond to the single VCP headword spelling. This left SKD उद्झ unmatched.

The equivalence class notion aims to handle this and other similar, but not yet identified, cases; and thereby paint a truer picture of the correspondence between verbs in VCP and verbs in SKD.

basic idea

The notion of 'equivalence classes' is a useful concept in many areas of mathematics. For example, integers can be constructed as equivalence classes of pairs of natural numbers (see).

In our situation, we start with the set of entries from the Cologne digitization of a particular dictionary. Each entry has a specific Cologne id, which distinguishes that entry from all other entries in the particular dictionary. And each entry also has a particular headword spelling.

Next, in our study of verbs, we use some method to determine which of the entries of the dictionary corresponds to a verb. This leads to a list of verb entries.

Example 1: skd verb entries.
Example 2: vcp verb entries.

entry equivalence by headword

We have silently assumed that two verb entries from a particular dictionary are equivalent if they have the same headword spelling. Obviously the author of a particular dictionary had some reason for providing, for some headwords, more than one entry. But these reasons are not systematic and are difficult to infer, even when the two entries are of the same grammatical type (e.g. when both entries are verbs). Thus we have made the reasonable assumption that all (verb) entries with the same headword should be considered equivalent.

In applying this to VCP, we get the equivalence classes of vcp_ecs.txt (and vcp_ecs_deva.txt).

Consider the vcp equivalence classes (vcp_ecs.txt). An entry is a pair (headword,cologne id). For headword 'aMSa' there is only one verb entry with this spelling, with cologne id=4: aMSa,4. But note 'aca' appears as the headword for 3 entries: aca,604;aca,605;aca,606. By searching for semicolon character, we see that 592 of the VCP equivalence classes have multiple entries.
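The grouping described above can be sketched in a few lines of Python. This is only an illustration, not the actual program in the vcp_skd1 directory. The function names `build_classes` and `format_class` are hypothetical, and the `headword,id;headword,id` line format is inferred from the examples quoted above.

```python
from collections import defaultdict

def build_classes(entries):
    """Group (headword, cologne_id) pairs into classes keyed by headword."""
    classes = defaultdict(list)
    for headword, cologne_id in entries:
        classes[headword].append((headword, cologne_id))
    return dict(classes)

def format_class(members):
    """Render one class in the 'headword,id;headword,id' style of vcp_ecs.txt."""
    return ";".join(f"{hw},{cid}" for hw, cid in members)

# The two VCP examples mentioned in the text.
entries = [("aMSa", 4), ("aca", 604), ("aca", 605), ("aca", 606)]
classes = build_classes(entries)
lines = [format_class(members) for members in classes.values()]

# A class with more than one entry is recognizable by the semicolon.
multi = [ln for ln in lines if ";" in ln]
```

Counting the lines that contain a semicolon, as done for `multi`, is the same check as the manual search described above.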

funderburkjim commented 4 years ago

equivalence classes with different headwords

We are now in a position to modify the equivalence classes for a given dictionary. We do this with a manually maintained file for each dictionary. In the case of SKD, this file is skd_ecs_manual.txt. Currently this file has just 2 entries:

udJa,4713;ujJa,4431
staBa,40439;stamBa,40448

From skd_verb_filter.txt, we know that there is just one entry with headword 'udJa' (search term k1=udJa,) and just one entry with headword 'ujJa'. Now, @Shalu411 determined that we should think of these two entries in SKD as the same. Thus, we merge the entries for udJa and ujJa to get a new equivalence class udJa,4713;ujJa,4431.

Similarly, I determined that we should think of the entries for 'staBa' and 'stamBa' as equivalent. Hence the new equivalence class for SKD of staBa,40439;stamBa,40448.
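As a rough sketch of how such manual lines could be applied, the code below merges every existing class that shares an entry with a manual line. The function names are hypothetical, and the real program may work differently (for example, by matching on headword rather than on the headword,id pair).

```python
def parse_class(line):
    """Parse 'udJa,4713;ujJa,4431' into [('udJa', 4713), ('ujJa', 4431)]."""
    members = []
    for item in line.strip().split(";"):
        headword, cologne_id = item.split(",")
        members.append((headword, int(cologne_id)))
    return members

def apply_manual_merges(classes, manual_lines):
    """Merge all classes that share an entry with a manual line."""
    classes = [list(c) for c in classes]
    for line in manual_lines:
        wanted = set(parse_class(line))
        touched = [c for c in classes if wanted & set(c)]
        untouched = [c for c in classes if not (wanted & set(c))]
        merged = sorted({m for c in touched for m in c})
        classes = untouched + ([merged] if merged else [])
    return sorted(classes)

# Headword-based SKD classes before merging (ids taken from the text).
skd_classes = [
    [("udJa", 4713)],
    [("ujJa", 4431)],
    [("staBa", 40439)],
    [("stamBa", 40448)],
    [("aBa", 1638)],
]
manual = ["udJa,4713;ujJa,4431", "staBa,40439;stamBa,40448"]
merged_classes = apply_manual_merges(skd_classes, manual)
```

With an empty list of manual lines, as is currently the case for VCP, the classes come back unchanged apart from sorting.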

There is also a file for VCP: vcp_ecs_manual.txt. Currently this file is empty, which implies that currently we consider all the distinct verb spellings in VCP to be non-equivalent.

As @Shalu411 continues studying the non-matching VCP and SKD verbs, I anticipate that we will add several more equivalences for SKD and perhaps a few for VCP.

funderburkjim commented 4 years ago

Matching VCP and SKD equivalences manually

The main focus of this research is to match VCP and SKD verbs. By considering equivalence classes, this research task reduces to matching VCP equivalence classes to SKD equivalence classes.

In my file-naming conventions, I use the term 'map' instead of 'match'. This is because I think of matching two sets as constructing a functional map between the two sets.

The mapping between vcp equivalence classes and skd equivalence classes is presented in two report forms. Each form shows all the classes from both dictionaries. The classes are ordered alphabetically.

short form of matching report

This report form is vcp_skd_ec_map.txt (or vcp_skd_ec_map_deva.txt).

Each line of this report shows a vcp equivalence class and an skd equivalence class, and asserts that these two classes match. For example: vcp=ujJa,8722 skd=udJa,4713;ujJa,4431 (*).

When there is no match for a vcp equivalence class, the skd equivalence class shows as '?'. For example: vcp=uras,10045 skd=? There are 148 of these.

When there is no match for an skd equivalence class, the vcp equivalence class shows as '?'. For example: vcp=? skd=atwaNa,706. There are 69 of these.

One further annotation is (*). This means that, in the matching, there is some difference in the headword spelling between VCP and SKD. For example, vcp=amBa,3980 skd=aBa,1638 (*). There are 72 of these annotations currently. Note that the 'ujJa' example above also shows this (*) annotation; this is because the skd class also has a headword spelling (udJa) that differs from the vcp headword spelling (ujJa).
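For anyone who wants to tally these cases themselves, here is a minimal sketch of a reader for the short report. The line format is inferred from the examples above. The helper names are hypothetical, and the `spelling_differs` rule (the sets of headword spellings differ) is an assumption about how the (*) annotation is derived, not a statement of the actual program logic.

```python
import re

# One report line: vcp=<class or ?> skd=<class or ?> with an optional (*).
LINE_RE = re.compile(r"^vcp=(\S+)\s+skd=(\S+)(?:\s+(\(\*\)))?$")

def parse_map_line(line):
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized report line: {line!r}")
    vcp, skd, star = m.groups()
    return {
        "vcp": None if vcp == "?" else vcp,
        "skd": None if skd == "?" else skd,
        "starred": star is not None,
    }

def headwords(ec):
    """Set of headword spellings in a class such as 'udJa,4713;ujJa,4431'."""
    return {item.split(",")[0] for item in ec.split(";")}

def spelling_differs(rec):
    """Assumed rule for (*): both sides present, headword sets not equal."""
    if rec["vcp"] is None or rec["skd"] is None:
        return False
    return headwords(rec["vcp"]) != headwords(rec["skd"])

sample = [
    "vcp=ujJa,8722 skd=udJa,4713;ujJa,4431 (*)",
    "vcp=uras,10045 skd=?",
    "vcp=? skd=atwaNa,706",
    "vcp=amBa,3980 skd=aBa,1638 (*)",
    "vcp=aBra,3722 skd=aBra,1803",
]
records = [parse_map_line(ln) for ln in sample]
vcp_only = sum(1 for r in records if r["skd"] is None)
skd_only = sum(1 for r in records if r["vcp"] is None)
starred = sum(1 for r in records if r["starred"])
```

Run over the full vcp_skd_ec_map.txt, the three counters would be expected to reproduce the totals quoted above, provided the file really follows this line format.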

funderburkjim commented 4 years ago

long form of matching report.

This is report vcp_skd_ec_verb2.html and the Devanagari version vcp_skd_ec_verb2_deva.html. This report is in the form of an html document, so you'll need to download the raw form and then open the downloaded html file in the browser.

The long form report contains all the information of the short report.

additional information of long report

The long report has some detail from the underlying dictionary entries.

Note the ... etc. etc. etc. .... This means that there is more information in the dictionary for this entry.

funderburkjim commented 4 years ago

Mapping principle

There are currently two methods of matching -- a 'manual' method and a 'general' method.

The 'manual' method uses a file of headword spelling correspondences: vcp_skd_map.txt.

For example garba:garbba means that the equivalence class in vcp containing headword spelled 'garba' is asserted to match the equivalence class in skd containing headword spelled 'garbba'.

These correspondences were developed mostly by me; I think @Shalu411 found some of them.

The 'general' method uses the rule: Given an equivalence class ec1 for vcp and an equivalence class ec2 for skd, if there is a headword spelling 'X' in both ec1 and ec2, then ec1 is asserted to match ec2. This rule is harder to verbalize than it is to use. Example: vcp=aBra,3722 skd=aBra,1803. These two equivalence classes match because the verb spelling 'aBra' is common to both.
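The two methods can be combined roughly as in the sketch below. It is a hedged illustration rather than the actual program: `match_classes` is a hypothetical name, trying the manual correspondences before the general rule is an assumption, and the ids given for 'garba' and 'garbba' are placeholders (0) because the text does not state them.

```python
def parse_manual_line(line):
    """Parse a vcp_skd_map.txt style line such as 'garba:garbba'."""
    vcp_hw, skd_hw = line.strip().split(":")
    return vcp_hw, skd_hw

def match_classes(vcp_classes, skd_classes, manual_lines):
    """Return {vcp class index: skd class index or None}."""
    # Index every skd headword spelling by the class that contains it.
    skd_by_hw = {hw: i for i, ec in enumerate(skd_classes) for hw, _ in ec}
    manual = dict(parse_manual_line(ln) for ln in manual_lines)
    result = {}
    for vi, ec in enumerate(vcp_classes):
        spellings = sorted({hw for hw, _ in ec})
        found = None
        # 'manual' method: an explicit headword correspondence.
        for hw in spellings:
            target = manual.get(hw)
            if target in skd_by_hw:
                found = skd_by_hw[target]
                break
        # 'general' method: a spelling shared by both classes.
        if found is None:
            for hw in spellings:
                if hw in skd_by_hw:
                    found = skd_by_hw[hw]
                    break
        result[vi] = found
    return result

vcp_classes = [
    [("garba", 0)],      # placeholder id
    [("aBra", 3722)],
    [("ujJa", 8722)],
    [("uras", 10045)],
]
skd_classes = [
    [("garbba", 0)],     # placeholder id
    [("aBra", 1803)],
    [("udJa", 4713), ("ujJa", 4431)],
]
matches = match_classes(vcp_classes, skd_classes, ["garba:garbba"])
```

In this sketch 'uras' maps to None, which corresponds to a `skd=?` line in the short report.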

funderburkjim commented 4 years ago

Next steps

I think the next step is to continue the comparison of non-matches, using the two 'ec' mapping reports. This will likely turn up more examples like 'ujJa'. For example, I think 'drA'/'drE' is such an example.

There are also some other kinds of cases which probably should match. It might be that the anusvara in skd is a digitization error.

vcp=hiqa,48054 skd=?
vcp=? skd=hiqaM,41771
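One way to surface more pairs like this is to scan the two unmatched lists for headwords that differ only by a final anusvara (SLP1 'M'). The helper below is a hypothetical sketch, not part of the existing programs, and its output would only be a list of candidates for manual review.

```python
def anusvara_candidates(vcp_unmatched, skd_unmatched):
    """Pair each unmatched vcp headword X with an unmatched skd headword XM."""
    skd_set = set(skd_unmatched)
    return [(hw, hw + "M") for hw in vcp_unmatched if hw + "M" in skd_set]

candidates = anusvara_candidates(["hiqa", "uras"], ["hiqaM", "atwaNa"])
```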

@Shalu411 : the ball is now in your court! Hope I've given you enough material to proceed.

Shalu411 commented 4 years ago

Hariom Jim It's 29 days that ball has been in my court. I had shifted my house and many things were there to attend.. So I couldn't attend to this dhatu issue. Hope these are the files that I should be working upon now on-

This is report vcp_skd_ec_verb2.html and the Devanagari version vcp_skd_ec_verb2_deva.html This report is in the form of an html document, so you'll need to download the raw form and then open the download html file in the browser.

gasyoun commented 4 years ago

Hope these are the files that I should be working upon now on-

So do I, our Bangalore Sanskrit scholar.

funderburkjim commented 4 years ago

@Shalu411 The main report is referred to as the 'long form of matching report', mentioned in the comment above.

The report file is named 'vcp_skd_ec_verb2_deva.html'. To get this file:

First download the file by:
- right click and 'save link as' this link: vcp_skd_ec_verb2_deva.html
Then open the downloaded file with your browser.

Your main task is to find how to resolve the '=?' cases of this report. For example, with 'vcp=अंह, skd=?'

Is there some verb entry in skd that we should match to the verb अंह of vcp ? If so, how is that skd verb spelled?
Or is अंह a verb unique to vcp ?

Shalu411 commented 3 years ago

Namaste Have started again. I heard somewhere that it does not matter how many times you fall down, ultimately what matters is-whether you could manage to get up or not! In the similar lines, whether I restart or not matters. So here is the first interesting case- Issue-1 - vcp=अट्ट, skd=अट्ट skd-vcp-1-aTTa

There are two dhatus in vcp and only one in skd in this list. But there are actually two in skd too. But it got merged as a part of another headword. (Showed in png file) skd-1-aTTa :)

Shalu411 commented 3 years ago

Issue 2- vcp=अन्चु, skd=?

vcp k1=अन्चु, L=2185 = skd k1=अन्च, L=1215 skd-vcp-2-aYcu

Compare- VCP अन्चु¦ गतौ अचिवत् ८२ पृ० दृश्यम् VS SKD अन्च¦ उ पूजने । गमने । म्लिष्टोक्तौ । This seems only to be style of presentation- See the ु in VCP and उ in SKD. skd-vcp-2-aYcu2

Shalu411 commented 3 years ago

Off-line issue of Typos- k1=अन्च, L=2184 अन्च¦ व्य क्तौ चु० The two letters is actually one word- व्यक्तौ . It is a typo- to have space in between. What to do with these cases as of now? I see this in the html doc. list vcp_skd_ec_verb2_deva.html . Some more are there

Shalu411 commented 3 years ago

Namaste. The comparison is on. First set of VCP with SKD is going on. The picks so far:

VCP अन्दोल 2361 = SKD आन्दोल 3499
VCP अन्ध 2362 = SKD अन्धं 1295
VCP अट्ट, L=792 = SKD (merged with previous) अट्ट क तौच्छ्ये । अनादरे ।
VCP अन्चु 2185 = SKD अन्च 1215
VCP इङ् 8100 = SKD इ 4017
VCP उर्द्द 10084 = SKD उर्द 5161
VCP उलड 10105 = SKD ओलड 5661
VCP ऋन्फ 10512 = SKD ऋम्फ 5409
VCP ऋश 10520 = SKD ऋश 5410
VCP कद्ड 11690 = SKD कद्ड 6169
VCP कन 11700 = SKD कन 6176

Others are not found. I am not noting them separately when match is not found. --Hariom

gasyoun commented 3 years ago

noting them separately when match is not found.

Right, that's the way to do.

funderburkjim commented 3 years ago

@Shalu411

Re अटाट्या : Do you suggest that SKD has a print error at अट्ट क तौच्छ्ये -- the error being that this should start a new headword?

Also, how to translate क तौच्छ्ये ?

funderburkjim commented 3 years ago

VCP अन्ध 2362 = SKD अन्धं 1295

Disagree. SKD अन्धं 1295 is a nominal while VCP अन्ध 2362 is a verb.

VCP अन्ध 2363 and 2388 are nominals.

Comparing texts, I think VCP अन्ध 2363 corresponds to SKD अन्धं 1295

funderburkjim commented 3 years ago

How is the list above derived ?

What is your method?

what things are you looking at (what files and or displays)?

What are the criteria for putting something in the list?

Are all the items supposed to be verbs?

Why is VCP ऋश 10520 = SKD ऋश 5410 in the list (even though the spelling is same in VCP and SKD)?

Shalu411 commented 3 years ago

Namaste

Disagree. SKD अन्धं 1295 is a nominal while VCP अन्ध 2362 is a verb.

Am extremely sorry! It was wrongly put there. It is actually meant to express the third case here. But it's wrong to put so!

Do you suggest that SKD has a print error at अट्ट क तौच्छ्ये -- the error being that this should start a new headword?

Yes. It is a new headword. But so is it given in the book printed also. (Image attached) aTTa-SKD

Also, how to translate क तौच्छ्ये ?

It is same as the other dhatu entries. We need not translate it. It is the internal dhatu detail given by the author. We don't take it in headword-dhatu.

How is the list above derived ?

From cross-checking each entry back in the dictionary, both the digital form on the website and the printed book scan.

What is your method?

I take case by case, one at a time. Type the word in the search box in advanced mode --> check around the words up and down --> open the printed book page through adjacent words and double check if the dhatu is around anywhere --> check for the dhatu meaning word or first-form (the ति form); if not found, then confirm it as a "no-match".

what things are you looking at (what files and or displays)?

Now I am checking SKD against VCP, so I use SKD:

Mostly advanced display with Devanagari or SLP1
Then I also refer to the print-scan by the link provided in the adjacent words (because many words contain same printed page link)

What are the criteria for putting something in the list?

After the above steps, sometimes, rarely, we find a match in surprising circumstances (as explained in the five criteria). Then, after double confirming by comparison with the other details such as the dhatu-meaning-word and the first-form, at last declare it as a match.

Are all the items supposed to be verbs?

Where? In my list? Or the list which is provided to me? Both: yes. 101%. Except the silly अन्धं

Why is VCP ऋश 10520 = SKD ऋश 5410 in the list (even though the spelling is same in VCP and SKD)?

I am not sure why it got missed and is given as a no-match even when it is there in both the digitized version and the printed book. Maybe it flew away unnoticed?

-शुभमस्तु Keep smiling. :)

gasyoun commented 3 years ago

I am not sure why it got missed and is given as a no-match even when there is it in the digitized version and the printed book- both.

Hope @funderburkjim is happy with the answers.
