sanskrit-lexicon / alternateheadwords

Prepare list of alternate headwords for all Cologne dictionaries

PWGpreverb #12

Open funderburkjim opened 7 years ago

funderburkjim commented 7 years ago

This is an analysis of the embedded prefixed verbs in PWG. It was done independently of the data/PWG study. The results are in the folder PWGpreverb.

funderburkjim commented 7 years ago

steps in the analysis.

funderburkjim commented 7 years ago

A good next step would be for @drdhaval2785 and me to examine compare.txt for the #NEQ and #NA cases. A first glance leads me to think that there are some errors in sandhi in both systems, e.g.:

Also,

When some such revisions are made, the comparison and other statistics can be revised.

drdhaval2785 commented 7 years ago

@funderburkjim There was a small issue in the code which ignored verbs starting from capitals. It is now corrected, and both files have 8644 entries. Please git pull before you make any further changes in the code.

gasyoun commented 7 years ago

There was a small issue in the code which ignored verbs starting from capitals.

Nice catch.

drdhaval2785 commented 7 years ago

The major difference that needs to be corrected in preverb1a concerns the fifth letter / anusvAra convention.

PWG keeps 'saM' and not 'saN', 'saY' etc. in preverbs. Preverb1a gives 'saNkaTay' whereas pwgehw3.txt gives 'saMkaTay'. @funderburkjim compare.py needs to be modified to keep it as 'saM' only, so that it corresponds explicitly to the PWG convention. This will show greater correspondence.
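A minimal sketch of the 'saM' convention described above, assuming SLP1 spellings. The function name `keep_anusvara` and the choice to apply it only to preverbs ending in 'sam' are illustrative assumptions; this is not the code in compare.py.

```python
import re

# SLP1 nasals that sandhi can produce from the final 'm' of 'sam'.
CLASS_NASALS = "NYRnm"
# SLP1 vowels; a nasal followed by a vowel is left untouched.
VOWELS = "aAiIuUfFxXeEoO"

def keep_anusvara(headword, preverb):
    """Rewrite the nasal at the preverb/root boundary as anusvAra 'M'.

    Hypothetical helper: only preverbs ending in 'sam' are handled, and only
    when the nasal is followed by a consonant.
    """
    if not preverb.endswith("sam"):
        return headword
    stem = preverb[:-1]  # e.g. 'sa' for 'sam'
    pattern = "^" + re.escape(stem) + "[" + CLASS_NASALS + "](?=[^" + VOWELS + "])"
    return re.sub(pattern, stem + "M", headword)

print(keep_anusvara("saNkaTay", "sam"))
```

With this rule, 'saNkaTay' from preverb1a would be rewritten to the 'saMkaTay' form that pwgehw3.txt uses, while a form such as 'samAgam', where the nasal precedes a vowel, is left alone.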

drdhaval2785 commented 7 years ago

The next major difference is the 'n' -> 'R' change. This is a bit weird: sometimes 'n' changes to 'R' and sometimes it doesn't. See

    833: 17753:kunT:prani:pranikunT:37383 ##NEQ kunT@prani@praRikunT@37383@9
    89199:vah:praRi:praRivah:192891 ##EQ vah@praRi@praRivah@192891@9

In PWGehw3.txt, it is converted by a rule to praRi. I guess this needs to be kept as it is in the original entry in PWG. PWG seems to be a bit picky about this.
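One way to isolate the records affected by this 'n' / 'R' question is to test whether two spellings become identical once 'R' is read as 'n'. This is a rough sketch; the function name is hypothetical and it does not attempt to decide which spelling PWG actually prints.

```python
def differs_only_in_n_vs_R(spelling_a, spelling_b):
    """True when the spellings differ, but only by SLP1 'n' versus 'R'."""
    if spelling_a == spelling_b:
        return False
    return spelling_a.replace("R", "n") == spelling_b.replace("R", "n")

print(differs_only_in_n_vs_R("pranikunT", "praRikunT"))
```

Records flagged this way would still need to be checked against the printed PWG entry, since the dictionary is not consistent about the change.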

drdhaval2785 commented 7 years ago

Line 1064: 20315:kzip:nis:nizkzip:43003 ##NEQ kzip@nis@niHkzip@43003@1

nizkzip is wrong. niHkzip is correct.
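For working through such records programmatically, a compare.txt line can be split into its two spellings. The sketch below is based only on the shape of the records quoted in this thread; the field names are guesses, and the numeric fields are deliberately left uninterpreted.

```python
from typing import NamedTuple

class CompareRecord(NamedTuple):
    root: str
    preverb: str
    spelling_a: str  # spelling on the colon-separated (left) side
    spelling_b: str  # spelling on the @-separated (right) side
    status: str      # e.g. "NEQ" or "EQ"

def parse_compare_line(line):
    """Parse one record, without any leading 'Line NNNN:' search prefix."""
    left, status, right = line.split()
    left_fields = left.split(":")
    right_fields = right.split("@")
    return CompareRecord(
        root=left_fields[1],
        preverb=left_fields[2],
        spelling_a=left_fields[3],
        spelling_b=right_fields[2],
        status=status.lstrip("#"),
    )

rec = parse_compare_line("20315:kzip:nis:nizkzip:43003 ##NEQ kzip@nis@niHkzip@43003@1")
print(rec.spelling_a, rec.spelling_b, rec.status)
```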

gasyoun commented 7 years ago

This will show greater correspondence.

Yes, this convention is met not only in PWG, but in KCH as well.

drdhaval2785 commented 7 years ago

@funderburkjim Let me document entries which are non-'sam' entries.

    Line 53: 2199:an:pra:prAn:4564 ##NEQ an@pra@prAR@4564@1
    Line 54: 2199:an:anupra:anuprAn:4566 ##NEQ an@anupra@anuprAR@4566@1
    Line 55: 2199:an:aBipra:aBiprAn:4568 ##NEQ an@aBipra@aBiprAR@4568@1
    Line 387: 10096:in:pra:pren:21106 ##NEQ in@pra@preR@21106@1
    Line 586: 13470:ej:pra:prEj:28286 ##NEQ ej@pra@prej@28286@9
    Line 639: 15071:kar:is:iskar:31616 ##NEQ kar@is@izkar@31616@1
    Line 832: 17753:kunT:kunT:kuntkunT:37381 ##NEQ kunT@kunT@kunT@37381@1
    Line 1064: 20315:kzip:nis:nizkzip:43003 ##NEQ kzip@nis@niHkzip@43003@1
    Line 1065: 20315:kzip:vinis:vinizkzip:43005 ##NEQ kzip@vinis@viniHkzip@43005@9
    Line 1183: 21814:gam:aram:araNgam:46264 ##NEQ gam@aram@aramgam@46264@9
    Line 1187: 21814:gam:astam:astaNgam:46272 ##NEQ gam@astam@astamgam@46272@9
    Line 1347: 22587:guRay:anuguRita:anuguRitaguRay:48143 ##NEQ guRay@anuguRita@anuguRita@48143@1
    Line 1349: 22587:guRay:praguRita:praguRitaguRay:48147 ##NEQ guRay@praguRita@praguRita@48147@1
    Line 1529: 24519:cat:nis:niScat:52401 ##NEQ cat@nis@nizcat@52401@9
    Line 1541: 24902:car:antar:antaScar:53202 ##NEQ car@antar@antaHcar@53202@9
    Line 1570: 24902:car:dus:duScar:53262 ##NEQ car@dus@duzcar@53262@9
    Line 1571: 24902:car:nis:niScar:53264 ##NEQ car@nis@nizcar@53264@9
    Line 1572: 24902:car:vinis:viniScar:53266 ##NEQ car@vinis@vinizcar@53266@9
    Line 1597: 24974:cart:nis:niScart:53462 ##NEQ cart@nis@nizcart@53462@9
    Line 1644: 25298:ci:nis:niSci:54209 ##NEQ ci@nis@nizci@54209@9
    Line 1645: 25298:ci:aBinis:aBiniSci:54211 ##NEQ ci@aBinis@aBinizci@54211@9
    Line 1646: 25298:ci:avanis:avaniSci:54213 ##NEQ ci@avanis@avanizci@54213@9
    Line 1647: 25298:ci:vinis:viniSci:54215 ##NEQ ci@vinis@vinizci@54215@9
    Line 1665: 25554:cint:nis:niScint:54768 ##NEQ cint@nis@nizcint@54768@9
    Line 1706: 26032:cyu:nis:niScyu:55808 ##NEQ cyu@nis@nizcyu@55808@9
    Line 1715: 26080:Cad:anu:anucCad:55924 ##NEQ Cad@anu@anuCad@55924@9
    Line 1716: 26080:Cad:aBi:aBicCad:55926 ##NEQ Cad@aBi@aBiCad@55926@9
    Line 1718: 26080:Cad:ava:avacCad:55930 ##NEQ Cad@ava@avaCad@55930@9
    Line 1720: 26080:Cad:A:AcCad:55934 ##NEQ Cad@A@ACad@55934@1
    Line 1725: 26080:Cad:upa:upacCad:55944 ##NEQ Cad@upa@upaCad@55944@9

funderburkjim commented 7 years ago

After the extra prefixes were incorporated into PWGehw3 (and before any other adjustments), the baseline stats of compare.txt are:

    7633 prefixed headwords in both, spellings the same
    1011 prefixed headwords in both, spellings different
    0 prefixed headwords only in preverb1a.txt
    0 prefixed headwords only in ../PWG/PWGehw3.txt

So the two lists of prefixes are identical.
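The four baseline counts can be tallied along the following lines. This is a sketch, not the logic of compare.py: it assumes each file has been loaded into a dict keyed by (root, preverb), which may not be unique in practice for homonymous roots.

```python
def tally(spellings_a, spellings_b):
    """Count agreement between two {(root, preverb): spelling} mappings."""
    keys_a, keys_b = set(spellings_a), set(spellings_b)
    shared = keys_a & keys_b
    same = sum(1 for key in shared if spellings_a[key] == spellings_b[key])
    return {
        "same": same,
        "different": len(shared) - same,
        "only_a": len(keys_a - keys_b),
        "only_b": len(keys_b - keys_a),
    }

# Toy data taken from two records quoted earlier in this thread.
a = {("kzip", "nis"): "nizkzip", ("vah", "praRi"): "praRivah"}
b = {("kzip", "nis"): "niHkzip", ("vah", "praRi"): "praRivah"}
print(tally(a, b))
```

When both "only" counts are zero, as in the baseline above, the two files cover the same set of prefixed verbs and only the spellings remain to be reconciled.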

Will next tackle differences in nasals in the comparison.

gasyoun commented 7 years ago

Will next tackle differences in nasals in the comparison.

Yeah, that's a real battleground and there seems to be no end to it.

funderburkjim commented 7 years ago

First skirmish: apply hwnorm1c headword normalization logic to those cases not matched. This cuts the problem list almost in half. The ones that match with only this normalization are marked ##EQNORM in compare.txt. Only compare.py was altered.
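The two-pass idea (exact match first, then match after normalization) can be sketched as below. The `toy_normalize` function is a stand-in chosen for illustration and is not hwnorm1c; it only folds a nasal before a stop into anusvAra, in SLP1.

```python
import re

def toy_normalize(spelling):
    """Stand-in normalizer (NOT hwnorm1c): nasal before a stop becomes 'M'."""
    return re.sub(r"[NYRnm](?=[kKgGcCjJwWqQtTdDpPbB])", "M", spelling)

def classify(spelling_a, spelling_b, normalize):
    """Return 'EQ', 'EQNORM' or 'NEQ' for a pair of spellings."""
    if spelling_a == spelling_b:
        return "EQ"
    if normalize(spelling_a) == normalize(spelling_b):
        return "EQNORM"
    return "NEQ"

print(classify("saNkaTay", "saMkaTay", toy_normalize))
```

Passing the normalizer in as a parameter keeps the comparison logic independent of whichever normalization is eventually chosen.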

    8644 records written to compare.txt
    8114 prefixed headwords in both, spellings the same
     Of these, 481 have same spellings AFTER HWNORM1C normalization
    530 prefixed headwords in both, spellings different

funderburkjim commented 7 years ago

One way to analyze the remaining 530 cases where the spelling differs for the implied prefixed verb (between PWGpreverb and PWGehw3) is to use the preverb1b_mw.txt file of spellings. Recall these are cases where a match was found between the PWGpreverb spelling and an MW root spelling.

This file is the result of the comparison: temp_compare_mw.txt

When this is done, the list of 530 is separated into two parts of almost the same size:

I'll focus on the NOTMW cases, and @drdhaval2785 might focus on the PVMW cases.
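The split into PVMW and NOTMW cases might look like the following. It assumes that PVMW means the PWGpreverb spelling was found among the MW spellings and NOTMW means it was not; that reading of the labels, the function name, and the toy data are all assumptions, and the real layout of preverb1b_mw.txt is not modeled here.

```python
def split_by_mw(unmatched_spellings, mw_spellings):
    """Partition spellings by whether they occur in the MW spelling list."""
    mw = set(mw_spellings)
    groups = {"PVMW": [], "NOTMW": []}
    for spelling in unmatched_spellings:
        label = "PVMW" if spelling in mw else "NOTMW"
        groups[label].append(spelling)
    return groups

# Toy membership only; it says nothing about what MW actually contains.
print(split_by_mw(["spelling1", "spelling2"], ["spelling1"]))
```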

gasyoun commented 7 years ago

First skirmish: apply hwnorm1c headword normalization logic to those cases not matched

Well done, was experimenting on that logic locally.

I'll focus on the NOTMW cases, and @drdhaval2785 might focus on the PVMW cases.

@drdhaval2785 please, please, please :pray:

drdhaval2785 commented 7 years ago

Bite size has become chewable now. Will try to complete it today.

drdhaval2785 commented 7 years ago

@funderburkjim The sandhi rules in ehw3.txt were a bit primitive, as I didn't have any ready module for sandhi. They were a bag of regexes, written as they came to my mind. Scharfsandhi seems to be a bit advanced. So ignore my output. Both methods have been compared and yours is better. So let us dump ehw3.txt and go ahead with preverb1a.

gasyoun commented 7 years ago

Scharfsandhi seems to be a bit advanced.

Good to know that we have it.

funderburkjim commented 7 years ago

dump ehw3.txt and go ahead with preverb1a.

OK. I think it was helpful to have the independent approaches initially.

I noticed that ehw3 is different from yesterday and regenerated the comparison. Now there are

I have yet to look at these; my next task.