PWGpreverb - Githubissues

funderburkjim commented 7 years ago

This is an analysis of the embedded prefixed verbs in PWG. It was done independently of the data/PWG study. The results are in the folder PWGpreverb.

funderburkjim commented 7 years ago

Steps in the analysis:

preverb1 Use pwghw2.txt and pwg.txt. Scan the pwg.txt digitization for lines beginning -<P>- {#...#} and, for each, write out the parent headword, the prefix, the record number L of the headword in pwghw2.txt, and the line number in pwg.txt of the line containing the prefix.
- 8644 cases found under 1209 PWG headwords
preverb1a Construct the implied headword by joining (with sandhi) the prefix and the headword. Sandhi is compound sandhi (from ScharfSandhi code), with certain empirically invented revisions for Preverb sandhi.
preverb1b match the implied headwords of preverb1a to verb records from MW (verb_step0a.txt in MWvlex repository)
- several empirically derived rules were used to extend the matching
- records were written to two files, one for matches and one for non-matches
- 6379 records in preverb1b_mw.txt
- 5052 of these matches involve no spelling adjustments (marked as MWSAME)
- 1327 of these matches DO involve a spelling adjustment (marked as MWDIFF)
- 2265 records in preverb1b_notmw.txt
compare Once I realized that Dhaval had done something that appears to be a similar analysis, I wrote a comparison of the two.
- Comparison was done between preverb1a.txt and PWGehw3.txt
- PWGehw3 has 7090 records v. 8644 in preverb1a
- the two systems were 'merged', using the pwg.txt line-number as the point of merger.
- 6249 prefixed headwords in both, spellings the same (marked '##EQ')
- 841 prefixed headwords in both, spellings different (marked '##NEQ')
- 1554 prefixed headwords only in preverb1a.txt (marked '##NA')
- 0 prefixed headwords only in PWGehw3.txt (so, PWGehw3 cases are a subset of preverb1a cases)
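
The scan (preverb1) and merge (compare) steps above can be sketched roughly as follows. This is a minimal illustration, not the code in the PWGpreverb folder: the function names are hypothetical, the regex assumes a prefix line looks exactly like `-<P>- {#prefix#} ...`, and the `##NA-EHW` tag is invented here because the thread reports no records that occur only in PWGehw3.txt.

```python
import re

# Assumed shape of a prefix line in pwg.txt: "-<P>- {#prefix#} ..."
PREFIX_LINE = re.compile(r"^-<P>- \{#(.+?)#\}")

def scan_prefixes(lines):
    """Return (line_number, prefix) pairs for lines that open a prefixed verb."""
    found = []
    for lnum, line in enumerate(lines, start=1):
        m = PREFIX_LINE.match(line)
        if m:
            found.append((lnum, m.group(1)))
    return found

def merge_by_line(preverb, ehw):
    """Merge two {pwg line number: implied headword} maps and tag each line."""
    merged = {}
    for lnum in sorted(set(preverb) | set(ehw)):
        a, b = preverb.get(lnum), ehw.get(lnum)
        if b is None:
            tag = "##NA"        # only in preverb1a
        elif a is None:
            tag = "##NA-EHW"    # hypothetical tag; the thread reports 0 such cases
        elif a == b:
            tag = "##EQ"
        else:
            tag = "##NEQ"
        merged[lnum] = (a, b, tag)
    return merged
```

Keying the merge on the pwg.txt line number means the two systems never have to agree on spelling in order to be paired up; the spelling only decides the tag.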

funderburkjim commented 7 years ago

A good next step would be for @drdhaval2785 and me to examine compare.txt for the ##NEQ and ##NA cases. A first glance leads me to think that there are some sandhi errors in both systems, e.g.:

in preverb1a, dus+Vowel, e.g. prAdus + as is not handled properly
in PWGehw3, anvati + i is not handled properly, I think.
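
To make the kind of rule under discussion concrete, here is a toy preverb joiner in SLP1. It is only a sketch under stated assumptions: it is not the ScharfSandhi code and not what preverb1a or PWGehw3 actually do, the function name is hypothetical, and it covers just a handful of rules (a/A plus a simple vowel, and prefix-final `is`/`us` before a vowel or a palatal).

```python
VOWELS = set("aAiIuUfFxXeEoO")  # SLP1 vowels

def join_preverb(prefix, root):
    """Join a preverb and a root with a few illustrative sandhi rules (SLP1)."""
    last, first = prefix[-1], root[0]
    if last in "aA":
        if first in "aA":              # a + a -> A   (pra + an -> prAn)
            return prefix[:-1] + "A" + root[1:]
        if first in "iI":              # a + i -> e   (pra + in -> pren)
            return prefix[:-1] + "e" + root[1:]
        if first in "uU":              # a + u -> o
            return prefix[:-1] + "o" + root[1:]
    if prefix.endswith(("is", "us")):
        if first in VOWELS:            # -is/-us + vowel -> -ir/-ur
            return prefix[:-1] + "r" + root
        if first in "cC":              # -is/-us + c/C -> -iS/-uS
            return prefix[:-1] + "S" + root
    return prefix + root               # fallback: plain concatenation
```

With these rules `join_preverb("prAdus", "as")` gives `prAduras`, which is the dus+vowel behaviour noted above as missing from preverb1a.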

Also,

Why are there so many (1554) 'extra' cases in preverb1a ?

When some such revisions are made, the comparison and other statistics can be revised.

drdhaval2785 commented 7 years ago

@funderburkjim There was a small issue in the code which ignored verbs starting from capitals. Now it is corrected. Now both have 8644 entries. Please git pull before you do any further changes in the code.

gasyoun commented 7 years ago

> There was a small issue in the code which ignored verbs starting from capitals.

Nice catch.

drdhaval2785 commented 7 years ago

A major difference that needs to be corrected in preverb1a concerns the fifth-letter (class nasal) / anusvAra convention.

PWG keeps 'saM', not 'saN', 'saY', etc., in preverbs, so we should follow that convention explicitly. preverb1a gives 'saNkaTay' whereas pwgehw3.txt gives 'saMkaTay'. @funderburkjim compare.py needs to be modified to keep it 'saM' only. This will show greater correspondence.
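
A minimal sketch of the boundary rule being requested, in SLP1. It assumes that a prefix-final 'm' before any consonant is written as anusvAra 'M' and is never assimilated to a class nasal. The function name is hypothetical, and this is not the actual compare.py or preverb1a code. Whether the same rule should apply to prefixes such as aram and astam is not settled in this thread.

```python
VOWELS = set("aAiIuUfFxXeEoO")  # SLP1 vowels

def join_keep_anusvara(prefix, root):
    """Join prefix and root, writing prefix-final m as M before a consonant."""
    if prefix.endswith("m") and root[0] not in VOWELS:
        return prefix[:-1] + "M" + root   # sam + kaTay -> saMkaTay
    return prefix + root                  # before a vowel, m is kept

print(join_keep_anusvara("sam", "kaTay"))
```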

drdhaval2785 commented 7 years ago

The next major difference is the 'n' -> 'R' change. This is a bit weird: sometimes 'n' changes to 'R' and sometimes it doesn't. See:

    833: 17753:kunT:prani:pranikunT:37383 ##NEQ kunT@prani@praRikunT@37383@9
    89199:vah:praRi:praRivah:192891 ##EQ vah@praRi@praRivah@192891@9

In PWGehw3.txt, it is converted by a rule to praRi. I guess this needs to be kept as it is in the original PWG entry. PWG seems to be a bit picky about this.

drdhaval2785 commented 7 years ago

Line 1064: 20315:kzip:nis:nizkzip:43003 ##NEQ kzip@nis@niHkzip@43003@1

nizkzip is wrong. niHkzip is correct.

gasyoun commented 7 years ago

> This will show greater correspondence.

Yes, this convention is met not only in PWG, but in KCH as well.

drdhaval2785 commented 7 years ago

@funderburkjim Let me document entries which are non-'sam' entries.

    Line 53: 2199:an:pra:prAn:4564 ##NEQ an@pra@prAR@4564@1
    Line 54: 2199:an:anupra:anuprAn:4566 ##NEQ an@anupra@anuprAR@4566@1
    Line 55: 2199:an:aBipra:aBiprAn:4568 ##NEQ an@aBipra@aBiprAR@4568@1
    Line 387: 10096:in:pra:pren:21106 ##NEQ in@pra@preR@21106@1
    Line 586: 13470:ej:pra:prEj:28286 ##NEQ ej@pra@prej@28286@9
    Line 639: 15071:kar:is:iskar:31616 ##NEQ kar@is@izkar@31616@1
    Line 832: 17753:kunT:kunT:kuntkunT:37381 ##NEQ kunT@kunT@kunT@37381@1
    Line 1064: 20315:kzip:nis:nizkzip:43003 ##NEQ kzip@nis@niHkzip@43003@1
    Line 1065: 20315:kzip:vinis:vinizkzip:43005 ##NEQ kzip@vinis@viniHkzip@43005@9
    Line 1183: 21814:gam:aram:araNgam:46264 ##NEQ gam@aram@aramgam@46264@9
    Line 1187: 21814:gam:astam:astaNgam:46272 ##NEQ gam@astam@astamgam@46272@9
    Line 1347: 22587:guRay:anuguRita:anuguRitaguRay:48143 ##NEQ guRay@anuguRita@anuguRita@48143@1
    Line 1349: 22587:guRay:praguRita:praguRitaguRay:48147 ##NEQ guRay@praguRita@praguRita@48147@1
    Line 1529: 24519:cat:nis:niScat:52401 ##NEQ cat@nis@nizcat@52401@9
    Line 1541: 24902:car:antar:antaScar:53202 ##NEQ car@antar@antaHcar@53202@9
    Line 1570: 24902:car:dus:duScar:53262 ##NEQ car@dus@duzcar@53262@9
    Line 1571: 24902:car:nis:niScar:53264 ##NEQ car@nis@nizcar@53264@9
    Line 1572: 24902:car:vinis:viniScar:53266 ##NEQ car@vinis@vinizcar@53266@9
    Line 1597: 24974:cart:nis:niScart:53462 ##NEQ cart@nis@nizcart@53462@9
    Line 1644: 25298:ci:nis:niSci:54209 ##NEQ ci@nis@nizci@54209@9
    Line 1645: 25298:ci:aBinis:aBiniSci:54211 ##NEQ ci@aBinis@aBinizci@54211@9
    Line 1646: 25298:ci:avanis:avaniSci:54213 ##NEQ ci@avanis@avanizci@54213@9
    Line 1647: 25298:ci:vinis:viniSci:54215 ##NEQ ci@vinis@vinizci@54215@9
    Line 1665: 25554:cint:nis:niScint:54768 ##NEQ cint@nis@nizcint@54768@9
    Line 1706: 26032:cyu:nis:niScyu:55808 ##NEQ cyu@nis@nizcyu@55808@9
    Line 1715: 26080:Cad:anu:anucCad:55924 ##NEQ Cad@anu@anuCad@55924@9
    Line 1716: 26080:Cad:aBi:aBicCad:55926 ##NEQ Cad@aBi@aBiCad@55926@9
    Line 1718: 26080:Cad:ava:avacCad:55930 ##NEQ Cad@ava@avaCad@55930@9
    Line 1720: 26080:Cad:A:AcCad:55934 ##NEQ Cad@A@ACad@55934@1
    Line 1725: 26080:Cad:upa:upacCad:55944 ##NEQ Cad@upa@upaCad@55944@9
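
For working through lists like the one above, a small parser for a compare.txt record may help. This is a sketch, not part of the repository. The field names are hypothetical and are inferred from the step descriptions earlier in the thread (record number in pwghw2.txt, headword, prefix, implied headword, pwg.txt line number). The meaning of the last `@` field (1 or 9) is not stated in the thread, so it is kept as an opaque string.

```python
import re
from typing import NamedTuple

class CompareRecord(NamedTuple):
    hw_record: int         # record number L in pwghw2.txt (assumed)
    headword: str
    prefix: str
    preverb_spelling: str  # implied headword from preverb1a
    pwg_line: int          # line number in pwg.txt (assumed)
    status: str            # EQ, NEQ, NA, ...
    ehw_spelling: str      # implied headword from PWGehw3
    ehw_flag: str          # last @ field; meaning not documented here

def parse_compare_line(line):
    """Parse one record, tolerating a leading 'Line N:' from an editor search."""
    line = re.sub(r"^\s*Line \d+:\s*", "", line)
    left, status, right = line.split()
    rec, hw, prefix, spelling, pwg_line = left.split(":")
    ehw = right.split("@")
    return CompareRecord(int(rec), hw, prefix, spelling, int(pwg_line),
                         status.lstrip("#"), ehw[2], ehw[4])
```

For example, the record for nis + kzip parses to preverb_spelling `nizkzip` and ehw_spelling `niHkzip`, which makes it easy to group the ##NEQ cases by prefix.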

funderburkjim commented 7 years ago

After the extra prefixes were incorporated into PWGehw3 (and before any other adjustments), the baseline stats of compare.txt are:

7633 prefixed headwords in both, spellings the same
1011 prefixed headwords in both, spellings different
0 prefixed headwords only in preverb1a.txt
0 prefixed headwords only in ../PWG/PWGehw3.txt

So, the two lists of prefixes are identical.

Will next tackle differences in nasals in the comparison.

gasyoun commented 7 years ago

> Will next tackle differences in nasals in the comparison.

Yeah, that's a real battleground and there seems to be no end to it.

funderburkjim commented 7 years ago

First skirmish: apply hwnorm1c headword normalization logic to those cases not matched. This cuts the problem list almost in half. The ones that match with only this normalization are marked ##EQNORM in compare.txt. Only compare.py was altered.

8644 records written to compare.txt
8114 prefixed headwords in both, spellings the same
 Of these, 481 have same spellings AFTER HWNORM1C normalization
530 prefixed headwords in both, spellings different
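
The three-way tagging described here can be sketched as below. The normalizer is only a stand-in: it rewrites a class nasal before a homorganic stop as anusvAra 'M' (SLP1) and is NOT the actual hwnorm1c logic, whose rules are not spelled out in this thread. The function names are hypothetical.

```python
import re

# Stand-in normalizer (assumption, not hwnorm1c): class nasal before a
# homorganic stop is rewritten as anusvara M, in SLP1.
_NASAL = re.compile(r"N(?=[kKgG])|Y(?=[cCjJ])|R(?=[wWqQ])|n(?=[tTdD])|m(?=[pPbB])")

def normalize(word):
    return _NASAL.sub("M", word)

def classify(preverb_spelling, ehw_spelling, norm=normalize):
    """Tag a pair as ##EQ, ##EQNORM (equal only after normalization) or ##NEQ."""
    if preverb_spelling == ehw_spelling:
        return "##EQ"
    if norm(preverb_spelling) == norm(ehw_spelling):
        return "##EQNORM"
    return "##NEQ"

print(classify("saNkaTay", "saMkaTay"))
```

Passing the normalizer in as a parameter keeps the comparison logic unchanged if the normalization rules are revised later.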

funderburkjim commented 7 years ago

One way to analyze the remaining 530 cases where the spelling differs for the implied prefixed verb (between PWGpreverb and PWGehw3) is to use the preverb1b_mw.txt file of spellings. Recall these are cases where a match was found between the PWGpreverb spelling and an MW root spelling.

This file is the result of the comparison: temp_compare_mw.txt

When this is done, the list of 530 is separated into two parts of almost the same size:

261 cases where the preverb spelling matches the MW spelling. These cases indicate a difference between the PWGehw3 spelling and MW. These are marked ##PVMW in temp_compare_mw.
- in at least one case (prAn), MW has both spellings, but with some other preverbs of 'an' MW shows only the 'AR' spelling, so these cases argue for a change to the PWGpreverb spelling.
269 cases where the preverb spelling does NOT match an MW spelling. These cases are good candidates for a revision to the PWGpreverb spelling. These cases are marked ##NOTMW in temp_compare_mw.
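
The split into ##PVMW and ##NOTMW amounts to a set-membership test against the MW-matched spellings. A minimal sketch, with a hypothetical function name and a toy MW set rather than the real contents of preverb1b_mw.txt:

```python
def split_by_mw(neq_spellings, mw_spellings):
    """Partition differing preverb spellings by whether MW has the spelling."""
    mw = set(mw_spellings)
    pvmw = [s for s in neq_spellings if s in mw]       # would be tagged ##PVMW
    notmw = [s for s in neq_spellings if s not in mw]  # would be tagged ##NOTMW
    return pvmw, notmw

# Toy data for illustration only; not the real MW match list.
pvmw, notmw = split_by_mw(["prAn", "nizkzip"], {"prAn"})
print(len(pvmw), len(notmw))
```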

I'll focus on the NOTMW cases, and @drdhaval2785 might focus on the PVMW cases.

gasyoun commented 7 years ago

> First skirmish: apply hwnorm1c headword normalization logic to those cases not matched

Well done, was experimenting on that logic locally.

> I'll focus on the NOTMW cases, and @drdhaval2785 might focus on the PVMW cases.

@drdhaval2785 please, please, please :pray:

drdhaval2785 commented 7 years ago

Bite size has become chewable now. Will try to complete it today.

drdhaval2785 commented 7 years ago

@funderburkjim The sandhi rules in ehw3.txt were a bit primitive, as I didn't have any ready module for sandhi. They were a bag of regexes, written as they came to my mind. Scharfsandhi seems to be a bit advanced. So ignore my output. Both methods have been compared and yours is better. So let us dump ehw3.txt and go ahead with preverb1a.

gasyoun commented 7 years ago

> Scharfsandhi seems to be a bit advanced.

Good to know that we have it.

funderburkjim commented 7 years ago

> dump ehw3.txt and go ahead with preverb1a.

OK. I think it was helpful to have the independent approaches initially.

I noticed that ehw3 is different from yesterday and regenerated the comparison. Now there are

426 (530 previously) differences between ehw3 and preverb.
Redoing the comparison to MW for these 426 (temp_compare_mw.txt):
- 184 have the preverb spelling matched to MW.
- 242 are not matched with MW.

I have yet to look at these; my next task.

sanskrit-lexicon / alternateheadwords

PWGpreverb #12
