funderburkjim commented 2 years ago

In #140, it was mentioned that there are many errors in the coding of accents in the CDSL version of MW. This issue is devoted to correcting these errors.

It is reasonable to restrict to headwords. The 'k2' (key2) field in the metaline shows accents.

107802 matches for "<k2>.*/" in buffer: mw.txt  udAtta accents
114 matches for "<k2>.*\^" in buffer: mw.txt  svarita accents

In pwg:
470 matches for "<k2>.*\^" in buffer: pwg.txt
20809 matches for "<k2>.*/" in buffer: pwg.txt

In pw:
17929 matches for "<k2>.*/" in buffer: pw.txt
293 matches for "<k2>.*\^" in buffer: pw.txt
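The counts above come from editor searches. A minimal sketch of the same count in Python, assuming the dictionary file is read as plain text lines; the function name `count_accent_lines` is only illustrative and is not part of any CDSL script:

```python
import re

# Same patterns as the editor searches quoted above.
UDATTA = re.compile(r"<k2>.*/")
SVARITA = re.compile(r"<k2>.*\^")

def count_accent_lines(lines):
    """Return (udatta_count, svarita_count) for an iterable of text lines."""
    udatta = svarita = 0
    for line in lines:
        if UDATTA.search(line):
            udatta += 1
        if SVARITA.search(line):
            svarita += 1
    return udatta, svarita

# Lines copied from this thread; the body line has no <k2> and is not counted.
sample = [
    "<L>21088<pc>121,2<k1>asurya<k2>asurya^<h>1<e>2",
    "<L>212561<pc>1052,1<k1>Sapana<k2>Sa/pana<e>2",
    "<s>Sa/pana</s> ¦ <lex>n.</lex> a curse, imprecation, <ls>AV.</ls>",
]
print(count_accent_lines(sample))
```

For a whole file, the same function could be called on an open file object, for example `count_accent_lines(open("mw.txt", encoding="utf-8"))`.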

We can assume there should be consistency in accent between MW and the Boehtlingk dictionaries (PW, PWG).

A reasonable first step might be to look at the svarita accents. For instance:

pw: <L>12716<pc>1151-1<k1>asurya<k2>asurya^<e>100
mw: <L>21088<pc>121,2<k1>asurya<k2>asurya^<h>1<e>2

We could do such a comparison by program and print out the exceptions for hand examination.
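A rough sketch of such a comparison, assuming that metalines start with `<L>` and that `k1` is a good enough key for matching entries across dictionaries. The helper names are hypothetical, and homonyms (several entries sharing one `k1`) are simply pooled into a set, which may produce false alarms:

```python
import re
from collections import defaultdict

META = re.compile(r"<k1>([^<]*)<k2>([^<]*)")

def k2_by_k1(lines):
    """Map each k1 to the set of its k2 spellings, compound dividers removed."""
    table = defaultdict(set)
    for line in lines:
        if not line.startswith("<L>"):
            continue  # body text, <LEND>, blank lines
        m = META.search(line)
        if m:
            # MW marks compound joints with '-' or an em-dash; PWG does not.
            k2 = m.group(2).strip().replace("-", "").replace("\u2014", "")
            table[m.group(1)].add(k2)
    return table

def accent_exceptions(mw_lines, pwg_lines):
    """List (k1, mw_k2s, pwg_k2s) for shared headwords whose k2 sets differ."""
    mw, pwg = k2_by_k1(mw_lines), k2_by_k1(pwg_lines)
    return [
        (k1, sorted(mw[k1]), sorted(pwg[k1]))
        for k1 in sorted(mw.keys() & pwg.keys())
        if mw[k1] != pwg[k1]
    ]

mw_sample = [
    "<L>212560<pc>1052,1<k1>SapaTya<k2>SapaTya/<e>2",
    "<L>212561<pc>1052,1<k1>Sapana<k2>Sa/pana<e>2",
]
pwg_sample = [
    "<L>97768<pc>7-0062<k1>SapaTya<k2>SapaTya^",
    "<L>97769<pc>7-0062<k1>Sapana<k2>Sa/pana",
]
for row in accent_exceptions(mw_sample, pwg_sample):
    print(row)
```

Only headwords present in both files are reported; headwords missing from one side would need a separate pass.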

funderburkjim commented 2 years ago

See https://github.com/sanskrit-lexicon/MWS/issues/137#issuecomment-1251359197 for another approach to detecting accent problems.

drdhaval2785 commented 2 years ago

Kindly look at the following entry SapaTya in both PWG and MW

PWG

<L>97768<pc>7-0062<k1>SapaTya<k2>SapaTya^
{#SapaTya^#}¦ (wie eben) <lex>adj.</lex> {%auf Fluch beruhend%} 
<ls>ṚV. 10, 97, 16.</ls>
<LEND>

PWG display

शपथ्य [Printed book page [7-0062](https://www.sanskrit-lexicon.uni-koeln.de/scans/csl-apidev/servepdf.php?dict=PWG&page=7-0062)]
शपथ्य॑ (wie eben) adj. auf Fluch beruhend [Ṛv. 10, 97, 16.](https://sanskrit-lexicon.github.io/rvlinks/rvhymns/rv10.097.html#rv10.097.16)                  [ID=97768]

MW

<L>212560<pc>1052,1<k1>SapaTya<k2>SapaTya/<e>2
<s>SapaTya/</s> ¦ <lex>mfn.</lex> depending on a curse, (a sin) consisting in cursing or imprecation, <ls>RV.</ls><info lex="m:f:n"/>
<LEND>

MW display


(H2) [Printed book page [1052](https://www.sanskrit-lexicon.uni-koeln.de/scans/csl-apidev/servepdf.php?dict=MW&page=1052),1]
शपथ्य॑ mfn. depending on a curse, (a sin) consisting in cursing or imprecation, RV.  [ID=212560]

Note the PWG SapaTya^ versus the MW SapaTya/. SapaTya ends with 'a' in svarita. This sure is confusing. For the dictionaries PWG and MW, we should have a consistent SLP1 encoding.

drdhaval2785 commented 2 years ago

MW typo examples

MW data

<L>212560<pc>1052,1<k1>SapaTya<k2>SapaTya/<e>2
<s>SapaTya/</s> ¦ <lex>mfn.</lex> depending on a curse, (a sin) consisting in cursing or imprecation, <ls>RV.</ls><info lex="m:f:n"/>
<LEND>
<L>212561<pc>1052,1<k1>Sapana<k2>Sa/pana<e>2
<s>Sa/pana</s> ¦ <lex>n.</lex> a curse, imprecation, <ls>AV.</ls><info lex="n"/>
<LEND>

MW snippet

[Screenshot: Screenshot_2022-09-20_11-18-57]

MW marks both with '/', whereas the two accents are different.

PWG data

<L>97768<pc>7-0062<k1>SapaTya<k2>SapaTya^
{#SapaTya^#}¦ (wie eben) <lex>adj.</lex> {%auf Fluch beruhend%} 
<ls>ṚV. 10, 97, 16.</ls>
<LEND>

<L>97769<pc>7-0062<k1>Sapana<k2>Sa/pana
{#Sa/pana#}¦ (von {#Sap#}) <lex>n.</lex> = {#SapaTa#} 
<ls>AK. 1, 1, 5, 10.</ls> 
<ls>H. 262.</ls> {%Fluch%} 
<ls>TRIK. 3, 2, 9.</ls> 
<ls>AV. 1, 28, 3.</ls>
<LEND>

PWG snippet

[Screenshot: Screenshot_2022-09-20_11-20-28]

PWG shows the two with different accents.

Andhrabharati commented 2 years ago

Good to see that @drdhaval2785 is also finding the 'inconsistency' issues across the dictionaries' data now.

A standing rule to be followed would be that whatever the internal representation of the text (file) data is, the end result (display or otherwise) should tally with the printed matter, whether it is in the dictionary itself or in the reference work that it is citing from.

[I would have recommended Dhaval to post the citation matter from the RV and AV as well (as the case may be), to make the argument further strong/appealing. Ultimately they are the ones that are to be referred to, the dictionaries or others are just helping to reach them.]

drdhaval2785 commented 2 years ago

https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.tsv - The highest priority accent differences. Entries are in the format: headword, AccentInMW, AccentInPWG.

https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.html - HTML can be downloaded and checked manually if needed.

drdhaval2785 commented 2 years ago

Once this is done, we can go to the next step. That is because of the compound issues.

TSV file - https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log_with_compounds.tsv
HTML file - https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log_with_compounds.html

examples

aMSaBU	a/MSaBU/	aMSaBU/

Here, the headword is a/MSa. When it is used in compound, because of rules governing accent to compounds, it becomes aMSaBU/. PWG correctly captures this.

As the compound parsing was done through some program in MW, the accent portion of it was not properly handled or could not be properly handled. Therefore, it gave rise to a/MSaBU/ instead of aMSaBU/

We need to convert it back to aMSaBU/ as per PWG. For headwords which occur in both MW and PWG, it may be an easy programmatic solution, but not sure how to handle the cases where there is no matching headword in PWG.

My hunch is that we should keep the accent as per the second part of the compound, and ignore the accent of first part of compound. Not sure.

Andhrabharati commented 2 years ago

My hunch is that we should keep the accent as per the second part of the compound, and ignore the accent of first part of compound. Not sure.

In general this is the principle to be followed, @drdhaval2785 ; in cases where the accent needs to be retained on the first part, the print has invariably mentioned it just before its (entry word's) lexical info (gender or otherwise).

I see that quite many of those portions are missing in the MW digitisation, though occasionally present scattered (just like the nom. case endings that I was talking about all these days).

Andhrabharati commented 2 years ago

See @gasyoun , Dhaval has come out now with two lists (499 + 3169) counting to about 3600 entries, corroborating my estimate of more than couple of thousands as posted at https://github.com/sanskrit-lexicon/MWS/issues/140#issuecomment-1250366037.

[I am quite sure there still would be more entries in the text, that need to be identified and corrected.]

drdhaval2785 commented 2 years ago

Slight correction. 499 is a subset of 3169, and not in addition thereto. So total 3169 differences. Quite sizeable.

Andhrabharati commented 2 years ago

For headwords which occur in both MW and PWG, it may be an easy programmatic solution, but not sure how to handle the cases where there is no matching headword in PWG.

First option is to look for those in pwk, @drdhaval2785, and then in the pwkvn and also the "leftout" VN portions of PWG volumes 1-4 and 6. There are quite many in those pages, that did not come in PWG VN pages of Vol. 5 and Vol.7 (Jim was thinking the case to be otherwise with some random checks; I had checked all those entries and found that Jim was wrong, but did not pursue the matter with him! Much against my nature, to see the matter to reach its 'proper' end!!!)

drdhaval2785 commented 2 years ago

On a cursory look at https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.tsv, it seems that a large chunk of it ends with ya. I am not sure whether there is some programmatic oddity which gave rise to this, or there is some grammatical rule which allows optional accents with words ending in suffixes ending with ya. Just noting it here, so that some grammatically inclined person can have a look.

drdhaval2785 commented 2 years ago

First option is to look for those in pwk, @drdhaval2785 (and then in the pwkvn and also the "leftout" VN portions of PWG volumes 1-4.

Seems a reasonable way. So the hierarchy is PWG -> PWK -> PWKVN -> PWGVN. Pardon my ignorance about PWKVN and PWGVN. Have not seen them at all.
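The lookup order PWG -> PWK -> PWKVN -> PWGVN can be sketched as a simple fallback chain. This assumes each source has already been reduced to a mapping from `k1` to an accented `k2`; the tables below are toy data built from metalines quoted in this thread, not a statement about which dictionary actually contains which headword:

```python
def reference_accent(k1, sources):
    """Return (source_name, k2) from the first source that lists k1, else None.

    sources: ordered list of (name, mapping) pairs, highest priority first.
    """
    for name, table in sources:
        if k1 in table:
            return name, table[k1]
    return None

# Toy tables only; real ones would be built from the metalines of each file.
SOURCES = [
    ("PWG", {"SapaTya": "SapaTya^", "aMhu": "aMhu/"}),
    ("PWK", {"asurya": "asurya^", "aMhu": "aMhu/"}),
    ("PWKVN", {}),
    ("PWGVN", {}),
]

print(reference_accent("aMhu", SOURCES))
print(reference_accent("asurya", SOURCES))
print(reference_accent("Sapana", SOURCES))
```

Returning the source name along with the accent keeps a record of where each proposed correction came from, which matters when the sources disagree.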

Andhrabharati commented 2 years ago

Pardon my ignorance about PWKVN and PWGVN. Have not seen them at all.

These VN pages are the Additions and Corrections to pwk and PWG volumes [printed at the end of respective volume or in the later volume(s)].

pwkvn (of all the 7 volumes) is hosted under a separate repo, as Jim has his own reasons not to club it with the pwk text, as is done for every other CDSL work. I had proposed to him once to combine them and then left the matter.

After my pointing out the matter being missed altogether, there were some trials to derive the pwkvn data from SCH data, but finally it was decided to completely get those pages retyped. Jim seems to have funded (James Funderburk > Fund; hope Jim does not mind my saying thus) the digitisation expenses, as per Thomas.

PWG VN portions of Vol. 5 and Vol. 7 are after the PWG main portions of those two volumes respectively. The other volumes' VN data is lying in some old version of PWG, which came out in my 'dugging' the old folders, and I had even posted the data completely 'proofed' ; they just amount to some 1000+ entries/lines.

Andhrabharati commented 2 years ago

Slight correction. 499 is a subset of 3169, and not in addition thereto. So total 3169 differences. Quite sizeable.

I did not look at the items at all in the two lists, just seen the numbers.

Thought they would have been different, looking at this line--

Once this is done, we can go to the next step.

[I did not expect that the entries would have been repeated in another list, once 'done' in a list, as indicated in the above statement.]

vvasuki commented 2 years ago

Good to see that @drdhaval2785 is also finding the 'inconsistency' issues across the dictionaries' data now.

A standing rule to be followed would be that whatever the internal representation of the text (file) data is, the end result (display or otherwise) should tally with the printed matter, whether it is in the dictionary itself or in the reference work that it is citing from.

But, it should not be the only "end result", or there would be no question of devanAgarI headwords for MW etc.. It is desirable, as mentioned elsewhere (https://github.com/sanskrit-lexicon/csl-ldev/issues/7#issuecomment-1249280796) to additionally (and prominently) show the accent in a standardized format.

On a cursory look at https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.tsv, it seems that a large chunk of it ends with ya. I am not sure whether there is some programmatic oddity which gave rise to this, or there is some grammatical rule which allows optional accents with words ending in suffixes ending with ya. Just noting it here, so that some grammatically inclined person can have a look.

@drdhaval2785 This is a matter of jAtya-svarita - or svarita arising from internal sandhi and not as a consequence of following an udAtta. In such words, instead of an udAtta setting the tone, you have a svarita. It occurs only after ya or va. For example, in the case of shapathya, Bohtlingk-Sanskrit-Worterbuch-in-kurzerer-Fassung deciphers it as शपथि꣫अ. So, it should be easy to detect programmatically.
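One possible reading of "easy to detect programmatically", as a hedged sketch: flag a pair when PWG marks a svarita on a vowel that directly follows y or v, and MW has the same spelling with an udAtta mark in that place. The interpretation of "after ya or va" as "vowel immediately after the semivowel", and the function name, are assumptions made for this illustration:

```python
import re

VOWELS = "aAiIuUfFxXeEoO"  # SLP1 vowel letters
SVARITA_AFTER_SEMIVOWEL = re.compile(rf"[yv][{VOWELS}]\^")

def is_jatya_candidate(mw_k2, pwg_k2):
    """True if PWG has y/v + vowel + '^' and MW differs only by '/' for '^'."""
    return (
        SVARITA_AFTER_SEMIVOWEL.search(pwg_k2) is not None
        and pwg_k2.replace("^", "/") == mw_k2
    )

print(is_jatya_candidate("SapaTya/", "SapaTya^"))  # pair quoted in this thread
print(is_jatya_candidate("Sa/pana", "Sa/pana"))    # accents already agree
```

A hit only marks the entry as a candidate for checking against the print; it does not decide which dictionary is right.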

drdhaval2785 commented 2 years ago

Good point raised @vvasuki . Thanks.

vvasuki commented 2 years ago

Ultimately they are the ones that are to be referred to, the dictionaries or others are just helping to reach them.

An important point of clarification regarding the above. RV, SV, AV etc.. are NOT the only source of svara-s (as the bhAShyakAra says - it's impossible to list all sAdhu-shabda-s) - we have vyAkaraNa to deduce svara-s (which are incidentally a must in truly "proper" laukika speech as per shAstra). So, it is not inconceivable that the dict makers would have used the very same sUtras that we refer to today when determining svara-s in some cases.

Andhrabharati commented 2 years ago

For headwords which occur in both MW and PWG, it may be an easy programmatic solution, but not sure how to handle the cases where there is no matching headword in PWG.

First option is to look for those in pwk, @drdhaval2785, and then in the pwkvn and also the "leftout" VN portions of PWG volumes 1-4 and 6.

@funderburkjim would you mind writing another "comparative display" program, to show MW | PWG | pwk + pwkvn (no need of having SCH in this case) in one screen, similar to https://sanskrit-lexicon.uni-koeln.de/scans/csl-apidev/pwkvn/03/?

[I was thinking of asking you this for many days now, but waiting for a suitable time.]

gasyoun commented 2 years ago

We can assume there should be consistency in accent between MW and the Boehtlingk dictionaries (PW, PWG).

Not always so, as the ERRATA have not been implemented.

and then in the pwkvn and also the "leftout" VN portions of PWG volumes 1-4 and 6.

Addresses that errata portion.

My hunch is that we should keep the accent as per the second part of the compound, and ignore the accent of first part of compound. Not sure.

And it is stated by Dhaval who did work on accent issues programmatically years ago.

I see that quite many of those portions are missing in the MW digitisation, though occasionally present scattered (just like the nom. case endings that I was talking about all these days).

A single sample?

[I am quite sure there still would be more entries in the text, that need to be identified and corrected.]

We do not care about the text yet, only headwords.

bhAShyakAra says - it's impossible to list all sAdhu-shabda-s

Can you trace the statement, please?

So, it is not inconceivable that the dict makers would have used the very same sUtras that we refer to today when determining svara-s in some cases.

They used only Vedic svaras, they did not determine ANYTHING.

vvasuki commented 2 years ago

So, it is not inconceivable that the dict makers would have used the very same sUtras that we refer to today when determining svara-s in some cases.

They used only Vedic svaras, they did not determine ANYTHING.

How are you so sure? Whitney deals with svara-s quite well in his grammar. They would be dumb to not have used simple rules which they would have doubtless encountered via sAyaNa's commentary and native informants.

My hunch is that we should keep the accent as per the second part of the compound, and ignore the accent of first part of compound. Not sure.

And it is stated by Dhaval who did work on accent issues programmatically years ago.

Accent in compounds is not a "one rule for all" thing, if that's what you're talking about. Major rules are summarized here .

Best to show the accents of parts separately in case accent for the whole cannot be determined by lookup.

bhAShyakAra says - it's impossible to list all sAdhu-shabda-s

Can you trace the statement, please?

Vaguely recalled from paspashAhnika of mahAbhAShya, but could not find it exactly now. Just found this, which implies that - लघ्वर्थं चाध्येयं व्याकरणम् । 'ब्राह्मणेनावश्यं शब्दा ज्ञेयाः' इति, न चान्तरेण व्याकरणं लघुनोपायेन शब्दाः शक्या विज्ञातुम् ॥ . So, might be in kaiyaTa's comment.

vvasuki commented 2 years ago

Can you trace the statement, please?

Vaguely recalled from paspashAhnika of mahAbhAShya, but could not find it exactly now. Just found this, which implies that - लघ्वर्थं चाध्येयं व्याकरणम् । 'ब्राह्मणेनावश्यं शब्दा ज्ञेयाः' इति, न चान्तरेण व्याकरणं लघुनोपायेन शब्दाः शक्या विज्ञातुम् ॥ . So, might be in kaiyaTa's comment.

Found it -

अथैतस्मिञ् शब्दोपदेशे सति किं शब्दानां प्रतिपत्तौ प्रतिपदपाठः कर्तव्यः - गौरश्वः पुरुषो हस्ती शकुनिर् मृगो ब्राह्मण इत्येवमादयः शब्दाः पठितव्याः ?
नेत्याह । अनभ्युपाय एष शब्दानां प्रतिपत्तौ प्रतिपदपाठः ॥ एवं हि श्रूयते - 'बृहस्पतिर् इन्द्राय दिव्यं वर्षसहस्रं प्रतिपदोक्तानां शब्दानां शब्दपारायणं प्रोवाच नान्तं जगाम' ॥ बृहस्पतिश्च प्रवक्ता, इन्द्रश्चाध्येता, दिव्यं वर्षसहस्रमध्ययनकालः, न चान्तं जगाम । किं पुनरद्यत्वे ? यः सर्वथा चिरं जीवति - वर्षशतं जीवति । चतुर्भिश् च प्रकारैर् विद्योपयुक्ता भवति - आगम-कालेन, स्वाध्याय-कालेन, प्रवचन-कालेन, व्यवहार-कालेनेति । तत्र चास्यागमकालेनैवायुः पर्युपयुक्तं स्यात् । तस्माद् अनभ्युपायः शब्दानां प्रतिपत्तौ प्रतिपदपाठः॥+++(4)+++
कथं तर्हीमे शब्दाः प्रतिपत्तव्याः? किंचित् सामान्य-विशेषवल्-लक्षणं प्रवर्त्यम् । येनाल्पेन यत्नेन महतो महतः शब्दौघान् प्रतिपद्येरन् ॥ किं पुनस् तत् ? उत्सर्गापवादौ । कश्चिदुत्सर्गः कर्तव्यः, कश्चिदपवादः ॥

Andhrabharati commented 2 years ago

For headwords which occur in both MW and PWG, it may be an easy programmatic solution, but not sure how to handle the cases where there is no matching headword in PWG.

I just recalled why I said I have no great expectations in programmatic approach to @funderburkjim in the other (parent) issue.

There are quite some entries that were to be corrected for accents in both PWG and pwk (& pwkvn).

So even if the metalines' k2 entries are compared between MW and these, those VN forms still remain uncaught, for those are lying in the body portion still, and not carried into the HW portion yet.

It was with my intervention that this correction has happened in just MW (last year), from its annexure data.

@drdhaval2785 and @funderburkjim may think of getting some means to cover this point in a programmatic way.

[I have some other points at the back of my mind, and would post subsequently at some time.]

Andhrabharati commented 2 years ago

A single sample

I see no use of giving any example, @gasyoun !!

I don't have to prove here that I have looked into the print (pdf) and text (cdsl file) data close enough.

Andhrabharati commented 2 years ago

Just opened the two log files by @drdhaval2785 , and noticed that neither of them contains the 'aMhu' entry.

MW has it as two parts:
<L>126<pc>1,2<k1>aMhu<k2>aMhu<e>2
<L>127<pc>1,2<k1>aMhu<k2>aMhu<e>2B

Incidentally the 2nd part <L>127 is to be with the acute accent as per MW, not the 1st part <L>126 (which is with udAtta accent in PWG and pwk)

Also for future ref. (if it ever happens!), note the accent difference in the cross-referred word, <s>paro/-'Mhu</s> in MW (<L>126 body portion) as against {#paroMhu#} in PWG VN (<L>62430 body portion); and CDSL MW text has this word marked with acute mark (as compared to the print having the grave mark; PWG suggests no accent at all).

Though MW99 has picked up much of its data from Boehtlingk's dictionaries, undoubtedly it did take help from many other sources and also has some independent work (I would estimate the ratio as 75:25 roughly, for the above two portions); thus, we cannot always take Boehtlingk as the ultimate authority!

PWG has-- <L>55<pc>1-0007<k1>aMhu<k2>aMhu/

pwk has-- <L>54<pc>1001-3<k1>aMhu<k2>aMhu/<e>100

Leaving the actual differences between MW and PWG/pwk accents (as above) aside, the main point is that, the 'logic' used in identifying the differences programmatically needs some 'refining'.

Andhrabharati commented 2 years ago

A single sample

I see no use of giving any example, @gasyoun !!

I don't have to prove here that I have looked into the print (pdf) and text (cdsl file) data close enough.

@gasyoun Just not to leave your 'wish' unfulfilled, here is a case showing a correct accent and a wrong accent (wrt the print) in the (suggested) portions in CDSL text data (however neither of these got applied to the resp. HW!!)-

----------------- @drdhaval2785 these are the (suggested) accents in the first part of the compound words in the MW print.

I was referring to all such cases, in my post above-- https://github.com/sanskrit-lexicon/MWS/issues/141#issuecomment-1251973696

Andhrabharati commented 2 years ago

Another (minor) discrepancy of CDSL data wrt the print can be seen in the above snippet-- while the first word 116701 has the accent info preceding the lexical info, the second word 116702 has it following the lex. info!!

The print is consistently having the accent info before the lex. info all through its pages (there might be some cases in opposite, but they would surely be rare).

Andhrabharati commented 2 years ago

Slight correction. 499 is a subset of 3169, and not in addition thereto. So total 3169 differences. Quite sizeable.

@drdhaval2785, @gasyoun

Do you recall that at one time (about a year back?) I was saying that my estimate was that ~2% of HWs would be needing correction, which you were thinking to be 'a bit too much' of estimation (with so much of work/cleaning done over so many years)?

Whether it is an error in spelling or in accent, after all it is an error; and keep in mind that MW wrote a special note about the accents in his introduction, citing its importance specifically in Sanskrit.

[I do not like to say this-- but as I am looking deep into the text, I am finding that more errors got introduced into the text, as against the cleaning part, esp. while tagging various entities; in summary, my feeling is that it was more of tagging that took place than cleaning the MW text.]

Andhrabharati commented 2 years ago

[I am quite sure there still would be more entries in the text, that need to be identified and corrected.]

We do not care about the text yet, only headwords.

Even I was talking about HWs only, @gasyoun [in my words, text is the typed matter (whether it is HWs part or the rest); when I mean the meaning(s) part, I would be specifically saying body portion]!!

gasyoun commented 2 years ago

Do you recall that at one time (about a year back?) I was saying that my estimate was that ~2% of HWs would be needing correction, which you were thinking to be 'a bit too much' of estimation (with so much of work/cleaning done over so many years)?

Yes.

Whether it is an error in spelling or in accent, after all it is an error; and keep in mind that MW wrote a special note about the accents in his introduction, citing its importance specifically in Sanskrit.

Fully agree.

[I do not like to say this-- but as I am looking deep into the text, I am finding that more errors got introduced into the text, as against the cleaning part, esp. while tagging various entities; in summary, my feeling is that it was more of tagging that took place than cleaning the MW text.]

A small portion of tagging is finalised. So I do believe most of the time was spent on headwords. It still remains the top priority for me personally. And I'm happy and indebted to @Andhrabharati undertaking this accent battle as well. All the tags come after. And it will be soon 10 years as we've started the headword cleaning battle.

Andhrabharati commented 2 years ago

So I do believe most of the time was spent on headwords.

And it will be soon 10 years as we've started the headword cleaning battle.

I don't think 'cleaning' just the HWs portion would take more than 4-5 weeks (at the max.), and I am hearing here that the exercise is still incomplete (for 10 years now)!

Andhrabharati commented 2 years ago

[I would happily have done this long back, had I got what I wanted a few months back; now my mind is fully on some other task.]

gasyoun commented 2 years ago

HWs portion would take more than 4-5 weeks (at the max.)

5 weeks of everyday work and still too little amount of time.

my mind is fully on some other task

non-Sanskrit?

funderburkjim commented 2 years ago

mw svarita corrections from pwg

This work was done in issue141 directory.

As mentioned above, there are many more metalines in pwg with svarita accents than in mw:

114 matches for "<k2>.*\^" in buffer: mw.txt  svarita accents
470 matches for "<k2>.*\^" in buffer: pwg.txt

The 114 in mw were compared to the printed mw, and a few corrections made. See change_mw_1.txt. Then, for each of the 470 pwg entries, the corresponding mw entries were compared to printed mw, and changes made. See change_mw_2.txt. There was also one typo corrected in pwg. The analysis uses variations of @drdhaval2785's find_accent_diff.py. After these changes to mw,

849 matches for "<k2>.*\^" in buffer: temp_mw_2.txt

svarita_mw_2.txt lists these metalines.

There are 81 additional cases (see ad2arev.txt) where pwg shows a svarita accent, but either

- no corresponding MW headword is noted, OR
- the corresponding MW headword has no accent (and is not included in the 470).

The significance of this difference between pwg and mw is unknown (to me).

There are numerous (about 125) cases where MW has, in addition to a svarita-accented form, also an unaccented form. For example namasya:

An iast version of the revised (temp_mw_2.txt) mw: mw_2_svarita_iast.zip

gasyoun commented 2 years ago

There are 81 additional case

Interesting indeed, so MW is not a pure copycat.

funderburkjim commented 2 years ago

possible next step: inheritance

71873 matches for "<k2>.*[\/^].*[-—]" in buffer: temp_mw_2.txt

For example, aMSa has an accent, and this accent is, in the CDSL coding, 'inherited' by compounds of aMSa.

<L>10<pc>1,1<k1>aMSa<k2>a/MSa<e>1
 ...
<L>20<pc>1,1<k1>aMSakaraRa<k2>a/MSa—karaRa<e>3
<L>21<pc>1,1<k1>aMSakalpanA<k2>a/MSa—kalpanA<e>3
<L>22<pc>1,1<k1>aMSaprakalpanA<k2>a/MSa—prakalpanA<e>3
 etc.

I think this 'accent inheritance in compounds' principle of CDSL is likely wrong in general. For instance <k2>a/MSa—karaRa should be changed to <k2>aMSa—karaRa (remove accent).

Should the principle be: always remove inherited accents in compounds unless MW specifically says to use them? For example, the inherited accent 'sva^r' should be removed in svargiri and svarjit, but retained in svarcakzas and svarcanas:

<L>259095<pc>1281,1<k1>svar<k2>sva^r<h>4<e>2
 ...
<L>259109<pc>1281,2<k1>svargiri<k2>sva/r—giri<e>3   to change to svar—giri
<L>259110<pc>1281,2<k1>svarcakzas<k2>sva^r—cakzas<e>3  ok
<L>259111<pc>1281,2<k1>svarcanas<k2>sva^r—canas<e>3   ok
<L>259112<pc>1281,2<k1>svarjit<k2>sva/r—ji/t<h>a<e>3    to change to svar—ji/t
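The proposed principle can be sketched as a small string transformation on the k2 field. This is an illustration only: it assumes the inherited part is everything before the last em-dash, and it leaves the decision "MW specifically says to use them" to a caller-supplied flag, since that information comes from the print and not from the k2 string:

```python
EM_DASH = "\u2014"  # the compound divider used in MW k2 fields

def strip_accents(text):
    """Remove the SLP1 udAtta ('/') and svarita ('^') marks."""
    return text.replace("/", "").replace("^", "")

def drop_inherited_accent(k2, retain=False):
    """Remove accent marks from the part of k2 before the last em-dash.

    retain=True stands for the cases where the print says the accent of
    the first part is to be kept; the k2 is then returned unchanged.
    """
    if retain:
        return k2
    head, sep, tail = k2.rpartition(EM_DASH)
    if not sep:
        return k2  # not a compound entry
    return strip_accents(head) + sep + tail

print(drop_inherited_accent("a/MSa\u2014karaRa"))
print(drop_inherited_accent("sva/r\u2014ji/t"))
print(drop_inherited_accent("sva^r\u2014cakzas", retain=True))
```

Non-compound headwords such as a/MSa pass through unchanged, so the function could be applied to every metaline without a separate filter.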

vvasuki commented 2 years ago

71873 matches for "<k2>.*[\/^].*[-—]" in buffer: temp_mw_2.txt For example, aMSa has an accent, and this accent is, in the CDSL coding, 'inherited' by compounds of aMSa.
<L>10<pc>1,1<k1>aMSa<k2>a/MSa<e>1
 ...
<L>20<pc>1,1<k1>aMSakaraRa<k2>a/MSa—karaRa<e>3
<L>21<pc>1,1<k1>aMSakalpanA<k2>a/MSa—kalpanA<e>3
<L>22<pc>1,1<k1>aMSaprakalpanA<k2>a/MSa—prakalpanA<e>3
 etc.
I think this 'accent inheritance in compounds' principle of CDSL is likely wrong in general. For instance <k2>a/MSa—karaRa should be changed to <k2>aMSa—karaRa (remove accent).

In all these particular cases, accent actually would lie in the second part of the compound. For bahuvrIhi compounds (and a few other exceptions), the first constituent's accent would be retained. This is not possible to determine programmatically. So, indeed, it is a good idea to remove accent from both parts <k2>aMSa—karaRa. However, for the convenience of those who care for accents, it would be a great idea to provide independent accents for both the constituents (by adding a string like (a/MSa + ka/raRa)), so that they can work out the final accent of the compound in their heads.

Of course, the accent of the first constituent is easily available, and that of the second part may or may not be determined without ambiguity by a further lookup (eg. both करण॑ and क॑रण exist). So, the accent of the second part can be shown only in unambiguous cases.
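A sketch of the "only in unambiguous cases" idea: index the accented headwords by their unaccented spelling, and show the accent of the second constituent only when exactly one accented form is known. The index below is toy data (the two karaRa forms are the ones mentioned above, ji/t is taken from the svarjit metaline), and the function names are hypothetical:

```python
from collections import defaultdict

def strip_accents(text):
    """Remove the SLP1 udAtta ('/') and svarita ('^') marks."""
    return text.replace("/", "").replace("^", "")

def build_index(accented_forms):
    """Map each unaccented spelling to the set of accented forms seen for it."""
    index = defaultdict(set)
    for form in accented_forms:
        index[strip_accents(form)].add(form)
    return index

def constituent_hint(first_k2, second_key, index):
    """Return a string like '(a/MSa + ka/raRa)'.

    The second part is shown accented only if the index holds exactly
    one accented form for it; otherwise it is shown without accent.
    """
    forms = index.get(second_key, set())
    second = next(iter(forms)) if len(forms) == 1 else second_key
    return f"({first_k2} + {second})"

index = build_index(["ka/raRa", "karaRa/", "ji/t"])
print(constituent_hint("a/MSa", "karaRa", index))  # ambiguous second part
print(constituent_hint("sva^r", "jit", index))     # unambiguous second part
```

Whether such a hint should be displayed at all is the open question in this thread; the sketch only shows that the ambiguity test itself is mechanical.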

Should the principle be? Always remove inherited accents in compounds unless MW specifically says to use them. For example, the inherited accent 'sva^r' should be removed in svargiri and svarjit, but retained in svarcakzas and svarcanas:

Sounds like a good idea!

Andhrabharati commented 2 years ago

There are 81 additional case

Interesting indeed, so MW is not a pure copycat.

See my post above at https://github.com/sanskrit-lexicon/MWS/issues/141#issuecomment-1254703229

Though MW99 has picked up much of its data from Boehtlingk's dictionaries, undoubtedly it did take help from many other sources and also has some independent work (I would estimate the ratio as 75:25 roughly, for the above two portions); thus, we cannot always take Boehtlingk as the ultimate authority!

"and also has some independent work"

Andhrabharati commented 2 years ago

Should the principle be? Always remove inherited accents in compounds unless MW specifically says to use them. For example, the inherited accent 'sva^r' should be removed in svargiri and svarjit, but retained in svarcakzas and svarcanas:

I had posted several messages above, on the same point--

https://github.com/sanskrit-lexicon/MWS/issues/141#issuecomment-1251973696

https://github.com/sanskrit-lexicon/MWS/issues/141#issuecomment-1254753999

https://github.com/sanskrit-lexicon/MWS/issues/141#issuecomment-1254781142

Andhrabharati commented 2 years ago

There are numerous (about 125) cases where MW has, in addition to a svarita-accented form, also an unaccented form.

That is how it is! The accent would change at different contexts, and also at different 'lexical' forms. [Sometimes, even the same lexical form could be having different accents!]

gasyoun commented 2 years ago

it would be a great idea to provide independent accents for both the constituents (by adding a string like (a/MSa + ka/raRa)), so that they can work out the final accent of the compound in their heads.

@Andhrabharati that does not make much sense to me. To give something wrong, that one needs to recalculate in his head.

So, the accent of the second part can be shown only in unambiguous cases.

Programmatically?

75:25 roughly

Missed that one before.

vvasuki commented 2 years ago

it would be a great idea to provide independent accents for both the constituents (by adding a string like (a/MSa + ka/raRa)), so that they can work out the final accent of the compound in their heads.

@Andhrabharati that does not make much sense to me. To give something wrong, that one needs to recalculate in his head.

Saying Water (←H₂ + O₂) instead of ~~H₂O~~ water became wrong since when?

So, the accent of the second part can be shown only in unambiguous cases.

Programmatically?

Yes

75:25 roughly

Missed that one before.

funderburkjim commented 2 years ago

Phase 2

The focus here is on the MW headwords whose 'k2' differs from PWG, where PWG has an udAtta accent, and where MW has non-samAsa entries. (i.e., similar to prior phase, except here udAtta and prior phase was svarita).

The work is still in the issue141 directory. The mw change transactions are in change_mw_3.txt (about 600 lines changed). Details can be seen in the commit above.

Expand k2 syntax

There are headwords where two accented variants are presented.

<L>6230<pc>32,1<k1>anugra<k2>a/n-ugra,an-ugra/<e>1    <<< NOTE THE COMMA
<s>a/n-ugra</s> or <s>an-ugra/</s> ¦ <lex>mf(<s>A</s>)n.</lex> not harsh or violent, mild, gentle, <ls>RV.</ls> &c.<info lex="m:f#A:n"/><info or="6230,anugra"/>

It was convenient to extend the metaline convention to allow a comma-delimited list for k2. See the sections singleton_or_and changes and temp_singleton_k2changes of change_mw_3. This resolved several of the udAtta accent differences with PWG.

At this point, there were 350+ mw entries to compare with pwg (see ad3_rev.txt). The mw print was examined by hand, and the CDSL k2 markup was classified as '+' (CDSL agrees with print; 200+ cases) or 'x' (CDSL k2 markup may disagree with print; 160+ cases). Then changes were made for the 'x' cases; see the temp_change_mw_3b.txt section of change_mw_3.txt. After all the changes, there remain about 275 cases with udAtta accents classified as differing from PWG (out of about 5000 cases). These are shown in file ad3b_rev.txt.
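For anyone reading the extended metalines by program, a minimal sketch of how the comma-delimited k2 could be split into its variants. The function name is illustrative; the only assumption is that k2 runs from `<k2>` to the next `<` or the end of the line:

```python
import re

K2_FIELD = re.compile(r"<k2>([^<]*)")

def k2_variants(metaline):
    """Return the list of k2 variants in a metaline (one item if no comma)."""
    m = K2_FIELD.search(metaline)
    if m is None:
        return []
    return [v.strip() for v in m.group(1).split(",") if v.strip()]

print(k2_variants("<L>6230<pc>32,1<k1>anugra<k2>a/n-ugra,an-ugra/<e>1"))
print(k2_variants("<L>10<pc>1,1<k1>aMSa<k2>a/MSa<e>1"))
```

The k2 field is isolated before splitting because the pc field of MW also contains a comma (e.g. 32,1), so splitting the whole metaline on commas would give wrong results.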

some rules

As the task progressed, I tried to develop rules to handle cases where the accent(s) in mw is not obvious, but requires some sort of inference. Sometimes, these rules are referenced in change_mw_3 (e.g. 50+ instances of Rule 1). The rules are:

Rule 1: only one accent per headword. Drop accent inherited from parent.
Rule 2: parent Xa/Ya Child (<s>am</s>) Xa/yam (i.e., inherited). Example uttaram
Rule 3: do not inherit accent in compound (similar to Rule 1)

Interested parties may wish to examine (in change_mw_3) instances of these rules.
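Rule 1 lends itself to a mechanical check. A small sketch, assuming that each comma-separated k2 variant is judged on its own and that '/' and '^' are the only accent marks; the function names are made up for this example:

```python
def accent_count(k2_variant):
    """Number of udAtta ('/') and svarita ('^') marks in one k2 variant."""
    return k2_variant.count("/") + k2_variant.count("^")

def rule1_violations(k2_field):
    """Return the variants of a k2 field that carry more than one accent."""
    return [v for v in k2_field.split(",") if accent_count(v) > 1]

print(rule1_violations("sva/r\u2014ji/t"))      # two accents in one headword
print(rule1_violations("a/n-ugra,an-ugra/"))  # two variants, one accent each
print(rule1_violations("a/MSa\u2014karaRa"))    # one accent (inherited)
```

Note that a/MSa—karaRa passes this check even though Rule 3 would still remove its inherited accent, so the two rules need separate tests.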

Thus far, I have found in mw print only one exception to the 'only one accent per headword' rule: tAjadBaNga. I changed that to agree with pwg and noted it as an 'mw print change'.

next step

samAsa correction in mw.

funderburkjim commented 2 years ago

a long road

The 'programmatic' mw accent corrections appear to me to be at an end. Further corrections require manual review of mw.txt with the scans for all pages.
I've started this with pages 1-59.
Changes are in change_mw_6.txt. The time required for these 59 pages was about 3 days, or 20 pages per day.

At this rate, the total cleanup remaining will require 2-3 months.

Andhrabharati commented 2 years ago

I had 'sensed' this, much before starting the programmatic approach!!

If the latest iast file is made, I might be able to help in the next portion of the corrections. (after a few days probably)

I had also noticed some pc errors in the metalines, that could also be covered in the manual checking of HWs.

funderburkjim commented 2 years ago

@Andhrabharati Request you to do some random checking of the first batch of changes above, in case I need to make any mid-course corrections in method.

The main non-accent change in metalines that I've noticed is with the 'pc' value for the last item in a column. Quite often, the pc for this item incorrectly refers to the next column, and thus requires correction.

I'm also not examining the VN entries, since I believe you have previously corrected these, and I found no required corrections in the first few VN.

gasyoun commented 2 years ago

275 cases with udAtta accents classified as differing from PWG (out of about 5000 cases)

It brings closer the day when the Reverse Dictionary might get published, thanks to such cleanup rounds.

pc for this item incorrectly refers to the next column, and thus requires correction.

Interesting to note

total cleanup remaining will require 2-3 months.

Major Tom calling for @Andhrabharati ))

Andhrabharati commented 2 years ago

@funderburkjim appears to have decided to work it out himself!!

[I had asked him to make the IAST file to do it; but he instead chose to continue the process with slp1, and has opened a new (continuation) issue]

Andhrabharati commented 2 years ago

And interestingly, seen that he is also filling up (some, if not all, of) the nom. case endings that I was talking about all these days for the past two years, that are missed/truncated in the current CDSL MW data!!

Probably, I might be able to do a full checkup once he finishes the process; though it takes his time, it definitely is a worthy spending at his end.

Andhrabharati commented 2 years ago

Probably @funderburkjim might close this issue, as another "continuation" issue is taken up now.

funderburkjim commented 2 years ago

nom. case endings

I am trying to do that mostly when it seems to give additional information for entries whose base form has an accent. One example is under uzRa/.

The (<s>as</s>) at the masculine form seems to give additional information: the masculine nom. singular is <s>uzRas</s> (we would write it with visarga, <s>uzRaH</s>, but that's beside the point). This is instead of the possibly expected <s>uzRa/s</s>.
So, most of the nominative case endings added by me are like this.

Here is an example where I didn't add back the nom. singular form.

'as' here is the normal nominative singular ending for a masculine noun whose citation form ends in 'a'. And MW seems to me to be inconsistent in inserting the 'as'. For instance, there is no 'as' in uzmaka.

There would be no objection from me if @Andhrabharati, in his later review of mw, decides to be more thorough in adding to mw.txt the nominative endings which remain missing in the digitization.

sanskrit-lexicon / MWS

MW accent correction #141
