Closed: funderburkjim closed this issue 2 years ago
See https://github.com/sanskrit-lexicon/MWS/issues/137#issuecomment-1251359197 for another approach to detecting accent problems.
Kindly look at the following entry, SapaTya, in both PWG and MW.
PWG
<L>97768<pc>7-0062<k1>SapaTya<k2>SapaTya^
{#SapaTya^#}¦ (wie eben) <lex>adj.</lex> {%auf Fluch beruhend%}
<ls>ṚV. 10, 97, 16.</ls>
<LEND>
PWG display
शपथ्य [Printed book page [7-0062](https://www.sanskrit-lexicon.uni-koeln.de/scans/csl-apidev/servepdf.php?dict=PWG&page=7-0062)]
शपथ्य॑ (wie eben) adj. auf Fluch beruhend [Ṛv. 10, 97, 16.](https://sanskrit-lexicon.github.io/rvlinks/rvhymns/rv10.097.html#rv10.097.16) [ID=97768]
MW
<L>212560<pc>1052,1<k1>SapaTya<k2>SapaTya/<e>2
<s>SapaTya/</s> ¦ <lex>mfn.</lex> depending on a curse, (a sin) consisting in cursing or imprecation, <ls>RV.</ls><info lex="m:f:n"/>
<LEND>
MW display
(H2) [Printed book page [1052](https://www.sanskrit-lexicon.uni-koeln.de/scans/csl-apidev/servepdf.php?dict=MW&page=1052),1]
शपथ्य॑ mfn. depending on a curse, (a sin) consisting in cursing or imprecation, RV. [ID=212560]
Note the PWG SapaTya^ versus the MW SapaTya/. SapaTya ends with 'a' in svarita.
This is confusing. For the dictionaries PWG and MW, we should have a consistent SLP1 encoding.
MW typo examples
<L>212560<pc>1052,1<k1>SapaTya<k2>SapaTya/<e>2
<s>SapaTya/</s> ¦ <lex>mfn.</lex> depending on a curse, (a sin) consisting in cursing or imprecation, <ls>RV.</ls><info lex="m:f:n"/>
<LEND>
<L>212561<pc>1052,1<k1>Sapana<k2>Sa/pana<e>2
<s>Sa/pana</s> ¦ <lex>n.</lex> a curse, imprecation, <ls>AV.</ls><info lex="n"/>
<LEND>
MW marks both with '/', whereas their accents are different.
<L>97768<pc>7-0062<k1>SapaTya<k2>SapaTya^
{#SapaTya^#}¦ (wie eben) <lex>adj.</lex> {%auf Fluch beruhend%}
<ls>ṚV. 10, 97, 16.</ls>
<LEND>
<L>97769<pc>7-0062<k1>Sapana<k2>Sa/pana
{#Sa/pana#}¦ (von {#Sap#}) <lex>n.</lex> = {#SapaTa#}
<ls>AK. 1, 1, 5, 10.</ls>
<ls>H. 262.</ls> {%Fluch%}
<ls>TRIK. 3, 2, 9.</ls>
<ls>AV. 1, 28, 3.</ls>
<LEND>
PWG shows the two with different accent marks ('^' versus '/').
Good to see that @drdhaval2785 is also finding the 'inconsistency' issues across the dictionaries' data now.
A standing rule to be followed would be that, whatever the internal representation of the text (file) data is, the end result (display or otherwise) should tally with the printed matter, whether it is in the dictionary itself or in the reference work that it is citing from.
[I would have recommended Dhaval to post the citation matter from the RV and AV as well (as the case may be), to make the argument further strong/appealing. Ultimately they are the ones that are to be referred to, the dictionaries or others are just helping to reach them.]
https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.tsv - The highest priority accent differences
Entries are in the format: headword, AccentInMW, AccentInPWG.
https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.html - HTML can be downloaded and checked manually if needed.
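For readers who want to see how such a log could be produced, here is a minimal sketch. It is not the actual find_accent_diff.py; the function names are hypothetical, and it assumes only the metaline layout visible in this thread (`<L>..<pc>..<k1>..<k2>..`) and that '/' and '^' are the accent marks of interest.

```python
import re

# Metaline layout as seen in the samples quoted in this thread.
META = re.compile(r"<L>(?P<L>[^<]*)<pc>(?P<pc>[^<]*)<k1>(?P<k1>[^<]*)<k2>(?P<k2>[^<]*)")


def k2_by_k1(lines):
    """Map each k1 headword to the set of k2 spellings found in metalines."""
    table = {}
    for line in lines:
        m = META.match(line)
        if m:
            table.setdefault(m.group("k1"), set()).add(m.group("k2"))
    return table


def has_accent(k2):
    """True if the k2 carries an udAtta ('/') or svarita ('^') mark."""
    return "/" in k2 or "^" in k2


def accent_diffs(mw_lines, pwg_lines):
    """Rows of (headword, AccentInMW, AccentInPWG) where both are accented but none agree."""
    mw, pwg = k2_by_k1(mw_lines), k2_by_k1(pwg_lines)
    rows = []
    for k1 in sorted(mw.keys() & pwg.keys()):
        mw_acc = {k for k in mw[k1] if has_accent(k)}
        pwg_acc = {k for k in pwg[k1] if has_accent(k)}
        if mw_acc and pwg_acc and mw_acc.isdisjoint(pwg_acc):
            rows.append((k1, ",".join(sorted(mw_acc)), ",".join(sorted(pwg_acc))))
    return rows
```

Fed the SapaTya and Sapana metalines quoted at the top of this issue, this sketch would report SapaTya and skip Sapana, since both dictionaries agree on Sa/pana.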
Once this is done, we can go to the next step. That is because of the compound issues.
TSV file - https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log_with_compounds.tsv
HTML file - https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log_with_compounds.html
Example:
headword | AccentInMW | AccentInPWG
---|---|---
aMSaBU | a/MSaBU/ | aMSaBU/
Here, the headword is a/MSa. When it is used in a compound, because of the rules governing accent in compounds, it becomes aMSaBU/. PWG correctly captures this.
As the compound parsing in MW was done by some program, the accent portion was not, or could not be, properly handled. This gave rise to a/MSaBU/ instead of aMSaBU/.
We need to convert it back to aMSaBU/, as per PWG.
For headwords which occur in both MW and PWG, it may be an easy programmatic solution, but not sure how to handle the cases where there is no matching headword in PWG.
My hunch is that we should keep the accent as per the second part of the compound, and ignore the accent of first part of compound. Not sure.
> My hunch is that we should keep the accent as per the second part of the compound, and ignore the accent of first part of compound. Not sure.
In general this is the principle to be followed, @drdhaval2785 ; in cases where the accent needs to be retained on the first part, the print has invariably mentioned it just before its (entry word's) lexical info (gender or otherwise).
I see that quite many of those portions are missing in the MW digitisation, though occasionally present scattered (just like the nom. case endings that I was talking about all these days).
See @gasyoun , Dhaval has come out now with two lists (499 + 3169) counting to about 3600 entries, corroborating my estimate of more than couple of thousands as posted at https://github.com/sanskrit-lexicon/MWS/issues/140#issuecomment-1250366037.
[I am quite sure there still would be more entries in the text, that need to be identified and corrected.]
Slight correction. 499 is a subset of 3169, and not in addition thereto. So there are 3169 differences in total. Quite sizeable.
> For headwords which occur in both MW and PWG, it may be an easy programmatic solution, but not sure how to handle the cases where there is no matching headword in PWG.
First option is to look for those in pwk, @drdhaval2785, and then in the pwkvn and also the "leftout" VN portions of PWG volumes 1-4 and 6. There are quite many in those pages, that did not come in PWG VN pages of Vol. 5 and Vol.7 (Jim was thinking the case to be otherwise with some random checks; I had checked all those entries and found that Jim was wrong, but did not pursue the matter with him! Much against my nature, to see the matter to reach its 'proper' end!!!)
On a cursory look at https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.tsv, it seems that a large chunk of the entries end with ya. I am not sure whether some programmatic oddity gave rise to this, or whether there is some grammatical rule which allows optional accents in words formed with suffixes ending in ya.
Just noting it here, so that some grammatically inclined person can have a look.
> First option is to look for those in pwk, @drdhaval2785, and then in the pwkvn and also the "leftout" VN portions of PWG volumes 1-4.
Seems a reasonable way. So the hierarchy is PWG -> PWK -> PWKVN -> PWGVN.
Pardon my ignorance about PWKVN and PWGVN. Have not seen them at all.
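The fallback order could be sketched roughly as below. This is only an illustration: `lookup_accent` is a hypothetical helper, and each source is assumed to have been loaded already into a plain dict from k1 to k2.

```python
def lookup_accent(k1, sources):
    """Return (source_name, k2) from the first source holding an accented k2 for k1.

    `sources` is an ordered list of (name, {k1: k2}) pairs, for example
    PWG, then PWK, then PWKVN, then PWGVN. Returns None if no source has one.
    """
    for name, table in sources:
        k2 = table.get(k1)
        if k2 and ("/" in k2 or "^" in k2):
            return name, k2
    return None
```

A headword missing from PWG would then fall through to PWK, and so on; a `None` result marks the cases that still need a manual look at the print.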
> Pardon my ignorance about PWKVN and PWGVN. Have not seen them at all.
These VN pages are the Additions and Corrections to pwk and PWG volumes [printed at the end of respective volume or in the later volume(s)].
pwkvn (of all the 7 volumes) is hosted under a separate repo, as Jim has his own reasons not to club it with the pwk text, as is done for every other CDSL work. I had proposed to him once to combine them, and then left the matter.
After I pointed out that this matter had been missed altogether, there were some trials to derive the pwkvn data from the SCH data, but finally it was decided to get those pages completely retyped. Jim seems to have funded (James Funderburk > Fund; hope Jim does not mind my saying thus) the digitisation expenses, as per Thomas.
The PWG VN portions of Vol. 5 and Vol. 7 come after the PWG main portions of those two volumes respectively. The other volumes' VN data is lying in some old version of PWG, which turned up while I was digging through the old folders, and I had even posted the data completely 'proofed'; it amounts to just some 1000+ entries/lines.
> Slight correction. 499 is subset of 3169, and not in addition thereto. So total 3169 differences. Quite sizeable.
I did not look at the items at all in the two lists, just seen the numbers.
Thought they would have been different, looking at this line--
> Once this is done, we can go to the next step.
[I did not expect that the entries would have been repeated in another list, once 'done' in a list, as indicated in the above statement.]
> Good to see that @drdhaval2785 is also finding the 'inconsistency' issues across the dictionaries' data now.
> A standing rule to be followed would be that, whatever the internal representation of the text (file) data is, the end result (display or otherwise) should tally with the printed matter, whether it is in the dictionary itself or in the reference work that it is citing from.
But, it should not be the only "end result", or there would be no question of devanAgarI headwords for MW etc.. It is desirable, as mentioned elsewhere (https://github.com/sanskrit-lexicon/csl-ldev/issues/7#issuecomment-1249280796) to additionally (and prominently) show the accent in a standardized format.
> On cursory look at https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.tsv, it seems that large chunk of it ends with ya. I am not sure whether there is some programmatic oddity which gave rise to this, or there is some grammatical rule which allows optional accents with words ending in suffixes ending with ya. Just noting it here, so that some grammatically inclined person can have a look.
@drdhaval2785 This is a matter of jAtya-svarita, that is, a svarita arising from internal sandhi and not as a consequence of following an udAtta. In such words, instead of an udAtta setting the tone, you have a svarita. It occurs only after ya or va. For example, in the case of shapathya, Böhtlingk's Sanskrit-Wörterbuch in kürzerer Fassung deciphers it as शपथि꣫अ . So, it should be easy to detect programmatically.
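A rough sketch of such a detector, taking the statement above (svarita only after y or v) at face value. The regex, the vowel list, and the function name are illustrative assumptions over SLP1 k2 strings, not an existing CDSL utility, and the check is only a heuristic.

```python
import re

# SLP1 vowels; the accent mark follows the vowel it belongs to.
VOWELS = "aAiIuUfFxXeEoO"

# A svarita mark (^) on a vowel that is immediately preceded by y or v.
JATYA = re.compile(rf"[yv][{VOWELS}]\^")


def looks_like_jatya_svarita(k2):
    """Heuristic: True if k2 has a svarita on a vowel right after y or v."""
    return bool(JATYA.search(k2))
```

With this, SapaTya^ and sva^r are flagged, while SapaTya/ and Sa/pana are not. An MW k2 that carries '/' where PWG has a flagged '^' would then be a candidate for checking against the print.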
Good point raised @vvasuki . Thanks.
> Ultimately they are the ones that are to be referred to, the dictionaries or others are just helping to reach them.
An important point of clarification regarding the above. RV, SV, AV etc.. are NOT the only source of svara-s (as the bhAShyakAra says - it's impossible to list all sAdhu-shabda-s) - we have vyAkaraNa to deduce svara-s (which are incidentally a must in truly "proper" laukika speech as per shAstra). So, it is not inconceivable that the dict makers would have used the very same sUtras that we refer to today when determining svara-s in some cases.
> > For headwords which occur in both MW and PWG, it may be an easy programmatic solution, but not sure how to handle the cases where there is no matching headword in PWG.
> First option is to look for those in pwk, @drdhaval2785, and then in the pwkvn and also the "leftout" VN portions of PWG volumes 1-4 and 6.
@funderburkjim would you mind writing another "comparative display" program, to show MW | PWG | pwk + pwkvn (no need of having SCH in this case) in one screen, similar to https://sanskrit-lexicon.uni-koeln.de/scans/csl-apidev/pwkvn/03/?
[I was thinking of asking you this for many days now, but waiting for a suitable time.]
> We can assume there should be consistency in accent between MW and the Boehtlingk dictionaries (PW, PWG).
Not always so, as the errata have not been implemented.
> and then in the pwkvn and also the "leftout" VN portions of PWG volumes 1-4 and 6.
Addresses that errata portion.
> My hunch is that we should keep the accent as per the second part of the compound, and ignore the accent of first part of compound. Not sure.
And this is stated by Dhaval, who worked on accent issues programmatically years ago.
> I see that quite many of those portions are missing in the MW digitisation, though occasionally present scattered (just like the nom. case endings that I was talking about all these days).
A single sample?
> [I am quite sure there still would be more entries in the text, that need to be identified and corrected.]
We do not care about the text yet, only headwords.
> bhAShyakAra says - it's impossible to list all sAdhu-shabda-s
Can you trace the statement, please?
> So, it is not inconceivable that the dict makers would have used the very same sUtras that we refer to today when determining svara-s in some cases.
They used only Vedic svaras; they did not determine anything.
> > So, it is not inconceivable that the dict makers would have used the very same sUtras that we refer to today when determining svara-s in some cases.
> They used only Vedic svaras; they did not determine anything.
How are you so sure? Whitney deals with svara-s quite well in his grammar. They would be dumb to not have used simple rules which they would have doubtless encountered via sAyaNa's commentary and native informants.
> > My hunch is that we should keep the accent as per the second part of the compound, and ignore the accent of first part of compound. Not sure.
> And this is stated by Dhaval, who worked on accent issues programmatically years ago.
Accent in compounds is not a "one rule for all" thing, if that's what you're talking about. Major rules are summarized here .
Best to show the accents of parts separately in case accent for the whole cannot be determined by lookup.
> > bhAShyakAra says - it's impossible to list all sAdhu-shabda-s
> Can you trace the statement, please?
Vaguely recalled from paspashAhnika of mahAbhAShya, but could not find it exactly now. Just found this, which implies that - लघ्वर्थं चाध्येयं व्याकरणम् । 'ब्राह्मणेनावश्यं शब्दा ज्ञेयाः' इति, न चान्तरेण व्याकरणं लघुनोपायेन शब्दाः शक्या विज्ञातुम् ॥ . So, might be in kaiyaTa's comment.
> > Can you trace the statement, please?
> Vaguely recalled from paspashAhnika of mahAbhAShya, but could not find it exactly now. Just found this, which implies that - लघ्वर्थं चाध्येयं व्याकरणम् । 'ब्राह्मणेनावश्यं शब्दा ज्ञेयाः' इति, न चान्तरेण व्याकरणं लघुनोपायेन शब्दाः शक्या विज्ञातुम् ॥ . So, might be in kaiyaTa's comment.
Found it -
अथैतस्मिञ् शब्दोपदेशे सति किं शब्दानां प्रतिपत्तौ प्रतिपदपाठः कर्तव्यः - गौरश्वः पुरुषो हस्ती शकुनिर् मृगो ब्राह्मण इत्येवमादयः शब्दाः पठितव्याः ?
नेत्याह । अनभ्युपाय एष शब्दानां प्रतिपत्तौ प्रतिपदपाठः ॥ एवं हि श्रूयते - 'बृहस्पतिर् इन्द्राय दिव्यं वर्षसहस्रं प्रतिपदोक्तानां शब्दानां शब्दपारायणं प्रोवाच नान्तं जगाम' ॥ बृहस्पतिश्च प्रवक्ता, इन्द्रश्चाध्येता, दिव्यं वर्षसहस्रमध्ययनकालः, न चान्तं जगाम । किं पुनरद्यत्वे ? यः सर्वथा चिरं जीवति - वर्षशतं जीवति । चतुर्भिश् च प्रकारैर् विद्योपयुक्ता भवति - आगम-कालेन, स्वाध्याय-कालेन, प्रवचन-कालेन, व्यवहार-कालेनेति । तत्र चास्यागमकालेनैवायुः पर्युपयुक्तं स्यात् । तस्माद् अनभ्युपायः शब्दानां प्रतिपत्तौ प्रतिपदपाठः॥+++(4)+++
कथं तर्हीमे शब्दाः प्रतिपत्तव्याः? किंचित् सामान्य-विशेषवल्-लक्षणं प्रवर्त्यम् । येनाल्पेन यत्नेन महतो महतः शब्दौघान् प्रतिपद्येरन् ॥ किं पुनस् तत् ? उत्सर्गापवादौ । कश्चिदुत्सर्गः कर्तव्यः, कश्चिदपवादः ॥
> For headwords which occur in both MW and PWG, it may be an easy programmatic solution, but not sure how to handle the cases where there is no matching headword in PWG.
I just recalled why I told @funderburkjim, in the other (parent) issue, that I have no great expectations of a programmatic approach.
There are quite some entries that were to be corrected for accents in both PWG and pwk (& pwkvn).
So even if the metalines' k2 entries are compared between MW and these, those VN forms still remain uncaught, for those are lying in the body portion still, and not carried into the HW portion yet.
It was with my intervention that this correction has happened in just MW (last year), from its annexure data.
@drdhaval2785 and @funderburkjim may think of getting some means to cover this point in a programmatic way.
[I have some other points at the back of my mind, and would post subsequently at some time.]
> A single sample
I see no use of giving any example, @gasyoun !!
I don't have to prove here that I have looked into the print (pdf) and text (cdsl file) data close enough.
Just opened the two log files by @drdhaval2785 , and noticed that neither of them contains the 'aMhu' entry.
MW has it as two parts:
<L>126<pc>1,2<k1>aMhu<k2>aMhu<e>2
<L>127<pc>1,2<k1>aMhu<k2>aMhu<e>2B
Incidentally, the 2nd part, <L>127, is to be with the acute accent as per MW, not the 1st part, <L>126 (which is with udAtta accent in PWG and pwk).
Also for future ref. (if it ever happens!), note the accent difference in the cross-referred word, <s>paro/-'Mhu</s> in MW (<L>126 body portion), as against {#paroMhu#} in PWG VN (<L>62430 body portion); and the CDSL MW text has this word marked with the acute mark (whereas the print has the grave mark; PWG suggests no accent at all).
Though MW99 has picked up much of its data from the Boethlingk's dictionaries, undoubtedly it did take help from many other sources and also has some independent work (I would estimate the ratio as 75:25 roughly, for the above two portions); thus, we cannot always take Boethlingk as the ultimate authority!
PWG has-- <L>55<pc>1-0007<k1>aMhu<k2>aMhu/
pwk has-- <L>54<pc>1001-3<k1>aMhu<k2>aMhu/<e>100
Leaving the actual differences between MW and PWG/pwk accents (as above) aside, the main point is that, the 'logic' used in identifying the differences programmatically needs some 'refining'.
> > A single sample
> I see no use of giving any example, @gasyoun !!
> I don't have to prove here that I have looked into the print (pdf) and text (cdsl file) data close enough.
@gasyoun Just not to leave your 'wish' unfulfilled, here is a case showing a correct accent and a wrong accent (wrt the print) in the (suggested) portions in CDSL text data (however neither of these got applied to the resp. HW!!)-
-----------------
@drdhaval2785 these are the (suggested) accents in the first part of the compound words in the MW print.
I was referring to all such cases, in my post above-- https://github.com/sanskrit-lexicon/MWS/issues/141#issuecomment-1251973696
Another (minor) discrepancy of CDSL data wrt the print can be seen in the above snippet-- while the first word 116701 has the accent info preceding the lexical info, the second word 116702 has it following the lex. info!!
The print is consistently having the accent info before the lex. info all through its pages (there might be some cases in opposite, but they would surely be rare).
> Slight correction. 499 is subset of 3169, and not in addition thereto. So total 3169 differences. Quite sizeable.
@drdhaval2785, @gasyoun
Do you recall that at one time (about a year back?) I was saying that my estimate was that ~2% of HWs would be needing correction, which you were thinking to be 'a bit too much' of estimation (with so much of work/cleaning done over so many years)?
Whether it is an error in spelling or in accent, it is after all an error; and keep in mind that MW wrote a special note about the accents in his introduction, citing their importance specifically in Sanskrit.
[I do not like to say this-- but as I am looking deep into the text, I am finding that more errors got introduced into the text, as against the cleaning part, esp. while tagging various entities; in summary, my feeling is that it was more of tagging that took place than cleaning the MW text.]
> > [I am quite sure there still would be more entries in the text, that need to be identified and corrected.]
> We do not care about the text yet, only headwords.
Even I was talking about HWs only, @gasyoun [in my words, text is the typed matter (whether it is HWs part or the rest); when I mean the meaning(s) part, I would be specifically saying body portion]!!
> Do you recall that at one time (about a year back?) I was saying that my estimate was that ~2% of HWs would be needing correction, which you were thinking to be 'a bit too much' of estimation (with so much of work/cleaning done over so many years)?
Yes.
> Whether it is an error in spelling or in accent, it is after all an error; and keep in mind that MW wrote a special note about the accents in his introduction, citing their importance specifically in Sanskrit.
Fully agree.
> [I do not like to say this-- but as I am looking deep into the text, I am finding that more errors got introduced into the text, as against the cleaning part, esp. while tagging various entities; in summary, my feeling is that it was more of tagging that took place than cleaning the MW text.]
Only a small portion of the tagging is finalised. So I do believe most of the time was spent on headwords. It still remains the top priority for me personally. And I'm happy and indebted to @Andhrabharati for undertaking this accent battle as well. All the tags come after. And it will soon be 10 years since we started the headword cleaning battle.
> So I do believe most of the time was spent on headwords.
> And it will soon be 10 years since we started the headword cleaning battle.
I don't think 'cleaning' just the HWs portion would take more than 4-5 weeks (at the max.), and I am hearing here that the exercise is still incomplete (after 10 years now)!
[I would have happily done this long back, had I got what I wanted a few months back; now my mind is fully on some other task.]
> HWs portion would take more than 4-5 weeks (at the max.)
5 weeks of everyday work, and that is still too little time.
> my mind is fully on some other task
non-Sanskrit?
This work was done in issue141 directory.
As mentioned above, there are many more metalines in pwg with svarita accents than in mw:
114 matches for "<k2>.*\^" in buffer: mw.txt (svarita accents)
470 matches for "<k2>.*\^" in buffer: pwg.txt
The 114 in mw were compared to the printed mw, and a few corrections made. See change_mw_1.txt. Then, for each of the 470 pwg entries, the corresponding mw entries were compared to printed mw, and changes made. See change_mw_2.txt. There was also one typo corrected in pwg. The analysis uses variations of @drdhaval2785 's find_accent_diff.py . After these changes to mw,
849 matches for "<k2>.*\^" in buffer: temp_mw_2.txt
svarita_mw_2.txt lists these metalines.
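The counts above appear to come from an editor search. An equivalent count in Python could look like the sketch below; the pattern is the one quoted above, while the function name and the assumption that metalines start with `<L>` are mine.

```python
import re

# Same pattern as the editor search quoted above.
SVARITA_K2 = re.compile(r"<k2>.*\^")


def svarita_metalines(lines):
    """Return the metalines (lines starting with <L>) whose k2 carries a svarita mark."""
    return [ln for ln in lines if ln.startswith("<L>") and SVARITA_K2.search(ln)]
```

Calling `len(svarita_metalines(open("mw.txt", encoding="utf-8")))` would then give the kind of number listed above, assuming mw.txt is UTF-8 and has one metaline per line.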
There are 81 additional cases (see ad2arev.txt) where pwg shows a svarita accent, but either
There are numerous (about 125) cases where MW has, in addition to a svarita-accented form, also an unaccented form. For example namasya:
An iast version of the revised (temp_mw_2.txt) mw: mw_2_svarita_iast.zip
> There are 81 additional case
Interesting indeed, so MW is not a pure copycat.
71873 matches for "<k2>.*[\/^].*[-—]" in buffer: temp_mw_2.txt
For example, aMSa has an accent, and this accent is, in the CDSL coding, 'inherited' by
compounds of aMSa.
<L>10<pc>1,1<k1>aMSa<k2>a/MSa<e>1
...
<L>20<pc>1,1<k1>aMSakaraRa<k2>a/MSa—karaRa<e>3
<L>21<pc>1,1<k1>aMSakalpanA<k2>a/MSa—kalpanA<e>3
<L>22<pc>1,1<k1>aMSaprakalpanA<k2>a/MSa—prakalpanA<e>3
etc.
I think this 'accent inheritance in compounds' principle of CDSL is likely wrong in general. For instance, <k2>a/MSa—karaRa should be changed to <k2>aMSa—karaRa (remove accent).
Should the principle be: always remove inherited accents in compounds unless MW specifically says to use them?
For example, the inherited accent 'sva^r' should be removed in svargiri and svarjit,
but retained in svarcakzas and svarcanas:
<L>259095<pc>1281,1<k1>svar<k2>sva^r<h>4<e>2
...
<L>259109<pc>1281,2<k1>svargiri<k2>sva/r—giri<e>3 to change to svar—giri
<L>259110<pc>1281,2<k1>svarcakzas<k2>sva^r—cakzas<e>3 ok
<L>259111<pc>1281,2<k1>svarcanas<k2>sva^r—canas<e>3 ok
<L>259112<pc>1281,2<k1>svarjit<k2>sva/r—ji/t<h>a<e>3 to change to svar—ji/t
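The proposed principle could be sketched as follows. This is only an illustration under stated assumptions: the function name and the `keep` flag are hypothetical, `keep` stands for "MW specifically says to use the accent" (something decided from the print, not by this code), and only the '—' separator and the '/' and '^' marks seen in the examples above are handled.

```python
import re

ACCENT_MARKS = re.compile(r"[/^]")


def strip_inherited_accent(k2, keep=False):
    """Drop accent marks from the first member of a compound k2.

    The first member is everything before the first '—'. Non-compound
    k2 values, and compounds flagged with keep=True, are returned unchanged.
    """
    if keep or "—" not in k2:
        return k2
    first, rest = k2.split("—", 1)
    return ACCENT_MARKS.sub("", first) + "—" + rest
```

Applied to the examples above, `sva/r—giri` becomes `svar—giri` and `sva/r—ji/t` becomes `svar—ji/t`, while `sva^r—cakzas` passed with `keep=True` stays as it is.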
> 71873 matches for "<k2>.*[\/^].*[-—]" in buffer: temp_mw_2.txt
> For example, aMSa has an accent, and this accent is, in the CDSL coding, 'inherited' by compounds of aMSa.
> <L>10<pc>1,1<k1>aMSa<k2>a/MSa<e>1
> ...
> <L>20<pc>1,1<k1>aMSakaraRa<k2>a/MSa—karaRa<e>3
> <L>21<pc>1,1<k1>aMSakalpanA<k2>a/MSa—kalpanA<e>3
> <L>22<pc>1,1<k1>aMSaprakalpanA<k2>a/MSa—prakalpanA<e>3
> etc.
> I think this 'accent inheritance in compounds' principle of CDSL is likely wrong in general. For instance, <k2>a/MSa—karaRa should be changed to <k2>aMSa—karaRa (remove accent).
In all these particular cases, the accent actually would lie in the second part of the compound. For bahuvrIhi compounds (and a few other exceptions), the first constituent's accent would be retained. This is not possible to determine programmatically. So, indeed, it is a good idea to remove the accent from both parts: <k2>aMSa—karaRa. However, for the convenience of those who care for accents, it would be a great idea to provide independent accents for both the constituents (by adding a string like (a/MSa + ka/raRa)), so that they can work out the final accent of the compound in their heads.
Of course, the accent of the first constituent is easily available, and that of the second part may or may not be determined without ambiguity by a further lookup (eg. both करण॑ and क॑रण exist). So, the accent of the second part can be shown only in unambiguous cases.
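A small sketch of that idea, showing the second constituent's accent only when the lookup is unambiguous. Everything here is hypothetical: the function name, the output format, and the `accented_forms` table, which is assumed to map an unaccented headword to the set of accented k2 forms found for it.

```python
def constituent_hint(first_k2, second_k1, accented_forms):
    """Build a display string such as '(a/MSa + ka/raRa)'.

    The second constituent is shown with its accent only if exactly one
    accented form is known for it; otherwise it is shown unaccented.
    """
    forms = accented_forms.get(second_k1, set())
    if len(forms) == 1:
        return f"({first_k2} + {next(iter(forms))})"
    return f"({first_k2} + {second_k1})"
```

With two accented forms on record for karaRa, the hint would fall back to `(a/MSa + karaRa)`, leaving the reader to consult both entries.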
> Should the principle be? Always remove inherited accents in compounds unless MW specifically says to use them. For example, the inherited accent 'sva^r' should be removed in svargiri and svarjit, but retained in svarcakzas and svarcanas:
Sounds like a good idea!
> > There are 81 additional case
> Interesting indeed, so MW is not a pure copycat.
See my post above at https://github.com/sanskrit-lexicon/MWS/issues/141#issuecomment-1254703229
> Though MW99 has picked up much of its data from the Boethlingk's dictionaries, undoubtedly it did take help from many other sources and also has some independent work (I would estimate the ratio as 75:25 roughly, for the above two portions); thus, we cannot always take Boethlingk as the ultimate authority!
"and also has some independent work"
> Should the principle be? Always remove inherited accents in compounds unless MW specifically says to use them. For example, the inherited accent 'sva^r' should be removed in svargiri and svarjit, but retained in svarcakzas and svarcanas:
I had posted several messages above, on the same point--
https://github.com/sanskrit-lexicon/MWS/issues/141#issuecomment-1251973696
https://github.com/sanskrit-lexicon/MWS/issues/141#issuecomment-1254753999
https://github.com/sanskrit-lexicon/MWS/issues/141#issuecomment-1254781142
> There are numerous (about 125) cases where MW has, in addition to a svarita-accented form, also an unaccented form.
That is how it is! The accent would change at different contexts, and also at different 'lexical' forms. [Sometimes, even the same lexical form could be having different accents!]
> it would be a great idea to provide independent accents for both the constituents (by adding a string like (a/MSa + ka/raRa)), so that they can work out the final accent of the compound in their heads.
@Andhrabharati that does not make much sense to me. To give something wrong, that one needs to recalculate in his head.
> So, the accent of the second part can be shown only in unambiguous cases.
Programmatically?
> 75:25 roughly
Missed that one before.
> > it would be a great idea to provide independent accents for both the constituents (by adding a string like (a/MSa + ka/raRa)), so that they can work out the final accent of the compound in their heads.
> @Andhrabharati that does not make much sense to me. To give something wrong, that one needs to recalculate in his head.
Saying Water (←H₂ + O₂) instead of H₂O water became wrong since when?
> > So, the accent of the second part can be shown only in unambiguous cases.
> Programmatically?
Yes
> > 75:25 roughly
> Missed that one before.
The focus here is on the MW headwords whose 'k2' differs from PWG, where PWG has an udAtta accent, and where MW has non-samAsa entries (i.e., similar to the prior phase, except that here the accent is udAtta, whereas in the prior phase it was svarita).
The work is still in the issue141 directory. The mw change transactions are in change_mw_3.txt (about 600 lines changed) . Details can be seen in the commit above.
There are headwords where two accented variants are presented.
<L>6230<pc>32,1<k1>anugra<k2>a/n-ugra,an-ugra/<e>1 <<< NOTE THE COMMA
<s>a/n-ugra</s> or <s>an-ugra/</s> ¦ <lex>mf(<s>A</s>)n.</lex> not harsh or violent, mild, gentle, <ls>RV.</ls> &c.<info lex="m:f#A:n"/><info or="6230,anugra"/>
It was convenient to extend the metaline convention to allow a comma-delimited list for k2. See the sections singleton_or_and changes and temp_singleton_k2changes of change_mw_3. This resolved several of the udAtta accent differences with PWG. At this point, there were 350+ mw entries to compare with pwg (see ad3_rev.txt). The mw print was examined by hand, and the CDSL k2 markup was classified as '+' (CDSL agrees with print; 200+ cases) or 'x' (CDSL k2 markup may disagree with print; 160+ cases). Then changes were made for the 'x' cases; see the temp_change_mw_3b.txt section of change_mw_3.txt. After all the changes, there remain about 275 cases with udAtta accents classified as differing from PWG (out of about 5000 cases). These are shown in file ad3b_rev.txt.
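To illustrate how a comparison program might read the extended convention, here is a minimal sketch. The helper names are hypothetical, and the normalization (dropping '-' and '—' before comparing) is an assumption about how an MW k2 would be matched against a PWG k2 that has no hyphens.

```python
import re

K2_FIELD = re.compile(r"<k2>([^<]*)")


def k2_variants(metaline):
    """Return the comma-delimited k2 variants of a metaline as a list."""
    m = K2_FIELD.search(metaline)
    return m.group(1).split(",") if m else []


def normalize(k2):
    """Drop compound/prefix separators so that only letters and accents remain."""
    return k2.replace("-", "").replace("—", "")


def agrees_with(metaline, other_k2):
    """True if any k2 variant of the metaline matches other_k2 after normalization."""
    target = normalize(other_k2)
    return any(normalize(v) == target for v in k2_variants(metaline))
```

For the anugra metaline above, `k2_variants` yields `a/n-ugra` and `an-ugra/`, so a PWG k2 matching either of the two would no longer be reported as a difference.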
As the task progressed, I tried to develop rules to handle cases where the accent(s) in mw is not obvious, but requires some sort of inference. Sometimes, these rules are referenced in change_mw_3 (e.g. 50+ instances of Rule 1). The rules are:
<s>am</s>
) Xa/yam (i.e., inherited). Example uttaram
Interested parties may wish to examine (in change_mw_3) instances of these rules.
Thus far, I have found in the mw print only one exception to the 'only one accent per headword' rule: tAjadBaNga. I changed that to agree with pwg and noted it as an 'mw print change'.
samAsa correction in mw.
The 'programmatic' mw accent corrections appear to me to be at an end. Further corrections require
manual review of mw.txt with the scans for all pages.
I've started this with pages 1-59.
Changes are in change_mw_6.txt.
The time required for these 59 pages was about 3 days, or 20 pages per day.
At this rate, the total cleanup remaining will require 2-3 months.
I had 'sensed' this, much before starting the programmatic approach!!
If the latest iast file is made, I might be able to help in the next portion of the corrections. (after a few days probably)
I had also noticed some pc errors in the metalines, that could also be covered in the manual checking of HWs.
@Andhrabharati Request you to do some random checking of the first batch of changes above, in case I need to make any mid-course corrections in method.
The main non-accent change in metalines that I've noticed is with the 'pc' value for the last item in a column. Quite often, the pc for this item incorrectly refers to the next column, and thus requires correction.
I'm also not examining the VN entries, since I believe you have previously corrected these, and I found no required corrections in the first few VN.
> 275 cases with udAtta accents classified as differing from PWG (out of about 5000 cases)
It will close the day when the Reverse Dictionary might get published thanks to such cleanup rounds.
> pc for this item incorrectly refers to the next column, and thus requires correction.
Interesting to note
> total cleanup remaining will require 2-3 months.
Major Tom calling for @Andhrabharati ))
@funderburkjim appears to have decided to work it out himself!!
[I had asked him to make the IAST file to do it; but he instead chose to continue the process with slp1, and has opened a new (continuation) issue]
And interestingly, seen that he is also filling up (some, if not all, of) the nom. case endings that I was talking about all these days for the past two years, that are missed/truncated in the current CDSL MW data!!
Probably, I might be able to do a full checkup once he finishes the process; though it takes his time, it definitely is a worthy spending at his end.
Probably @funderburkjim might close this issue, as another "continuation" issue is taken up now.
> nom. case endings
I am trying to do that mostly when it seems to give additional information for entries whose base form has an accent. One example is under uzRa/.
The (<s>as</s>) at the masculine form seems to give additional information: the masculine nom. singular is <s>uzRas</s> (we would write it with visarga, <s>uzRaH</s>, but that's beside the point), instead of the possibly expected <s>uzRa/s</s>.
So, most of the nominative case endings added by me are like this.
Here is an example where I didn't add back the nom. singular form.
'as' here is the normal nominative singular ending for a masculine noun whose citation form ends in 'a'. And MW seems to me to be inconsistent in inserting the 'as'. For instance, there is no 'as' in uzmaka.
There would be no objection from me if @Andhrabharati, in his later review of mw, decides to be more thorough in adding to mw.txt the nominative endings which remain missing in the digitization.
In #140, it was mentioned that there are many errors in the coding of accents in the CDSL version of MW. This issue is devoted to correcting these errors.
It is reasonable to restrict to headwords. The 'k2' (key2) field in the metaline shows accents.
We can assume there should be consistency in accent between MW and the Boehtlingk dictionaries (PW, PWG).
A reasonable first step might be to look at the svarita accents. For instance:
We could do such a comparison by program and print out the exceptions for hand examination.