MW72 corrections to Sanskrit italics, 3-gram

funderburkjim commented 7 years ago

This issue is devoted to corrections of MW72 text, as begun in #320.

The text in mw72 identified as both (a) italic and (b) Sanskrit has been examined for possible spelling errors. In this study, a word is chosen as a possible error if its SLP1 spelling contains an unusual 3-gram (when compared to the 3-grams of all headwords in Sanskrit dictionaries).
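A minimal sketch of the 3-gram check, assuming that "unusual" means "not seen in any headword". The function names and the toy headword list are hypothetical and are not taken from the actual program.

```python
def trigrams(word):
    """Return the set of 3-character substrings of an SLP1 string."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


def known_trigrams(headwords):
    """Collect every 3-gram that occurs in some headword."""
    known = set()
    for hw in headwords:
        known |= trigrams(hw)
    return known


def unknown_trigrams(word, known):
    """Return the sorted 3-grams of `word` that no headword contains."""
    return sorted(trigrams(word) - known)


# Toy list for illustration only; the study used all dictionary headwords.
toy_headwords = ["havis", "kavI", "aMzi"]
known = known_trigrams(toy_headwords)

# With this toy list, 'IMz' and 'vIM' are reported as unknown.
suspects = unknown_trigrams("havIMzi", known)
```

A word with a non-empty result becomes a candidate case; whether it is a real error still has to be judged by a person.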

600 cases have been so identified.

These cases have been broken into smaller batches, identified by batch numbers from 301 to 321. Each batch has 30 cases, except for the first two, which have 15 each.
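The batch layout can be sketched as follows. This is only an illustration of the arithmetic described above; the function name and its parameters are hypothetical.

```python
def batch_ranges(total_cases, first_batch=301, small=15, full=30, n_small=2):
    """Map batch number -> (first case, last case), both inclusive.

    The first `n_small` batches hold `small` cases each, the rest hold
    `full` cases, and the last batch takes whatever remains.
    """
    ranges = {}
    start = 1
    batch = first_batch
    while start <= total_cases:
        size = small if batch - first_batch < n_small else full
        end = min(start + size - 1, total_cases)
        ranges[batch] = (start, end)
        start = end + 1
        batch += 1
    return ranges


layout = batch_ranges(600)  # batches 301..321
```

With 600 cases this gives 21 batches, the last being 321 (cases 571-600).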

There is a User-Interface (UI) for marking corrections. The url for the UI depends on the batch number. Here is the url for batch 302:

http://www.sanskrit-lexicon.uni-koeln.de/scans/MW72Scan/2014/pywork/correctionwork/issue-320/302/update.php
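For convenience, the url for any batch can be built from the batch number. This is a small sketch; the helper name is hypothetical, and the pattern is simply the batch 302 url with the number replaced.

```python
BASE = ("http://www.sanskrit-lexicon.uni-koeln.de/scans/MW72Scan/2014/"
        "pywork/correctionwork/issue-320")


def batch_url(batch):
    """Return the correction UI url for the given batch number."""
    return f"{BASE}/{batch}/update.php"


url_302 = batch_url(302)
```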

funderburkjim commented 7 years ago

Here is a suggested work flow for working on a batch.

Batch check-out
- edit the progress table appearing in the comment below.
- enter your GitHub user name, and the date begun fields for the batch
- Now other users know you are working on this batch.
Make corrections using Batch UI
- In the browser, copy paste the above url for the UI, replacing '302' by your batch number
- In the UI for the batch, work on the cases until all are done. This can be done in multiple sessions.
Batch check-in
- edit the progress table appearing in the comment below.
- enter the Date End field.
Notify me that the batch is ready to install
- enter a comment in this issue such as 'Batch 302 ready for installation'
I'll work through the corrections for the batch, including examination of comments.
- I'll install the corrections at Cologne,
- I will edit the Progress Table below, inserting the installation date into the Installed field.
This will complete the work for the given batch.

funderburkjim commented 7 years ago

Progress Table

Batch	case1-case2	User	Date Begin	Date End	Installed
301	1-15	@funderburkjim	11/25/2016	11/27/2016	11/28/2016
302	16-30	@SergeA	11/28/2016	11/28/2016	11/28/2016
303	31-60	@SergeA	11/29/2016	11/29/2016	11/30/2016
304	61-90	@SergeA	11/29/2016	11/30/2016	11/30/2016
305	91-120	@SergeA	11/30/2016	11/30/2016	12/01/2016
306	121-150	@SergeA	11/30/2016	11/30/2016	12/01/2016
307	151-180	@SergeA	12/01/2016	12/01/2016	12/01/2016
308	181-210	@SergeA	12/01/2016	12/01/2016	12/02/2016
309	211-240	@SergeA	12/01/2016	12/02/2016	12/02/2016
310	241-270	@SergeA	12/02/2016	12/02/2016	12/02/2016
311	271-300	@SergeA	12/07/2016	12/07/2016	12/10/2016
312	301-330	@SergeA	12/07/2016	12/07/2016	12/10/2016
313	331-360	@SergeA	12/09/2016	12/09/2016	12/10/2016
314	361-390	@SergeA	12/10/2016	12/10/2016	12/10/2016
315	391-420	@SergeA	12/10/2016	12/10/2016	12/10/2016
316	421-450	@SergeA	12/13/2016	12/13/2016	12/13/2016
317	451-480	@SergeA	12/13/2016	12/13/2016	12/13/2016
318	481-496	@SergeA	12/13/2016	12/13/2016	12/13/2016

funderburkjim commented 7 years ago

@SergeA

Does this approach sound ok?

I'll assume that you will be the primary worker on these cases. OK?

Note: As of now, batch 301 is complete. I've prepared only batch 302. When batch 302 is completed, I'll go ahead and prepare the rest of the batches.

SergeA commented 7 years ago

Ok. I'll start right now. But there is a little problem. I can't edit this table.

SergeA commented 7 years ago

Finished. Batch 302 ready for installation :) Also, case 13, murcC (amūrćhīt), has a typo in the headword: मुर्च्छ् instead of मुर्छ्.

funderburkjim commented 7 years ago

> I can't edit this table.

To edit the table, you click on the little pencil:

Do the editing, and click the 'Update Comment' button.

If you don't have permissions, let Marcis solve that.

For now, I'll edit the table.

funderburkjim commented 7 years ago

Batch 302 now installed. Everything looked perfect!

> मुर्छ्.

Good to mention things like this. An alternate way to communicate such 'extra changes' is in the 'comment' section for the case in the UI, as I pay attention to these comments during installation.

Notice that the PROGRESS TABLE above is now updated to show the installation has been done. Also, anyone revisiting the UI for batch 302 will see a message THESE CORRECTIONS ARE INSTALLED.

funderburkjim commented 7 years ago

All the batches have now been generated.

You can use these links to get to the batches, if you find the links convenient:

batch 301 batch 302 batch 303 batch 304 batch 305
batch 306 batch 307 batch 308 batch 309 batch 310
batch 311 batch 312 batch 313 batch 314 batch 315
batch 316 batch 317 batch 318

One unexpected thing occurs in the batches: Although there are supposed to be 30 cases, sometimes there are more than 30. This is especially noticeable in:

See comment below. This comment now obsolete

batch 316 (76 actual cases),
batch 317 (205 actual cases)
batch 319 (65 actual cases)

If this proves to be a problem (e.g., updates are too slow), let me know and I can break these up into smaller sets of actual cases.

NOTE: I think the reason for this difference between nominal and actual batch size is:

a case is a particular Sanskrit word in a particular MW72 headword.
Sometimes, a MW72 headword can be long, and can contain multiple copies of the particular Sanskrit word in question.

funderburkjim commented 7 years ago

Tomorrow, I'll set up a similar system for dealing with the Wilson 3-gram cases, and will begin work on them.

@SergeA Happy correcting! Thanks for the help 👍

gasyoun commented 7 years ago

> I can't edit this table.

Done.

http://www.sanskrit-lexicon.uni-koeln.de/scans/MW72Scan/2014/pywork/correctionwork/issue-320/321/update.php

Why twice prayuj?

> You can use these links to get to the batches, if you find the links convenient

Thanks.

> Sometimes, a MW72 headword can be long, and can contain multiple copies of the particular Sanskrit word in question.

Yeah, that seems to be the case in prayuj above as well.

SergeA commented 7 years ago

303 + 304 ready

305 + 306

funderburkjim commented 7 years ago

@gasyoun Glad you noticed the 'prayuj' example, as it indicated several things of interest.

Under headword prayuj, there are two 3-gram cases:
```
yug-GavIMzi:1:prayuj,32610,146337,yug-ghavīṉshi##unknowns=IMz
havIMzi:1:prayuj,32610,146336,havīṉshi##unknowns=IMz
```
- In the IAST spelling we see yug-ghavīṉshi and havīṉshi. The second one is a substring of the first. For the yug-ghavīṉshi case, the program generated a case for line 146337; this is good. But for the havīṉshi case, the program was also generating a potential correction involving line 146337; this is bad.
- The program has now been adjusted to avoid this substring matching problem.
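The substring problem can be illustrated with a small sketch. It assumes that Sanskrit italic text is delimited by `{%` and `%}` (as in the mw72 lines quoted elsewhere in this issue) and that words inside a span are separated by spaces or punctuation other than the hyphen. The function names are hypothetical, and the text of the two sample lines is a placeholder, not the real content of mw72.txt.

```python
import re

ITALIC = re.compile(r"\{%(.*?)%\}")


def italic_words(text):
    """Return the words found inside {%...%} spans; hyphens stay inside words."""
    words = []
    for span in ITALIC.findall(text):
        words.extend(w for w in re.split(r"[\s,.;()]+", span) if w)
    return words


def lines_with_substring(word, lines):
    """Old behaviour: any line whose text contains `word` anywhere."""
    return [n for n, text in lines if word in text]


def lines_with_whole_word(word, lines):
    """Adjusted behaviour: only lines where `word` is a complete italic word."""
    return [n for n, text in lines if word in italic_words(text)]


# Placeholder text; only the line numbers come from the two cases above.
sample = [
    (146336, "<>{%havīṉshi,%} (placeholder text)"),
    (146337, "<>{%yug-ghavīṉshi%} (placeholder text)"),
]

loose = lines_with_substring("havīṉshi", sample)    # both lines
strict = lines_with_whole_word("havīṉshi", sample)  # only line 146336
```

Matching on whole words keeps the havīṉshi case from picking up line 146337, which belongs to the yug-ghavīṉshi case.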
When regenerating the ending batches (316-321) using the improved matching, I noticed that the number of actual cases decreased somewhat (e.g., for batch 317, the number of cases decreased from 205 to 180).
A closer examination of batch 317 cases showed that nearly all the 'extra' cases involve a particular error under headword 'zaz' (hk = SaS).
- The error is that the digitization uses Śh (with an accent over 'S') instead of Sh (no accent) for the retroflex sibilant.
- This error accounts for 96 of the nominal 600 cases, and for many more of the actual cases based on those nominal cases.
- The first occurrence would be in batch 316.
This Śh is ALWAYS an error since Ś (capital S with accent) represents the palatal sibilant, and is not followed by 'h' in any Sanskrit word.
- Further, although the correction to some hypothetical word beginning Śh could be to remove the 'h', since all those 96 cases occur under zaz, it is almost certain that the actual correction required is to remove the accent over the 'S', i.e., to change 'Śh' to 'Sh'.
- As further confirmation of the nature of the change to make, here is the list of words - focus on the IAST spelling before the '##'. You'll see that all of them are some sandhi-altered form of 'zaz'. zaz-examples.txt
- Since there are so many such cases, and since the solution is always the same, it makes sense to write a little program to generate all the corrections.
- This has been done. There were changes on 97 lines of mw72.txt. 89 of these lines were under headword zaz; the other 8 were under other headwords whose slp spelling begins with 'za'
  - za (2), zaRQa (1), zazwi (3), zazWa (1), zANguRya (1)
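A sketch of what such a correction program might look like. It assumes that the change is applied only inside `{%...%}` italic spans; that restriction, the function name, and the sample line are assumptions made for illustration and do not describe the program actually used.

```python
import re

ITALIC_SPAN = re.compile(r"\{%.*?%\}")


def fix_sh(line):
    """Change 'Śh' to 'Sh' inside {%...%} spans.

    Returns (new_line, number_of_replacements).
    """
    count = 0

    def repl(match):
        nonlocal count
        span = match.group(0)
        count += span.count("Śh")
        return span.replace("Śh", "Sh")

    return ITALIC_SPAN.sub(repl, line), count


# Made-up line for illustration; text outside the italic spans is left alone.
fixed, n = fix_sh("<>{%Śhaṭ, Śhaḍ-%} Śiva")
```

Running this over every line of the file and keeping the lines with a non-zero count would give the list of changed lines.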

funderburkjim commented 7 years ago

Next, the ngram3cases.txt file was modified to remove the Śh cases. These occurred starting with former case 437.

After this removal, there are 496 total cases.

The first removed case was in batch 316, so batches 316-318 were regenerated.

Since there are now only 496 nominal cases, batches 319, 320 and 321 are not required.

For completeness, batches 306-315 have also been regenerated; @SergeA has already worked on batches 303, 304, and 305, so these have not been regenerated.

Although the nominal-actual situation is still theoretically possible, with the removal of the Śh cases and the fix for the substring problem, extra actual cases are now rare (the biggest instance is 36 actual cases v. 30 nominal cases in batch 313).

SergeA commented 7 years ago

Marked two similar print errors in MW72:

- chṛd/chṛntte, but ācchṛd/ācchṛnte instead of ācchṛntte
- chid/chintte, but avacchid/avacchinte instead of avacchintte

Perhaps there are more similar cases.

drdhaval2785 commented 7 years ago

> Marked two similar print errors in MW72. chṛd/chṛntte but ācchṛd/ācchṛnte instead of ācchṛntte; chid/chintte but avacchid/avacchinte instead of avacchintte. Perhaps there are more similar cases.

झरो झरि सवर्णे 8.4.65 rule allows both 'ntt' and 'nt' as valid form.

drdhaval2785 commented 7 years ago

Welcome aboard @SergeA.

gasyoun commented 7 years ago

> The program has now been adjusted to avoid this substring matching problem.

Great.

> When regenerating the ending batches (316-321) using the improved matching, I noticed that the number of actual cases decreased some (e.g. for batch 317, the number of cases decreased from 205 to 180).

The better.

> This Śh is ALWAYS an error since Ś (capital S with accent) represents the palatal sibilant, and is not followed by 'h' in any Sanskrit word.

Indeed. MW72 transliteration is miserable in all possible ways.

> This has been done. There were changes on 97 lines of mw72.txt. 89 of these lines were under headword zaz; the other 8 were under other headwords whose slp spelling begins with 'za'

Good job, Jim. Methodological as usual.

> Although the nominal-actual situation is still theoretically possible

Ignore such cases, please.

> झरो झरि सवर्णे 8.4.65 rule allows both 'ntt' and 'nt' as valid form.

I wonder, Dhaval, whether other dictionaries use the ntt variant as well.

SergeA commented 7 years ago

> Marked two similar print errors in MW72. chṛd/chṛntte but ācchṛd/ācchṛnte instead of ācchṛntte; chid/chintte but avacchid/avacchinte instead of avacchintte. Perhaps there are more similar cases.

> झरो झरि सवर्णे 8.4.65 rule allows both 'ntt' and 'nt' as valid form.

Perhaps, by Panini, it is valid. Panini has many complicated rules of optional reduplication and elision of letters. But from the European point of view we have here "d" from the root (chid/chind or chṛd/chṛnd) and "t" from the termination (-te), resulting in "tt" without any additional constrictions. Besides, it is inconsistent to write sometimes chintte and sometimes chinte. I think it was a mere error here due to confusion of त्ते and ते.

> Welcome aboard @SergeA.

Thanx. :)

funderburkjim commented 7 years ago

Re ntt v. nt .

It is tough to decide how to handle (a) grammatical options and (b) author inconsistency.

My view at the moment is that it is premature to impose consistency on MW72; we are dealing with more mundane issues now. Since optional consonant doubling in these cases is grammatically justifiable, I think we should leave the spellings of MW72 as they appear in the text.

@SergeA I suggest you open a new issue, perhaps with the label 'Research', and repeat the arguments that you and Dhaval have raised. This new issue can remain open and therefore visible. When time permits, we could do a complete text-wide examination of 'similar' cases. This data would provide a solid basis for saying that certain variants should be considered print errors. After such an investigation, we might end up with the changes you suggested.

Incidentally, there are many annoying inconsistencies in the 1899 edition of MW as well; especially in use of anusvara or homorganic nasal.

I can dimly imagine a time when we, or someone, make a new-improved MW dictionary, without inconsistencies, and with many other improvements.

gasyoun commented 7 years ago

> Since optional consonant doubling in these cases is grammatically justifiable, I think we should leave the spellings of MW72 as they appear in the text.

And maybe add some tag, the not OK are equal OK?

> Incidentally, there are many annoying inconsistencies in the 1899 edition of MW as well; especially in use of anusvara or homorganic nasal.

Yeah, that's a swamp.

funderburkjim commented 7 years ago

in batch 303, case 19. idAnIms

The 's' is not part of the Sanskrit word; it is an English plural 's'. I'm also not sure about that 'm', as idAni seems likely, as your comment said. For now, I am just moving that 's' out of the scope of the Sanskrit (i.e., marking it separately as non-Sanskrit italic).

{%idānīm%}<nsi>s</nsi>
[That 'nsi' tag is used in mw72 to indicate 'non-sanskrit-italic'.]

funderburkjim commented 7 years ago

in batch 303, case 22.
I think the dental 'd' should be retroflex: īdāṅ-ćakre -> īḍāṅ-ćakre, as this seems to be the periphrastic perfect of īḍ.

32809 old <>{%īḍishe,%} Ved. {%īḷishe%}), {%īdāṅ-ćakre, īḍishyate,%}
32809 new <>{%īḍishe,%} Ved. {%īḷishe%}), {%īḍāṅ-ćakre, īḍishyate,%}
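The old/new record pair above can be applied mechanically. The sketch below assumes that the leading number is a line number in mw72.txt and that a record has the shape `<number> old|new <text>`; the function names and the dictionary used to hold lines are hypothetical.

```python
OLD_REC = "32809 old <>{%īḍishe,%} Ved. {%īḷishe%}), {%īdāṅ-ćakre, īḍishyate,%}"
NEW_REC = "32809 new <>{%īḍishe,%} Ved. {%īḷishe%}), {%īḍāṅ-ćakre, īḍishyate,%}"


def parse_change(old_rec, new_rec):
    """Split an old/new record pair into (line number, old text, new text)."""
    num_old, tag_old, old_text = old_rec.split(" ", 2)
    num_new, tag_new, new_text = new_rec.split(" ", 2)
    if (tag_old, tag_new) != ("old", "new") or num_old != num_new:
        raise ValueError("records do not form an old/new pair")
    return int(num_old), old_text, new_text


def apply_change(lines, lnum, old_text, new_text):
    """Replace line `lnum` in `lines` (dict: number -> text).

    The current text must equal `old_text`, so a change is never applied
    to a line that has already been edited some other way.
    """
    if lines[lnum] != old_text:
        raise ValueError(f"line {lnum} does not match the 'old' text")
    lines[lnum] = new_text


lnum, old_text, new_text = parse_change(OLD_REC, NEW_REC)
lines = {lnum: old_text}  # stand-in for the file contents
apply_change(lines, lnum, old_text, new_text)
```

Checking the 'old' text before replacing is a cheap guard against installing the same correction twice.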

Found confirmation in vcp.

ईड [p= 1008] : ईड¦ स्तुतौ अदा० आत्म० सक० सेट् । ईड्वे ईडिषे ईडिध्वं
ऐडिष्ट ईडाम्--बभूव आस चक्रे । ईडिता ईडिष्यते ऐडि-

Classify as print error. Put comment in UI.

funderburkjim commented 7 years ago

in batch 304:

; Case 12.  L=13389, key1=kawu, dict=mw72, type=p,status=DONE
; kaṭutarāi -> kaṭutarāï
44576 old <>of a plant, {%= tikta-tuṇḍī,%} commonly {%kaṭutarāi.%}
44576 new <>of a plant, {%= tikta-tuṇḍī,%} commonly <nsi>kaṭutarāï.</nsi>

I think that kaṭutarāï is a Tamil word, not Sanskrit, so I have marked it as <nsi>. Can anyone confirm that this word is a plant name in Tamil, or, if I'm wrong, that it is in fact a Sanskrit word?

SergeA commented 7 years ago

> I think we should leave the spellings of MW72 as they appear in the text.

Ok.

> @SergeA I suggest you open a new issue, perhaps with the label 'Research' and repeat the arguments that you and Dhaval have raised.

Could you please do it yourself? You know the right way, and I'm a bit unfamiliar with this GitHub system.

> I think that kaṭutarāï is a Tamil word, not Sanskrit

mentioned in SKD: कटुतुण्डी, स्त्री, (कटु तीव्रं तुण्डमस्याः ।) लताप्रभेदः । कटुतराइ इतिख्याता ।

funderburkjim commented 7 years ago

batch 304, case 24: @SergeA

I marked all three words as non-Sanskrit. Agree?


; kāḷī -> kāḷī, non-Sanskrit.
196641 old <>dark Śālmali ({%= Marāṭhī kāḷī sāmvarī%}) {%= vaṉśa-%}
196641 new <>dark Śālmali (= <nsi>Marāṭhī kāḷī sāmvarī</nsi>) {%= vaṉśa-%}

funderburkjim commented 7 years ago

batch 304, case 29. Changed to print error:

; Kṛitoććhais -> Kṛitoććhais CHANGE NOT MADE
56282 old <>jealous. {%--Kṛitoććhais (ºta-ućº),%} ind. raised on high.
56282 new <>jealous. {%--Kṛitoććais (ºta-ućº),%} ind. raised on high.

Reason: uććais is the word (indeclinable) for 'high'; I don't think there is a word uććhais.

funderburkjim commented 7 years ago

> kaṭutarāï

Since found in SKD, will change back to calling it Sanskrit.

SergeA commented 7 years ago

> uććais is the word (indeclinable) for 'high'

Sure. Mea culpa.

funderburkjim commented 7 years ago

jahāngīrī (Batch 307, case 8). The 'non-Sanskrit?' speculation seems highly likely (Persian) and I've so marked.

See interesting comment

SergeA commented 7 years ago

As far as I can see, MW72 gives many suspicious words in definitions of names of plants, names of places, etc. Often they follow "=" or "commonly". Perhaps those are names from some local dialects. Sometimes he marks them as Hindi, Marathi, etc., and sometimes not. It is very difficult to say whether such a word is Sanskrit or not. In theory every such name or term can be borrowed from any language into Sanskrit and can be used in Sanskrit texts. The only limitation is the alphabet.

funderburkjim commented 7 years ago

Regarding case 8 of batch 314, laghu-kāvaḷī: @SergeA mentions that it might be non-Sanskrit. I'm leaving it marked as Sanskrit, since it seems to be the 3rd element in a list whose first two elements are Sanskrit words (per MW99).

This word also occurs in cases 8,9,10 under hw= DvANkza.

Found confirmation under DvANkza in PW:

gasyoun commented 7 years ago

> first two elements are Sanskrit words

Being headwords equals Sanskrit?

funderburkjim commented 7 years ago

Since kAkolI and kakkolikA are headwords in MW99, they are Sanskrit words; since laGu-kAvaLI is the third in this MW72 list of words, the inference is that it too is a Sanskrit word. As shown, this inference is confirmed, since the word appears in Devanagari in PWG, from which we also infer that PWG considers laGu-kAvaLI to be a Sanskrit word.

gasyoun commented 7 years ago

> PWG considers laGu-kAvaLI to be a Sanskrit word

What I meant is that there are a few rare non-Sanskrit words in Sanskrit dictionaries as headwords, as we've seen in MW, but I hope this is not such a case. The logic makes sense.

SergeA commented 7 years ago

Batch 318 finished. :)

funderburkjim commented 7 years ago

Everything installed.

All done.

Thanks, @SergeA !

sanskrit-lexicon / CORRECTIONS

MW72 corrections to Sanskrit italics, 3-gram #322
