sanskrit-lexicon / CORRECTIONS

Correction history for Cologne Sanskrit Lexicon

MW72 corrections to Sanskrit italics, 3-gram #322

Closed funderburkjim closed 7 years ago

funderburkjim commented 7 years ago

This issue is devoted to corrections of MW72 text, as begun in #320.

The text in mw72 identified as (a) italic and (b) Sanskrit has been examined for possible spelling errors. In this study, a word is chosen as a possible error when its SLP1 spelling contains an unusual 3-gram (when compared to the 3-grams of all headwords in Sanskrit dictionaries).

600 cases have been so identified.
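
The 3-gram screening described above can be sketched roughly as follows. This is only an illustration, not the program actually used for this issue: the function names, the `#` word-boundary padding, and the rule "flag a word if any of its 3-grams never occurs in the headword list" are all assumptions, and the reference list is a toy one.

```python
from typing import Iterable, List, Set


def trigrams(word: str) -> List[str]:
    """Return the 3-grams of a word, padded with '#' to mark word boundaries."""
    padded = f"#{word}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def known_trigrams(headwords: Iterable[str]) -> Set[str]:
    """Collect every 3-gram that occurs in the reference headword list."""
    seen: Set[str] = set()
    for hw in headwords:
        seen.update(trigrams(hw))
    return seen


def unusual_trigrams(word: str, known: Set[str]) -> List[str]:
    """Return the 3-grams of `word` that never occur in the reference list."""
    return [g for g in trigrams(word) if g not in known]


# Toy reference list in SLP1 (hypothetical; the study used all headwords).
reference = ["Iq", "IqA", "cakra", "kawu"]
known = known_trigrams(reference)

print(unusual_trigrams("IqA", known))  # [] : every 3-gram is attested
print(unusual_trigrams("IdA", known))  # ['#Id', 'IdA', 'dA#']
```

A word with a non-empty result would become a candidate case for manual review; whether "unusual" meant "unattested" or merely "rare" in the real study is not stated here.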

These cases have been broken into smaller batches, identified by a batch number from 301 to 321. Each batch has 30 cases, except for the first two, which have 15 each.
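
The batch layout can be reproduced with a few lines of arithmetic. The sketch below is an assumption about how the split works (the function name `batch_ranges` is made up), based only on the sizes stated above: 15 cases for each of the first two batches, 30 for every later one.

```python
from typing import Dict, Tuple


def batch_ranges(total_cases: int, first_batch: int = 301) -> Dict[int, Tuple[int, int]]:
    """Map batch number -> (first case, last case), 1-based and inclusive.

    The first two batches hold 15 cases each; every later batch holds 30,
    except that the last batch may be shorter.
    """
    ranges: Dict[int, Tuple[int, int]] = {}
    batch, start = first_batch, 1
    while start <= total_cases:
        size = 15 if batch < first_batch + 2 else 30
        end = min(start + size - 1, total_cases)
        ranges[batch] = (start, end)
        batch, start = batch + 1, end + 1
    return ranges


ranges = batch_ranges(600)
print(len(ranges))   # 21 batches, numbered 301 to 321
print(ranges[303])   # (31, 60)
```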

There is a User-Interface (UI) for marking corrections. The url for the UI depends on the batch number. Here is the url for batch 302:

http://www.sanskrit-lexicon.uni-koeln.de/scans/MW72Scan/2014/pywork/correctionwork/issue-320/302/update.php
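
For convenience, the url for any batch can be built from its number. This small helper assumes that every batch url follows the same pattern as the batch 302 url above; the name `update_url` is made up for illustration.

```python
BASE_URL = (
    "http://www.sanskrit-lexicon.uni-koeln.de/scans/MW72Scan/2014/"
    "pywork/correctionwork/issue-320"
)


def update_url(batch: int) -> str:
    """Return the UI url for a batch, following the pattern of the batch 302 url."""
    return f"{BASE_URL}/{batch}/update.php"


print(update_url(302))
```
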
funderburkjim commented 7 years ago

Here is a suggested work flow for working on a batch.

funderburkjim commented 7 years ago

Progress Table

| Batch | case1-case2 | User | Date Begin | Date End | Installed |
|---|---|---|---|---|---|
| 301 | 1-15 | @funderburkjim | 11/25/2016 | 11/27/2016 | 11/28/2016 |
| 302 | 16-30 | @SergeA | 11/28/2016 | 11/28/2016 | 11/28/2016 |
| 303 | 31-60 | @SergeA | 11/29/2016 | 11/29/2016 | 11/30/2016 |
| 304 | 61-90 | @SergeA | 11/29/2016 | 11/30/2016 | 11/30/2016 |
| 305 | 91-120 | @SergeA | 11/30/2016 | 11/30/2016 | 12/01/2016 |
| 306 | 121-150 | @SergeA | 11/30/2016 | 11/30/2016 | 12/01/2016 |
| 307 | 151-180 | @SergeA | 12/01/2016 | 12/01/2016 | 12/01/2016 |
| 308 | 181-210 | @SergeA | 12/01/2016 | 12/01/2016 | 12/02/2016 |
| 309 | 211-240 | @SergeA | 12/01/2016 | 12/02/2016 | 12/02/2016 |
| 310 | 241-270 | @SergeA | 12/02/2016 | 12/02/2016 | 12/02/2016 |
| 311 | 271-300 | @SergeA | 12/07/2016 | 12/07/2016 | 12/10/2016 |
| 312 | 301-330 | @SergeA | 12/07/2016 | 12/07/2016 | 12/10/2016 |
| 313 | 331-360 | @SergeA | 12/09/2016 | 12/09/2016 | 12/10/2016 |
| 314 | 361-390 | @SergeA | 12/10/2016 | 12/10/2016 | 12/10/2016 |
| 315 | 391-420 | @SergeA | 12/10/2016 | 12/10/2016 | 12/10/2016 |
| 316 | 421-450 | @SergeA | 12/13/2016 | 12/13/2016 | 12/13/2016 |
| 317 | 451-480 | @SergeA | 12/13/2016 | 12/13/2016 | 12/13/2016 |
| 318 | 481-496 | @SergeA | 12/13/2016 | 12/13/2016 | 12/13/2016 |

funderburkjim commented 7 years ago

@SergeA

Does this approach sound ok?

I'll assume that you will be the primary worker on these cases. OK?

Note: As of now, batch 301 is complete. I've prepared only batch 302. When batch 302 is completed, I'll go ahead and prepare the rest of the batches.

SergeA commented 7 years ago

Ok. I'll start right now. But there is a little problem: I can't edit this table.

SergeA commented 7 years ago

Finished. Batch 302 is ready for installation :) Also, case 13, murcC (amūrćhīt), has a typo in the headword: मुर्च्छ् instead of मुर्छ्.

funderburkjim commented 7 years ago

I can't edit this table.

To edit the table, you click on the little pencil:

image

Do the editing, and click the 'Update Comment' button.

If you don't have permissions, let Marcis solve that.

For now, I'll edit the table.

funderburkjim commented 7 years ago

Batch 302 now installed. Everything looked perfect!

मुर्छ्.

Good to mention things like this. An alternate way to communicate such 'extra changes' is in the 'comment' section for the case in the UI, as I pay attention to these comments during installation.

Notice that the PROGRESS TABLE above is now updated to show the installation has been done. Also, anyone revisiting the UI for batch 302 will see a message THESE CORRECTIONS ARE INSTALLED.

funderburkjim commented 7 years ago

All the batches have now been generated.

You can use these links to get to the batches, if you find the links convenient:

batch 301 batch 302 batch 303 batch 304 batch 305
batch 306 batch 307 batch 308 batch 309 batch 310
batch 311 batch 312 batch 313 batch 314 batch 315
batch 316 batch 317 batch 318

One unexpected thing occurs in the batches: Although there are supposed to be 30 cases, sometimes there are more than 30. This is especially noticeable in:

See comment below. This comment now obsolete

If this proves to be a problem (e.g., updates are too slow), let me know and I can break these up into smaller sets of actual cases.

NOTE: I think the reason for this difference between nominal and actual batch size is:

funderburkjim commented 7 years ago

Tomorrow, I'll set up a similar system for dealing with the Wilson 3-gram cases, and will begin work on them.

@SergeA Happy correcting! Thanks for the help 👍

gasyoun commented 7 years ago

I can't edit this table.

Done.

http://www.sanskrit-lexicon.uni-koeln.de/scans/MW72Scan/2014/pywork/correctionwork/issue-320/321/update.php

Why does prayuj appear twice?

You can use these links to get to the batches, if you find the links convenient

Thanks.

Sometimes, a MW72 headword can be long, and can contain multiple copies of the particular Sanskrit word in question.

Yeah, that seems to be the case in prayuj above as well.

SergeA commented 7 years ago

303 + 304 ready

funderburkjim commented 7 years ago

@gasyoun Glad you noticed the 'prayuj' example, as it indicated several things of interest.

funderburkjim commented 7 years ago

Next, the ngram3cases.txt file was modified to remove the Śh cases - These occurred starting with former case 437.

After this removal, there are 496 total cases.

The removed case was in batch 316. So batches 316-318 were regenerated.

Since there are now only 496 nominal cases, batches 319, 320 and 321 are not required.

For completeness, batches 306-315 have also been regenerated; @SergeA has already worked on batches 303, 304, and 305, so these have not been regenerated.

Although the nominal-actual situation is still theoretically possible, with the removal of the Śh cases and the removal of the substring problem, extra actual cases are now rare (biggest instance: 36 actual cases v. 30 nominal cases in batch 313).
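
The difference between substring matching and whole-word matching can be illustrated with a small sketch. The details of the original substring problem are not spelled out in this thread, so this assumes it means that a flagged word was also matched inside longer words; the helper names and the sample line are made up, and only the `{%...%}` italic markup is taken from the mw72 lines quoted in this issue.

```python
import re
from typing import List

ITALIC = re.compile(r"\{%(.*?)%\}")


def italic_words(line: str) -> List[str]:
    """Split the text of every {%...%} span into words."""
    words: List[str] = []
    for span in ITALIC.findall(line):
        words.extend(w for w in re.split(r"[\s,.;()=-]+", span) if w)
    return words


def substring_hits(line: str, target: str) -> int:
    """Count words that merely contain the target (the over-matching approach)."""
    return sum(1 for w in italic_words(line) if target in w)


def whole_word_hits(line: str, target: str) -> int:
    """Count words that are exactly the target."""
    return sum(1 for w in italic_words(line) if w == target)


# Made-up line using the {%...%} italic markup.
line = "<>{%prayuj, prayujya,%} to join; {%prayuj%} again."
print(substring_hits(line, "prayuj"))   # 3
print(whole_word_hits(line, "prayuj"))  # 2
```

Even with whole-word matching, one line can contain several copies of the same word, which is one way the actual number of cases can still exceed the nominal number.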

SergeA commented 7 years ago

Marked two similar print errors in MW72:

chṛd/chṛntte, but ācchṛd/ācchṛnte instead of ācchṛntte
chid/chintte, but avacchid/avacchinte instead of avacchintte

Perhaps there are more similar cases.

drdhaval2785 commented 7 years ago

Marked two similar print errors in MW72:

chṛd/chṛntte, but ācchṛd/ācchṛnte instead of ācchṛntte
chid/chintte, but avacchid/avacchinte instead of avacchintte

Perhaps there are more similar cases.

झरो झरि सवर्णे 8.4.65 rule allows both 'ntt' and 'nt' as valid form.

drdhaval2785 commented 7 years ago

Welcome aboard @SergeA.

gasyoun commented 7 years ago

The program has now been adjusted to avoid this substring matching problem.

Great.

When regenerating the ending batches (316-321) using the improved matching, I noticed that the number of actual cases decreased somewhat (e.g., for batch 317, the number of cases decreased from 205 to 180).

The better.

This Śh is ALWAYS an error since Ś (capital S with accent) represents the palatal sibilant, and is not followed by 'h' in any Sanskrit word.

Indeed. MW72 transliteration is miserable in all possible ways.
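
Because 'Śh' is always an error, the affected lines can be listed mechanically. The sketch below is an assumption about how such a scan might look, not the program that was actually run; the function name and the sample lines are made up. It only finds the lines and does not decide what the correction should be.

```python
import unicodedata
from typing import Iterable, List, Tuple


def find_sh_errors(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (1-based line number, line) for every line containing 'Śh'.

    Lines are normalized to NFC first, so that 'S' followed by a combining
    acute accent is treated the same as the precomposed letter U+015A.
    """
    hits: List[Tuple[int, str]] = []
    for n, line in enumerate(lines, start=1):
        if "\u015ah" in unicodedata.normalize("NFC", line):
            hits.append((n, line))
    return hits


# Made-up sample lines; the real input would be the lines of mw72.txt.
sample = ["{%\u015ahash%} six", "{%\u015aiva%} auspicious", "{%shash%}"]
print(find_sh_errors(sample))  # only line 1 is reported
```
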

This has been done. There were changes on 97 lines of mw72.txt. 89 of these lines were under headword zaz; the other 8 were under other headwords whose slp spelling begins with 'za'.

Good job, Jim. Methodological as usual.

Although the nominal-actual situation is still theoretically possible

Ignore such cases, please.

झरो झरि सवर्णे 8.4.65 rule allows both 'ntt' and 'nt' as valid form.

I wonder, Dhaval, if other dictionaries use the ntt variant as well.

SergeA commented 7 years ago

Marked two similar print errors in MW72:

chṛd/chṛntte, but ācchṛd/ācchṛnte instead of ācchṛntte
chid/chintte, but avacchid/avacchinte instead of avacchintte

Perhaps there are more similar cases.

झरो झरि सवर्णे 8.4.65 rule allows both 'ntt' and 'nt' as valid form.

Perhaps it is valid according to Panini. Panini has many complicated rules of optional reduplication and elision of letters. But from the European point of view we have here "d" from the root (chid/chind or chṛd/chṛnd) and "t" from the termination (-te), resulting in "tt" without any additional constrictions. Besides, it is inconsistent to write sometimes chintte and sometimes chinte. I think it was a mere error here due to confusion of त्ते and ते.

Welcome aboard @SergeA.

Thanx. :)

funderburkjim commented 7 years ago

Re ntt v. nt .

It is tough to decide how to handle (a) grammatical options and (b) author inconsistency.

My view at the moment is that it is premature to impose consistency on MW72 -- we are dealing with more mundane issues now. Since optional consonant doubling in these cases is grammatically justifiable, I think we should leave the spellings of MW72 as they appear in the text.

@SergeA I suggest you open a new issue, perhaps with the label 'Research', and repeat the arguments that you and Dhaval have raised. This new issue can remain open and therefore visible. When time permits, we could do a complete text-wide examination of 'similar' cases. This data would provide a solid basis for saying that certain variants should be considered print errors. After such an investigation, we might end up with the changes you suggested.

Incidentally, there are many annoying inconsistencies in the 1899 edition of MW as well; especially in use of anusvara or homorganic nasal.

I can dimly imagine a time when we, or someone, make a new-improved MW dictionary, without inconsistencies, and with many other improvements.
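
As a rough starting point for the text-wide examination of ntt/nt variants suggested above, one could list every italic word that is attested both with 'ntt' and with 'nt'. This is only a sketch under assumptions: the function name is made up, words are taken from `{%...%}` spans, and the sample lines are invented.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

ITALIC = re.compile(r"\{%(.*?)%\}")


def variant_pairs(lines: Iterable[str]) -> List[Tuple[str, int, str, int]]:
    """List (ntt form, count, nt form, count) for words found in both spellings."""
    counts: Counter = Counter()
    for line in lines:
        for span in ITALIC.findall(line):
            counts.update(w for w in re.split(r"[\s,.;()=/-]+", span) if w)
    pairs: List[Tuple[str, int, str, int]] = []
    for word in sorted(counts):
        if "ntt" in word:
            single = word.replace("ntt", "nt")
            if single in counts:
                pairs.append((word, counts[word], single, counts[single]))
    return pairs


# Invented sample lines; the real input would be the lines of mw72.txt.
sample = ["{%chintte%}", "{%avacchinte, chintte%}", "{%chinte%}"]
print(variant_pairs(sample))  # [('chintte', 2, 'chinte', 1)]
```

Note that this only reports words attested in both spellings; a form such as avacchinte, whose doubled counterpart never occurs, would need a separate comparison against the simple root.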

gasyoun commented 7 years ago

Since optional consonant doubling in these cases is grammatically justifiable, I think we should leave the spellings of MW72 as they appear in the text.

And maybe add some tag to show that the forms marked 'not OK' are equivalent to the OK ones?

Incidentally, there are many annoying inconsistencies in the 1899 edition of MW as well; especially in use of anusvara or homorganic nasal.

Yeah, that's a swamp.

funderburkjim commented 7 years ago

in batch 303, case 19. idAnIms

The 's' is not part of the Sanskrit word; it is an English 's' for the plural. I'm also not sure about that 'm', as idAni seems likely, as your comment said. For now, I am just moving that 's' out of the scope of Sanskrit (i.e., marking it separately as italic non-Sanskrit).

{%idānīm%}<nsi>s</nsi>
[That 'nsi' tag is used in mw72 to indicate 'non-sanskrit-italic'.]
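
The same kind of change could be made programmatically. The helper below is a hypothetical sketch (its name and the sample line are made up), and it assumes the uncorrected text had the English suffix inside the italic span, i.e. `{%idānīms%}`.

```python
def split_english_suffix(line: str, word: str, suffix: str = "s") -> str:
    """Move an English suffix out of a {%...%} span into an <nsi> element."""
    old = "{%" + word + suffix + "%}"
    new = "{%" + word + "%}<nsi>" + suffix + "</nsi>"
    return line.replace(old, new)


print(split_english_suffix("the {%idānīms%} of", "idānīm"))
# the {%idānīm%}<nsi>s</nsi> of
```
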
funderburkjim commented 7 years ago

in batch 303, case 22.
I think the dental 'd' should be retroflex: īdāṅ-ćakre -> īḍāṅ-ćakre, as this seems to be the periphrastic perfect of īḍ.

32809 old <>{%īḍishe,%} Ved. {%īḷishe%}), {%īdāṅ-ćakre, īḍishyate,%}
32809 new <>{%īḍishe,%} Ved. {%īḷishe%}), {%īḍāṅ-ćakre, īḍishyate,%}

Found confirmation in vcp.

ईड [p= 1008] : ईड¦ स्तुतौ अदा० आत्म० सक० सेट् । ईड्वे ईडिषे ईडिध्वं
ऐडिष्ट ईडाम्--बभूव आस चक्रे । ईडिता ईडिष्यते ऐडि-

Classify as print error. Put comment in UI.
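
The old/new pairs shown in this thread lend themselves to mechanical installation. The following is a minimal sketch, not the actual installation program: it assumes each record has the shape `N old text` / `N new text`, that N is a 1-based line number in the text file, and that lines starting with `;` are comments. The function names and the sample data are made up.

```python
import re
from typing import Dict, List, Tuple

RECORD = re.compile(r"^(\d+) (old|new) (.*)$")


def parse_changes(records: List[str]) -> Dict[int, Tuple[str, str]]:
    """Pair up 'N old text' / 'N new text' records as {N: (old, new)}.

    Lines that do not match that shape (comments, blanks) are ignored.
    """
    olds: Dict[int, str] = {}
    changes: Dict[int, Tuple[str, str]] = {}
    for rec in records:
        m = RECORD.match(rec)
        if not m:
            continue
        lnum, kind, text = int(m.group(1)), m.group(2), m.group(3)
        if kind == "old":
            olds[lnum] = text
        elif lnum in olds:
            changes[lnum] = (olds[lnum], text)
    return changes


def apply_changes(lines: List[str], changes: Dict[int, Tuple[str, str]]) -> List[str]:
    """Return a copy of `lines` with each change applied.

    A change is refused if the current line does not match the 'old' text.
    """
    out = list(lines)
    for lnum, (old, new) in changes.items():
        if out[lnum - 1] != old:
            raise ValueError(f"line {lnum} does not match the 'old' text")
        out[lnum - 1] = new
    return out


records = [
    "; a comment line",
    "2 old <>{%īdāṅ-ćakre,%}",
    "2 new <>{%īḍāṅ-ćakre,%}",
]
text = ["<>first line", "<>{%īdāṅ-ćakre,%}"]
print(apply_changes(text, parse_changes(records)))
```

Checking the 'old' text before replacing guards against installing a correction on a line that has already been changed.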

funderburkjim commented 7 years ago

in batch 304:

; Case 12.  L=13389, key1=kawu, dict=mw72, type=p,status=DONE
; kaṭutarāi -> kaṭutarāï
44576 old <>of a plant, {%= tikta-tuṇḍī,%} commonly {%kaṭutarāi.%}
44576 new <>of a plant, {%= tikta-tuṇḍī,%} commonly <nsi>kaṭutarāï.</nsi>

I think that kaṭutarāï is a Tamil word, not Sanskrit, so have marked as <nsi> --- Can anyone confirm that this word is a plant name in Tamil --- or, if I'm wrong, that it is in fact a Sanskrit word?

SergeA commented 7 years ago

I think we should leave the spellings of MW72 as they appear in the text.

Ok.

@SergeA I suggest you open a new issue, perhaps with the label 'Research' and repeat the arguments that you and Dhaval have raised.

Could you please do it yourself? You know the right way, and I'm a bit unfamiliar with this GitHub system.

I think that kaṭutarāï is a Tamil word, not Sanskrit

mentioned in SKD: कटुतुण्डी, स्त्री, (कटु तीव्रं तुण्डमस्याः ।) लताप्रभेदः । कटुतराइ इतिख्याता ।

funderburkjim commented 7 years ago

batch 304, case 24: @SergeA

I marked all three words as non-Sanskrit. Agree?


; kāḷī -> kāḷī, non-Sanskrit.
196641 old <>dark Śālmali ({%= Marāṭhī kāḷī sāmvarī%}) {%= vaṉśa-%}
196641 new <>dark Śālmali (= <nsi>Marāṭhī kāḷī sāmvarī</nsi>) {%= vaṉśa-%}
funderburkjim commented 7 years ago

batch 304, case 29. Changed to print error:

; Kṛitoććhais -> Kṛitoććhais CHANGE NOT MADE
56282 old <>jealous. {%--Kṛitoććhais (ºta-ućº),%} ind. raised on high.
56282 new <>jealous. {%--Kṛitoććais (ºta-ućº),%} ind. raised on high.

Reason: uććais is the word (indeclinable) for 'high'; I don't think there is a word uććhais.

funderburkjim commented 7 years ago

kaṭutarāï

Since found in SKD, will change back to calling it Sanskrit.

SergeA commented 7 years ago

uććais is the word (indeclinable) for 'high'

Sure. Mea culpa.

funderburkjim commented 7 years ago

jahāngīrī (Batch 307, case 8). The 'non-Sanskrit?' speculation seems highly likely (Persian) and I've so marked.

See interesting comment

SergeA commented 7 years ago

As far as I can see, MW72 gives many suspicious words in the definitions of names of plants, names of places, etc. Often they follow "=" or "commonly". Perhaps those are names from some local dialects. Sometimes he marks them as Hindi, Marathi, etc., and sometimes not. It is very difficult to say whether such a word is Sanskrit or not. In theory, every such name or term can be borrowed from any language into Sanskrit and can be used in Sanskrit texts. The only limitation is the alphabet.

funderburkjim commented 7 years ago

Regarding case 8 of batch 314, laghu-kāvaḷī: @SergeA mentions that it might be non-Sanskrit. I'm leaving it marked as Sanskrit, since it seems to be the third element in a list whose first two elements are Sanskrit words (per MW99).
image

This word also occurs in cases 8,9,10 under hw= DvANkza.

Found confirmation under DvANkza in PW: image

gasyoun commented 7 years ago

first two elements are Sanskrit words

Being headwords equals Sanskrit?

funderburkjim commented 7 years ago

Since kAkolI and kakkolikA are headwords in MW99, they are Sanskrit words; since laGu-kAvaLI is the third in this MW72 list of words, the inference is that it too is a Sanskrit word. As shown, this inference is confirmed, since the word appears in Devanagari in PWG, from which we also infer that PWG considers laGu-kAvaLI to be a Sanskrit word.

gasyoun commented 7 years ago

PWG considers laGu-kAvaLI to be a Sanskrit word

What I meant is that a few rare non-Sanskrit words do appear as headwords in Sanskrit dictionaries, as we've seen in MW, but I hope this is not such a case. The logic makes sense.

SergeA commented 7 years ago

Batch 318 finished. :)

funderburkjim commented 7 years ago

Everything installed.

All done.

Thanks, @SergeA !