Closed funderburkjim closed 7 years ago
Here is a suggested work flow for working on a batch.
Batch | case1-case2 | User | Date Begin | Date End | Installed |
---|---|---|---|---|---|
301 | 1-15 | @funderburkjim | 11/25/2016 | 11/27/2016 | 11/28/2016 |
302 | 16-30 | @SergeA | 11/28/2016 | 11/28/2016 | 11/28/2016 |
303 | 31-60 | @SergeA | 11/29/2016 | 11/29/2016 | 11/30/2016 |
304 | 61-90 | @SergeA | 11/29/2016 | 11/30/2016 | 11/30/2016 |
305 | 91-120 | @SergeA | 11/30/2016 | 11/30/2016 | 12/01/2016 |
306 | 121-150 | @SergeA | 11/30/2016 | 11/30/2016 | 12/01/2016 |
307 | 151-180 | @SergeA | 12/01/2016 | 12/01/2016 | 12/01/2016 |
308 | 181-210 | @SergeA | 12/01/2016 | 12/01/2016 | 12/02/2016 |
309 | 211-240 | @SergeA | 12/01/2016 | 12/02/2016 | 12/02/2016 |
310 | 241-270 | @SergeA | 12/02/2016 | 12/02/2016 | 12/02/2016 |
311 | 271-300 | @SergeA | 12/07/2016 | 12/07/2016 | 12/10/2016 |
312 | 301-330 | @SergeA | 12/07/2016 | 12/07/2016 | 12/10/2016 |
313 | 331-360 | @SergeA | 12/09/2016 | 12/09/2016 | 12/10/2016 |
314 | 361-390 | @SergeA | 12/10/2016 | 12/10/2016 | 12/10/2016 |
315 | 391-420 | @SergeA | 12/10/2016 | 12/10/2016 | 12/10/2016 |
316 | 421-450 | @SergeA | 12/13/2016 | 12/13/2016 | 12/13/2016 |
317 | 451-480 | @SergeA | 12/13/2016 | 12/13/2016 | 12/13/2016 |
318 | 481-496 | @SergeA | 12/13/2016 | 12/13/2016 | 12/13/2016 |
@SergeA
Does this approach sound ok?
I'll assume that you will be the primary worker on these cases. OK?
Note: As of now, batch 301 is complete. I've prepared only batch 302. When batch 302 is completed, I'll go ahead and prepare the rest of the batches.
Ok. I'll start right now. But there is a little problem. I can't edit this table.
Finished. Batch 302 ready for installation :) Also, case 13, murcC (amūrćhīt), has a typo in the headword: मुर्च्छ् instead of मुर्छ्.
I can't edit this table.
To edit the table, you click on the little pencil:
Do the editing, and click the 'Update Comment' button.
If you don't have permissions, let Marcis solve that.
For now, I'll edit the table.
Batch 302 now installed. Everything looked perfect!
मुर्च्छ् instead of मुर्छ्.
Good to mention things like this. An alternate way to communicate such 'extra changes' is in the 'comment' section for the case in the UI, as I pay attention to these comments during installation.
Notice that the PROGRESS TABLE above is now updated to show the installation has been done.
Also, anyone revisiting the UI for batch 302 will see a message THESE CORRECTIONS ARE INSTALLED.
All the batches have now been generated.
You can use these links to get to the batches, if you find the links convenient:
batch 301 batch 302 batch 303 batch 304 batch 305
batch 306 batch 307 batch 308 batch 309 batch 310
batch 311 batch 312 batch 313 batch 314 batch 315
batch 316 batch 317 batch 318
One unexpected thing occurs in the batches: Although there are supposed to be 30 cases, sometimes there are more than 30. This is especially noticeable in:
If this proves to be a problem (e.g., updates are too slow), let me know and I can break these up into smaller sets of actual cases.
NOTE: I think the reason for this difference between nominal and actual batch size is:
A case is a particular Sanskrit word in a particular MW72 headword.
Tomorrow, I'll set up a similar system for dealing with the Wilson 3-gram cases, and will begin work on them.
@SergeA Happy correcting! Thanks for the help 👍
I can't edit this table.
Done.
Why twice prayuj?
You can use these links to get to the batches, if you find the links convenient
Thanks.
Sometimes, a MW72 headword can be long, and can contain multiple copies of the particular Sanskrit word in question.
Yeah, that seems to be the case in prayuj above as well.
303 + 304 ready
@gasyoun Glad you noticed the 'prayuj' example, as it indicated several things of interest.
Under headword prayuj, there are two 3-gram cases:
yug-GavIMzi:1:prayuj,32610,146337,yug-ghavīṉshi##unknowns=IMz
havIMzi:1:prayuj,32610,146336,havīṉshi##unknowns=IMz
The two words are yug-ghavīṉshi and havīṉshi. The second one is a substring of the first. For the yug-ghavīṉshi case, the program generated a case for line 146337 - this is good. But the program was also generating, for the havīṉshi case, a potential correction also involving line 146337 -- this is bad.
The program has now been adjusted to avoid this substring matching problem.
When regenerating the ending batches (316-321) using the improved matching, I noticed that the number of actual cases decreased somewhat (e.g., for batch 317, the number of cases decreased from 205 to 180).
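To illustrate the substring problem, here is a minimal sketch of whole-word matching. It is not the actual program; the function name `find_whole_word` and the rule that a word may not be bordered by a letter, digit or hyphen are assumptions made only for this example.

```python
import re

def find_whole_word(word, line):
    """Return start offsets where `word` occurs as a complete word in `line`.

    Assumption for this sketch: a match does not count if it is directly
    preceded or followed by a letter, digit, underscore or hyphen.
    """
    pattern = r"(?<![\w-])" + re.escape(word) + r"(?![\w-])"
    return [m.start() for m in re.finditer(pattern, line)]

# Hypothetical line fragments, modeled on the prayuj example.
compound_line = "{%yug-ghavīṉshi%}"
simple_line = "{%havīṉshi%}"

# Naive substring search wrongly finds havīṉshi inside the compound.
naive_hit = "havīṉshi" in compound_line

# Whole-word search finds it only where it stands alone.
hits_in_compound = find_whole_word("havīṉshi", compound_line)
hits_in_simple = find_whole_word("havīṉshi", simple_line)
```

With this kind of check, the havīṉshi case would no longer pick up the line that belongs to the yug-ghavīṉshi case.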
A closer examination of batch 317 cases showed that nearly all the 'extra' cases involve a particular error under headword 'zaz' (hk = SaS).
The error is the spelling Śh (with an accent over 'S') instead of Sh (no accent) for the retroflex sibilant.
This Śh is ALWAYS an error since Ś (capital S with accent) represents the palatal sibilant, and is not followed by 'h' in any Sanskrit word.
Further, although the correction to some hypothetical word beginning Śh could be to remove the 'h', since all those 96 cases occur under zaz, it is almost certain that the actual correction required is to remove the accent over the 'S', i.e., to change Śh to Sh.
As further confirmation of the nature of the change to make, here is the list of words - focus on the IAST spelling before the '##'. You'll see that all of them are some sandhi-altered form of 'zaz'. zaz-examples.txt
Since there are so many such cases, and since the solution is always the same, it makes sense to write a little program to generate all the corrections.
This has been done. There were changes on 97 lines of mw72.txt. 89 of these lines were under headword zaz; the other 8 were under other headwords whose slp spelling begins with 'za'
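For readers curious what such a program might look like, here is a rough sketch. It is not the script that was actually run: the record layout `(line_number, headword, text)`, the function names, and the sample text are all invented for illustration. It assumes Sanskrit text is wrapped in `{%...%}`, as in the old/new lines quoted in this thread.

```python
import re

# Italic Sanskrit spans are assumed to be delimited by {% ... %}.
ITALIC = re.compile(r"\{%(.*?)%\}")

def fix_line(text):
    """Replace 'Śh' with 'Sh', but only inside {%...%} spans."""
    return ITALIC.sub(
        lambda m: "{%" + m.group(1).replace("Śh", "Sh") + "%}", text
    )

def generate_corrections(records, hw_prefix="za"):
    """Build 'old'/'new' correction lines for matching headwords.

    records: iterable of (line_number, headword_slp1, text) tuples
             (a hypothetical layout, not the real mw72.txt format).
    """
    out = []
    for lnum, hw, text in records:
        if not hw.startswith(hw_prefix):
            continue  # only headwords whose SLP1 spelling begins with 'za'
        new = fix_line(text)
        if new != text:
            out.append(f"{lnum} old {text}")
            out.append(f"{lnum} new {new}")
    return out

# Invented sample records, for demonstration only.
sample = [
    (1, "zaz", "<>{%Śhaṭ-ka%} foo"),
    (2, "zaz", "<>Śh outside italics"),
    (3, "kawu", "<>{%Śhx%}"),
]
corrections = generate_corrections(sample)
```

Restricting the change to `{%...%}` spans under the relevant headwords keeps legitimate uses of Ś elsewhere in the text untouched.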
Next, the ngram3cases.txt file was modified to remove the Śh cases. These occurred starting with former case 437.
After this removal, there are 496 total cases.
The removed case was in batch 316. So batches 316-318 were regenerated.
Since there are now only 496 nominal cases, batches 319, 320 and 321 are not required.
For completeness, batches 306-315 have also been regenerated; @SergeA has already worked on batches 303, 304, and 305, so these have not been regenerated.
Although the nominal-actual situation is still theoretically possible, with the removal of the Sh cases and the removal of the substring problem, extra actual cases are now rare (biggest instance, 36 actual cases v. 30 nominal cases in batch 313).
Marked two similar print errors in MW72.
chṛd/chṛntte but ācchṛd/ācchṛnte instead of ācchṛntte
chid/chintte but avacchid/avacchinte instead of avacchintte
Perhaps there are more similar cases.
झरो झरि सवर्णे 8.4.65 rule allows both 'ntt' and 'nt' as valid form.
Welcome aboard @SergeA.
The program has now been adjusted to avoid this substring matching problem.
Great.
When regenerating the ending batches (316-321) using the improved matching, I noticed that the number of actual cases decreased some (e.g. for batch 317, the number of cases decreased from 205 to 180).
The better.
This Śh is ALWAYS an error since Ś (capital S with accent) represents the palatal sibilant, and is not followed by 'h' in any Sanskrit word.
Indeed. MW72 transliteration is miserable in all possible ways.
This has been done. There were changes on 97 lines of mw72.txt. 89 of these lines were under headword zaz; the other 8 were under other headwords whose slp spelling begins with 'za'
Good job, Jim. Methodological as usual.
Although the nominal-actual situation is still theoretically possible
Ignore such cases, please.
झरो झरि सवर्णे 8.4.65 rule allows both 'ntt' and 'nt' as valid form.
I wonder, Dhaval, if other dictionaries use the ntt variant as well.
Perhaps, by Panini it is valid. Panini has many complicated rules of optional reduplication and elision of letters. But from the European point of view we have here "d" from the root (chid/chind or chṛd/chṛnd) and "t" from the termination (-te), resulting in "tt" without any additional constrictions. Besides, it is inconsistent to write sometimes chintte and sometimes chinte. I think it was a mere error here due to confusion of त्ते and ते.
Welcome aboard @SergeA.
Thanx. :)
Re ntt v. nt .
It is tough to decide how to handle (a) grammatical options and (b) author inconsistency.
My view at the moment is that it is premature to impose consistency on MW72 at this time -- we are dealing with more mundane issues now. Since optional consonant doubling in these cases is grammatically justifiable, I think we should leave the spellings of MW72 as they appear in the text.
@SergeA I suggest you open a new issue, perhaps with the label 'Research' and repeat the arguments that you and Dhaval have raised. This new issue can remain open and therefore visible. When time permits, we could do complete text-wide examination of 'similar' cases. This data would provide a solid basis for saying that certain variants should be considered print errors. After such an investigation, we might end up with the changes you suggested.
Incidentally, there are many annoying inconsistencies in the 1899 edition of MW as well; especially in use of anusvara or homorganic nasal.
I can dimly imagine a time when we, or someone, make a new-improved MW dictionary, without inconsistencies, and with many other improvements.
Since optional consonant doubling in these cases is grammatically justifiable, I think we should leave the spellings of MW72 as they appear in the text.
And maybe add some tag, the not OK are equal OK?
Incidentally, there are many annoying inconsistencies in the 1899 edition of MW as well; especially in use of anusvara or homorganic nasal.
Yeah, that's a swamp.
in batch 303, case 19. idAnIms
The 's' is not part of the Sanskrit word; it is an English 's' for the plural. I'm also not sure about that 'm', as idAni seems likely, as your comment said. For now, just moving that 's' out of the scope of Sanskrit (i.e., as a separately marked italic non-Sanskrit).
{%idānīm%}<nsi>s</nsi>
[That 'nsi' tag is used in mw72 to indicate 'non-sanskrit-italic']
in batch 303, case 22.
I think the dental 'd' should be retroflex: īdāṅ-ćakre -> īḍāṅ-ćakre, as this seems to be the periphrastic perfect of īḍ.
32809 old <>{%īḍishe,%} Ved. {%īḷishe%}), {%īdāṅ-ćakre, īḍishyate,%}
32809 new <>{%īḍishe,%} Ved. {%īḷishe%}), {%īḍāṅ-ćakre, īḍishyate,%}
Found confirmation in vcp.
ईड [p= 1008] : ईड¦ स्तुतौ अदा० आत्म० सक० सेट् । ईड्वे ईडिषे ईडिध्वं
ऐडिष्ट ईडाम्--बभूव आस चक्रे । ईडिता ईडिष्यते ऐडि-
Classify as print error. Put comment in UI.
in batch 304:
; Case 12. L=13389, key1=kawu, dict=mw72, type=p,status=DONE
; kaṭutarāi -> kaṭutarāï
44576 old <>of a plant, {%= tikta-tuṇḍī,%} commonly {%kaṭutarāi.%}
44576 new <>of a plant, {%= tikta-tuṇḍī,%} commonly <nsi>kaṭutarāï.</nsi>
I think that kaṭutarāï is a Tamil word, not Sanskrit, so have marked as <nsi> --- Can anyone confirm that this word is a plant name in Tamil --- or, if I'm wrong, that it is in fact a Sanskrit word?
I think we should leave the spellings of MW72 as they appear in the text.
Ok.
@SergeA I suggest you open a new issue, perhaps with the label 'Research' and repeat the arguments that you and Dhaval have raised.
Could you please do it yourself? You know the right way. And I'm a bit unfamiliar with this GitHub system.
I think that kaṭutarāï is a Tamil word, not Sanskrit
mentioned in SKD: कटुतुण्डी, स्त्री, (कटु तीव्रं तुण्डमस्याः ।) लताप्रभेदः । कटुतराइ इतिख्याता ।
batch 304, case 24: @SergeA
I marked all three words as non-Sanskrit. Agree?
; kāḷī -> kāḷī, non-Sanskrit.
196641 old <>dark Śālmali ({%= Marāṭhī kāḷī sāmvarī%}) {%= vaṉśa-%}
196641 new <>dark Śālmali (= <nsi>Marāṭhī kāḷī sāmvarī</nsi>) {%= vaṉśa-%}
batch 304, case 29. Changed to print error:
; Kṛitoććhais -> Kṛitoććhais CHANGE NOT MADE
56282 old <>jealous. {%--Kṛitoććhais (ºta-ućº),%} ind. raised on high.
56282 new <>jealous. {%--Kṛitoććais (ºta-ućº),%} ind. raised on high.
Reason: uććais is the word (indeclinable) for 'high'; I don't think there is a word uććhais.
kaṭutarāï
Since found in SKD, will change back to calling it Sanskrit.
uććais is the word (indeclinable) for 'high'
Sure. Mea culpa.
jahāngīrī (Batch 307, case 8). The 'non-Sanskrit?' speculation seems highly likely (Persian) and I've so marked.
As I can see, MW72 provides many suspicious words in definitions of names of plants, names of places etc. Often they follow "=" or "commonly". Perhaps those are denominations from some local dialects. Sometimes he marks a word as Hindi, Marathi etc., and sometimes not. It is very difficult to say whether such a word is Sanskrit or not. In theory, every such name or term can be borrowed from any language into Sanskrit and can be used in Sanskrit texts. The only limitation is the alphabet.
Regarding case 8 of batch 314, laghu-kāvaḷī: @SergeA mentions that it might be non-Sanskrit.
I'm leaving it marked as Sanskrit, since it seems to be the 3rd element in a list whose first two elements are Sanskrit words (per MW99).
This word also occurs in cases 8,9,10 under hw= DvANkza.
Found confirmation under DvANkza in PW:
first two elements are Sanskrit words
Being headwords equals Sanskrit?
Since kAkolI and kakkolikA are headwords in MW99, they are Sanskrit words; since laGu-kAvaLI is the third in this MW72 list of words, the inference is that it too is a Sanskrit word. As shown, this inference is confirmed since the word appears in Devanagari in PWG, from which we also infer that PWG considers laGu-kAvaLI to be a Sanskrit word.
PWG considers laGu-kAvaLI to be a Sanskrit word
What I meant is that there are few rare non-Sanskrit words in Sanskrit dictionaries as headwords as we've seen in MW, but hope it's not such case. The logic makes sense.
Batch 318 finished. :)
Everything installed.
All done.
Thanks, @SergeA !
This issue is devoted to corrections of MW72 text, as begun in #320.
The text in mw72 identified as (a) italic and (b) Sanskrit has been examined for possible spelling errors. In this study, possible errors are chosen on the basis of the word having an SLP1 spelling with an unusual 3-gram (when compared to 3-grams of all headwords in Sanskrit dictionaries).
600 cases have been so identified.
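As a rough illustration of the 3-gram idea (not the actual program used for this study), the sketch below counts the 3-grams of a reference headword list and reports which 3-grams of a candidate word are absent. The function names and the tiny headword list are made up for the example; a real run would use the headwords of all the dictionaries.

```python
from collections import Counter

def trigrams(word):
    """Return the overlapping 3-character substrings of an SLP1 word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]

def build_trigram_counts(headwords):
    """Count every 3-gram occurring in the reference headwords."""
    counts = Counter()
    for hw in headwords:
        counts.update(trigrams(hw))
    return counts

def unusual_trigrams(word, counts, min_count=1):
    """Return the 3-grams of `word` seen fewer than `min_count` times."""
    return [g for g in trigrams(word) if counts[g] < min_count]

# Toy reference list (an assumption; far too small to be meaningful).
reference = build_trigram_counts(["havis", "kavi", "yuga"])

# A word is flagged as a possible error if any of its 3-grams is unusual.
flagged = unusual_trigrams("havIMzi", reference)
```

With a realistic reference list, only genuinely rare 3-grams would be reported, like the `unknowns=IMz` shown in the case lines in this thread.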
These cases have been broken into smaller batches, identified by a batch number from 301 to 321. Each batch has 30 cases, except for the first two, which have 15 each.
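The nominal case ranges follow directly from these sizes. Here is a small sketch of the arithmetic; the function name and parameters are just for illustration and are not taken from the actual batch generator.

```python
def batch_ranges(total_cases, first_batch=301, small=15, n_small=2, size=30):
    """Return a list of (batch_number, first_case, last_case) tuples.

    The first `n_small` batches hold `small` cases each; later batches hold
    `size` cases, with the final batch truncated at `total_cases`.
    """
    ranges = []
    start, batch, index = 1, first_batch, 0
    while start <= total_cases:
        width = small if index < n_small else size
        end = min(start + width - 1, total_cases)
        ranges.append((batch, start, end))
        start, batch, index = end + 1, batch + 1, index + 1
    return ranges

original = batch_ranges(600)  # the 600 cases first identified
reduced = batch_ranges(496)   # the total after the Śh cases were removed
```

With 600 cases this gives batches 301 through 321. With 496 cases the list ends at batch 318 covering cases 481-496, which agrees with the progress table.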
There is a User-Interface (UI) for marking corrections. The url for the UI depends on the batch number. Here is the url for batch 302: