sanskrit-lexicon / CORRECTIONS

Correction history for Cologne Sanskrit Lexicon

`o` vs `O` Corrections in PWG, Part 1 #130

Closed: zaaf2 closed this issue 8 years ago

zaaf2 commented 8 years ago

This issue is about an analysis of the data contained in the file http://drdhaval2785.github.io/o_vs_O/output1/PWG.html, generated by the o_vs_O method of highest probability (one dictionary in first word and more dictionaries in second word), as applied to PWG.

OCR error.

image

zaaf2 commented 8 years ago

@gasyoun I find at the Merriam-Webster Dictionary:

“Variant forms” or “variants” sounds perfect to indicate two or more spellings of the same word, even if attested in different dictionaries.

zaaf2 commented 8 years ago

76. काटुकि ― कटुकी (SKD,SNP,VCP) No change. Different words. PWG: काटुकि(?) [L=70260] [p= 5-1277] in चन्द्र°. MW: चन्द्र-काटुकि [p= 386] : m. N. of a man Pravar. iii, 3. [L=71621] SNP: kaṭukī (1) Picrorrhiza kurroa Royle Ex Benth. (…) SKD: कटुकी, स्त्री, (कटु + स्वार्थे कन् । गौरादित्वात् ङीष् ।) कटुका (…) VCP: कटुकी¦ स्त्री कटु + स्वार्थेकन् गौरा० ङीष् । (कट्की) १ कटु- कायाम् । कटुकाशब्दे गुणपर्य्यायादि ।

77. कालाय ― कालय No change. Different words. A form considered a wrong reading by PWG. PWG: कालाय [L=70592] [p= 5-1291] — 1) zu streichen, da an der angeführten Stelle क्वालापाः zu lesen ist; vgl. Spr. 778. MW: कालय [p= 279] : Nom. P. °यति, to show or announce the time Dhātup. xxxv, 28 (v.l.) [L=49599]

78. कुशाविन्दु → कुशविन्दु OCR error. image

79. केतसाप् ― केतसप् No change. Different ways to present the same word. PWG shows the strong form of the nominative. image

MW: केत-सप् [p= 308] : m(nom. pl. -सापस्)fn. obeying the will (of another), obedient [" touching the sky " Sāy. ], v, 58, 3. [L=55622] PW: केतसप् [L=30604] [p= 2098-2], (stark °साप्) Adj. dem Willen folgend , folgsam.

80. गर्धभि ― गर्दभी No change. Different words. image

PWG: गर्दभि [L=21932] [p= 2-0700] m. N. pr. eines Mannes Mbh. 13, 258 (गर्द्धभि)। हयगर्द्धभि (sic) ein Bein. Çiva's 1149. PWG (supplement): गर्दभि [L=119755] [p= 7-1738] an der ersten Stelle liest ed. Bomb. गार्दभि, an der zweiten हयगर्दभि. Z. 2 ist 1149 st. 1149 zu lesen. MW: गर्दभी a [p= 349] : f. a she-ass AV. x ṠBr. xiv Kauṡ. MBh. &c [L=63922]

gasyoun commented 8 years ago

@zaaf2 sure, but I would use one term for variants inside one dictionary and a second term for variants between two different dictionaries. Agree?

drdhaval2785 commented 8 years ago

@funderburkjim

The hardest part of this would be understanding the structure of Dhaval's file. Possibly it would be
easier to work from some precursor of Dhaval's file -- do we know where is the code that created
that file?

The code is placed at https://github.com/drdhaval2785/Sanskritspellcheck with a readme how to use. If you are particularly interested in o_vs_O method code usage, it is here.

It is a CLI tool. I was inexperienced with CLI tools in PHP at that time, so most of the refactoring was done by you. The code should therefore be easy for you to understand.

zaaf2 commented 8 years ago

@gasyoun Agreed. So, “variants” only when the variant readings are found within one and the same dictionary; “variant forms” or “variant readings” or “variant spellings” when between two or more dictionaries. Did I get it right?

gasyoun commented 8 years ago

@zaaf2 you got it right. But there is a problem. MW already uses w.r. (wrong reading) within his own dictionary, and "variant readings" sounds very close to "wrong reading". Too close. I guess "variant forms" is the best possible choice.

zaaf2 commented 8 years ago

81. गिर्ववाह् ― गिर्ववह् (GRA,MW,PW) No change. PWG gives the strong form of the word. Cf. cases 68, 69, and 79. PWG:

MW:

82. गौष्ठी ― गोष्ठी (BOP,CCS,IEG,MD,MW,SKD,STC,VCP) No change. Different words. image MW: गो-ष्ठी a [p= 367] : f. an assembly, meeting, society, association, family connections (esp. the dependent or junior branches), partnership, fellowship MBh. (metrically °ष्ठि, v, 1536) &c [L=67694]

83. चतुर्दशगुणानामन् → चतुर्दशगुणनामन् OCR error. image

84. जायन्ती ― जयन्ती No change. Different words. PWG: जायन्ती [L=27327] [p= 3-0088], (wohl von जयन्त्, partic. von जि, oder von जयन्त) f. N. pr. °पुत्र N. pr. eines Lehrers Bṛh. Âr. Up. 6, 5, 2. MW: जयन्ती a [p= 413] : f. a flag L. [L=77559]

85. दार्व्य ― दर्व्य No change. Different words. image MW:

86. दुःशाक ― दुःशक No change. Different words. image MW: दुः-शक [p= 483] : mfn. impracticable, impossible [L=93290]

87. द्विचतुरस्रक ― द्विचतुरश्रक No change. Variants (cf. case 89. नवास्र ― नवाश्र). PWG:

MW: द्वि-चतुर्-अश्रक [p= 504] : m. N. of a partic. gesture or posture Vikr. (v.l. चतुर्-अस्र्°). [L=98430]

88. ध्वर ― ध्वण (KRM,SHS,SKD,VCP,WIL) No change. Different words. PWG: ध्वर [L=37220] [p= 3-1010], (von ध्वर्) s. अध्वर. [Page03.1011] SHS: ध्वण [L=20854] [p= 374-b] r. 1st cl. (ध्वणति) To sound. भ्वा० अक० पर० सेट् ।

89. नवास्र ― नवाश्र No change. Variants (very confusing; cf. case 87. द्विचतुरस्रक ― द्विचतुरश्रक). PWG:

MW:

90. नितान्तावृक्षीय ― नितान्तवृक्षीय No change. Varia lectio (v.l.) image MW: नि-°तान्त-वृक्षीय [p= 547] : mfn. (v.l. °न्ता_वृ°) g. उत्करा_दि. [L=108303]

91. पारढी ― पारडी (SCH,STC) No change. Different words. PWG: पारढी [L=44608] [p= 4-0668] Verz. d. B. H. No. 903 (Xxi). SCH: pāraḍī [L=18364] [p= 253-3] Kleid? Śuk. t. s. 107 , 3. 7. º 1 STC: पारडी [L=13680] [p= 430,1] pāraḍī- f. vêtement (?).

92. पुरासाह् ― पुरासह् (GRA,PW) No change. That was the form intended by the PWG author. A lexicographical error? image

PW: पुरासह् [L=68099] [p= 4099-1] Adj. ( Nom. °षाट्) von jeher überlegen. GRA: purA-sah, [L=5539] [p= 0827] purā-sáh, Nom. purā-ṣâṭ, a., von Alters her siegreich. -ṣâṭ índras 900,6.

93. पूस ― पुस (SHS,VCP) No change. Different words. PWG: पूस [L=78914] [p= 5-1610] m. Papagei Hâla 265. SHS: पुस [L=25567] [p= 463-a] r. 10th cl. (पोषयति-ते) 1. To rub. 2. To damage, चुरा० उभ० सक० सेट् । [Page463-b+ 60]

94. पूयवाह ― पूयवह No change. That was the form intended by the PWG author. image MW: पू*य-वह [p= 641] : m. " filthy-streamed ", N. of a partic. hell VP. [L=127540.1]

95. प्रतिपत्नी ― प्रतिपत्नि No change. Variants (metri causā) PWG: प्रतिपत्नी [L=79167] [p= 5-1618] f. Nebenbuhlerin: प्रतिपत्निवत् (aus metrischen Rücksichten verkürzt) Bhâg. P. 11, 6, 12. MW: प्रति-पत्नि [p= 662] : f. (mc. for °त्नी) a female rival (-वत् BhP. ) [L=131578]

96. भारवाह् ― भारवह् No change. PWG gives the strong form of the word. Cf. cases 68, 69, 79 and 81. PWG: भारवाह् [L=54610] [p= 5-0253], (भार + वाह्) nom. ag. eine Last führend, tragend Vop. 4, 12. f. भारौही ebend. MW: भार-वह् [p= 753] : (strong form -वाह्) mf(भारौही)n. carrying a lowest Vop. [L=149902] {OCR error in MW: carrying a lowest → carrying a load}

97. भिक्षुकीपारक ― भिक्षुकीपराक No change. That was the form intended by the PWG author. image MW: भिक्षुकी-पराक [p= 756] : m. or n. (?) N. of a building Rājat. [L=150721]

98. भूवाह् ― भुवः (AP,AP90,INM,KRM,SKD) No change. Different words. PWG: भूवाह् [L=55580] [p= 5-0360], (2. भू + वाह्) adj., gen. भूहस्, instr. भूहा Vop. 3, 103. AP90: भुवः [L=22237] [p= 0820-a] Ved. 1 Fire. — 2 The earth (भुवोलोक).

99. भूवाह् ― भूवह् No change. PWG gives the strong form of the word. Cf. cases 68, 69, 79, 81 and 96. MW: भू-वह् [p= 761] : (strong form -वाह्, weak भुह्) mfn. Vop. [L=151748] {OCR error in MW: weak भुह् → weak भूह्} image

100. महासफर ― महाशफर No change. That was the form intended by the PWG author. image MW: महा-शफर [p= 801] : m. a species of carp Bhpr. [L=160906]

101. मांसेपाद् ― मांसेपद् No change. PWG gives the strong form of the word. Cf. cases 68, 69, 79, 81, 96, and 99. MW: मांसे-पद् [p= 805] : (strong from पाद्) m. a species of animal Kāṭh. [L=161881]

102. मूखदूषण → मुखदूषण OCR error. image

gasyoun commented 8 years ago

As for 92: PW is more recent than PWG. So if something may be wrong in PWG and is fixed in PW, I would go with PW.

zaaf2 commented 8 years ago

748. दाविककूल (MW72,PWG) → दाविकाकूल (MW) Factual error (transcribed by MW72). See discussion at #134 (Re 235.) PWG: image Pāṇini 7.3.1 (Böhtlingk’s edition, Leipzig 1887): image MW: image

zaaf2 commented 8 years ago

@gasyoun Re 92. I proposed “no change” following what was discussed above about whether or not to change lexicographical errors. Of course it is much more probable that PW corrected a previous error in PWG. But I think our objective is not to correct the errors committed by the PWG author, but the errors made in spite of his intentions.

zaaf2 commented 8 years ago

Re 748. dAvikAkUla ― dAvikakUla MW72,PWG (दाविकाकूल ― दाविककूल) I am not so sure any more. PWG’s दाविककूल could be defended. कूल is neuter. But then there is Böhtlingk’s Pāṇini edition. A change here would be problematic. Better to leave it as it is. No change.

zaaf2 commented 8 years ago

As could be observed at #131 (Re 247. niHzAmam -> niHzamam), there is an OCR error under PWG निःषम (due to the poor quality of the printed text):

दुःपमम् → दुःषमम्

PWG:

image

Pāṇini 8.3.88 (Böhtlingk’s edition, Leipzig 1887): image

gasyoun commented 8 years ago

On 748 I just had a talk with Sergey from Moscow. He said the same just hours ago, and I could not post it earlier. Böhtlingk’s edition (Leipzig 1887) is a great source for comparison in those rare cases where it is quoted. I even have the original edition on my desk but, to my shame, did not open it. So "no change" is a bad idea. dAvikakUla in MW72 is wrong, being based on PWG. Böhtlingk has it right, and so does MW. This one has to be fixed, as it is a gross and well-documented error. 247. poor quality of the printed text = invisible :+1: @zaaf2, it is time to close Part 1 and start Part 2, before it gets too long.

funderburkjim commented 8 years ago

Re 46. लक्षणवादरहस्य ― लक्षणावादरहस्य

This was concluded to be a NO-CHANGE.

While not disagreeing with the choice, the thought occurs that we should consider the two spellings to be variants. Currently there is no provision in the dictionaries to handle variant spellings. If there were a system for identifying 'equivalent' spellings, this would be such a case.

funderburkjim commented 8 years ago

Re: 51. विचित्वरा ― विचित्वारा

The form of the record (having the parenthetical (विचित्वारा) following the headword) may be a pattern used in PWG to identify alternate spellings.

Everyone should realize that we are now applying to other dictionaries (PWG in this case) the kind of scrutiny that was applied to MW several years ago. One upshot of this scrutiny is that we see places where additional markup would help to expose (and therefore make usable) features of the dictionary. In particular, adding markup to identify alternate spellings, as here, would probably add to the utility of the dictionary.

To give an idea of what I mean by 'additional markup', here's a seat-of-the-pants possibility for additional markup in this case (I'm adding markup to a record of pwg.txt):

current record of pwg.txt:
<H1>000{vicitvarA}1{vicitvarA}¦ ({#vicitvArA#}) s. u. {#vijitvara#} .

possible additional markup:  put the author-identified variant spelling in an '<OR>' tag:
<H1>000{vicitvarA}1{vicitvarA}¦ (<OR>{#vicitvArA#}</OR>) s. u. {#vijitvara#} .

Note that only markup (XML-tags) has been added - the text has not been changed.

With such markup, programs could make use of the markup, for instance, to generate a list of headwords INCLUDING VARIANTS. Perhaps such a list could replace pwghw2.txt.
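As a rough illustration of the point above, here is a minimal Python sketch of how a program might collect headwords together with their author-identified variants. It assumes the record layout of the single pwg.txt line quoted above and the proposed (not yet existing) `<OR>` tag; the function name `headwords_with_variants` and both regular expressions are hypothetical, not part of any existing Cologne tooling.

```python
import re

# Assumed record layout, modelled on the one pwg.txt line quoted above:
#   <H1>000{key1}1{key2}¦ body...
HEADWORD_RE = re.compile(r"^<H1>\d+\{([^}]*)\}")
# Variants wrapped in the proposed <OR> tag (hypothetical markup).
VARIANT_RE = re.compile(r"<OR>\{#([^#]*)#\}</OR>")


def headwords_with_variants(records):
    """Yield (headword, [variants]) for each record that starts with <H1>."""
    for record in records:
        match = HEADWORD_RE.match(record)
        if match is None:
            continue  # not a headword line; skip it
        yield match.group(1), VARIANT_RE.findall(record)


sample = [
    "<H1>000{vicitvarA}1{vicitvarA}¦ (<OR>{#vicitvArA#}</OR>) s. u. {#vijitvara#} .",
]
result = list(headwords_with_variants(sample))
print(result)
```

Only text inside `<OR>` is treated as a variant, so the cross-reference `{#vijitvara#}` in the same record is left out of the variant list.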

Just a thought.

funderburkjim commented 8 years ago

Re. 9. उत्पलवती ― उत्पलावती

Acc. to the Smith digitization of Mahabharata, utpalAvatim occurs at 06010033.

gasyoun commented 8 years ago

A thought will remain just that if Jim is not around. But anyway, that's not top priority. Although it might increase the total number of possible words, 434k is quite impressive already.

funderburkjim commented 8 years ago

re '® is a markup for plants in PW.' @gasyoun is right. This was markup that Thomas put in the original digitization. This feature is documented in the 'pw-meta.txt' file, which is part of pwtxt.zip, one of the pw download items.

Incidentally, in MW this would be marked as <bot>xxxx</bot>. It could not do any harm to bring the markup conventions into greater consistency across the dictionaries.

funderburkjim commented 8 years ago

Regarding 66: makes sense. I agree with the change. I've come down from the ledge of 'OCR changes only'; thanks for talking me down before I jumped!

Am trying to think how to add markup to the digitization. Current idea is that the markup should be simple, such as:

<pc old="OLD">NEW</pc>
'pc' == Print Change
meaning that the printed form was OLD, and we have changed to NEW.
The markup would be the same, regardless of the reason.

Such changes should also be documented in a file for each dictionary, the file being called something like pwg_printchange.txt. This is a more neutral-sounding name than 'corrections_factual'.

The displays can use the markup to provide a brief indication that the digitization intentionally differs from the print edition, and link to the printchange.txt file.

The printchange file can have the free form of current corrections_factual, and in particular have links to relevant github issues (such as this #130 issue).

For cases where the change is to a headword, we could also take this into account via the hw2 file, as mentioned for the <OR> markup suggested above.
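To make the `<pc>` idea a little more concrete, here is a minimal Python sketch of how a display or a report script might read such markup. It assumes exactly the `<pc old="OLD">NEW</pc>` form proposed above; the helper names and the sample fragment (built from case 102 of this issue, in SLP1) are hypothetical and not taken from any existing file.

```python
import re

# Matches the proposed print-change element: <pc old="OLD">NEW</pc>
PC_RE = re.compile(r'<pc old="([^"]*)">([^<]*)</pc>')


def print_changes(text):
    """Return a list of (old, new) pairs, one per <pc> element."""
    return PC_RE.findall(text)


def printed_text(text):
    """Drop the markup and restore the reading of the print edition."""
    return PC_RE.sub(lambda m: m.group(1), text)


def digital_text(text):
    """Drop the markup and keep the changed (digital) reading."""
    return PC_RE.sub(lambda m: m.group(2), text)


# Hypothetical fragment, not copied from pwg.txt.
line = '{#<pc old="mUKadUzaRa">muKadUzaRa</pc>#}'
print(print_changes(line))
print(printed_text(line))
print(digital_text(line))
```

Because the printed form stays in the `old` attribute, both readings can be recovered from the one record, and the list returned by `print_changes` could seed the entries of a per-dictionary printchange file.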

The above sounds like it might have the virtues of

funderburkjim commented 8 years ago

@zaaf2 Would you elaborate on your 'crowdsourcing' idea?

gasyoun commented 8 years ago

How about pwg_printerrata.txt instead of pwg_printchange.txt?

zaaf2 commented 8 years ago

Suggestion for crowdsourcing the work on @drdhaval2785's lists. A MW List Display search for दाविकाकूल (case 748),for example, would result in a screen such as this:

image

In the next screen we would have something like this:

image

funderburkjim commented 8 years ago

@zaaf2 Such a well-presented suggestion! Would you transfer it to another issue, so that it may remain under consideration when the corrections of this issue are installed?

funderburkjim commented 8 years ago

re केतसाप् ― केतसप् Also, there was a 'pad/pAd' similar case. I think it is normally true in PWG that the stem form is presented for nominals, as in MW. I wonder how prevalent it is to find that, as in these sap/sAp and pad/pAd cases, PWG uses a nominative singular form as the headword citation form.

From 81, you've also identified 'vah/vAh' as a similar phenomenon. There you use the term 'strong form', which may be a better way to think of it than 'nominative singular'.

This is similar to the 'vat/vant' spelling variation.

So, maybe these can be tailored as additional alternate form spelling rules for hwnorm1.
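A minimal Python sketch of what such alternate-form rules could look like, assuming headwords in SLP1 and a simple suffix-replacement table. The table and the function name are hypothetical; the actual rule format of hwnorm1 is not shown in this thread, and the endings are taken only from the sap/sAp, pad/pAd, vah/vAh and vat/vant pairs mentioned above.

```python
# Hypothetical strong-form -> stem endings (SLP1), taken from the pairs
# discussed in this issue; this is NOT the actual hwnorm1 rule table.
STRONG_TO_STEM = (
    ("vAh", "vah"),
    ("sAp", "sap"),
    ("pAd", "pad"),
    ("vant", "vat"),
)


def normalize_strong_form(headword):
    """Map a headword cited in its strong form to the corresponding stem.

    Headwords that match no rule are returned unchanged.
    """
    for strong, stem in STRONG_TO_STEM:
        if headword.endswith(strong):
            return headword[: -len(strong)] + stem
    return headword


# Cases 79, 96 and 101 of this issue, transliterated to SLP1.
for hw in ("ketasAp", "BAravAh", "mAMsepAd"):
    print(hw, "->", normalize_strong_form(hw))
```

A purely suffix-based rule like this will also fire on words that merely happen to end in the same letters, so it is better suited to grouping candidate pairs for review than to asserting that two headwords are the same word.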

funderburkjim commented 8 years ago

@drdhaval2785 Link not found:

https://github.com/drdhaval2785/SanskritSpellCheck/blob/master/o_vs_O/readme.txt

funderburkjim commented 8 years ago

@zaaf2 re 102. मूखदूषण → मुखदूषण Should this be called a print error?

funderburkjim commented 8 years ago

@gasyoun re "How about pwg_printerrata.txt instead of pwg_printchange.txt?": I prefer the word 'change' to the word 'errata'. The word 'change' is descriptive of what we are doing (changing the printed edition in the digital edition). The word 'errata' seems more presumptuous.

funderburkjim commented 8 years ago

re 748. दाविककूल → दाविकाकूल I think this change should be made. This is a compound, the first element of which is the name of a river, such names being always(?) feminine, i.e. kA.

@zaaf2 Agree?

funderburkjim commented 8 years ago

@zaaf2 Here is my summary of the corrections to be made based on this issue.

Would you double check that I've interpreted things properly?

Then, I'll install the corrections.

gasyoun commented 8 years ago

@zaaf2 Maybe Lexicographer errors instead of Lexicography errors?

zaaf2 commented 8 years ago

re 748. दाविककूल → दाविकाकूल Error in the PWG printed edition. I agree.

देविका f. is the name of the river. दाविक is the adjective, “(water) coming from the river देविका”. दाविकाकूल itself is also an adjective, “(rice etc.) coming from the banks (कूल) of the देविका”. I was not sure about the change because I thought the first member of the compound was the adj. दाविक, and I could not explain the second ā in दाविकाकूल. Now I see my doubt was unfounded. As one can see in the commentary to Pāṇini’s rule, the adj. दाविकाकूल comes directly from the Tatpuruṣa compound देविकाकूल n. (which may be translated as “bank of the देविका river”). When देविकाकूल as a whole is transformed into the adjective by an (absorbed) -a suffix (v. Whitney 1208.h), then the special rule in question takes effect, and दे- is changed to दा-, the rest of the word remaining unchanged. image

zaaf2 commented 8 years ago

@gasyoun I am not aware I used the expression Lexicography errors. Lexicographer errors? Perhaps Lexicographer’s errors would be better? I would go for Lexicographical errors. We say typographical errors, not typographer errors.

zaaf2 commented 8 years ago

@funderburkjim re 102. मूखदूषण → मुखदूषण Yes. Error in the printed PWG edition. I mistakenly saw an OCR error. image There is no मूख. MW: मुख-दूषण [p= 819] : n. (L. ) (Bhpr. ) " mouth-defiler ", an onion. [L=164884] मुख [p= 819] : n. (m. g. अर्धर्चा*दि ; ifc. f(आ, or ई). cf. Pāṇ. iv, 1, 54, 58) the mouth, face, countenance RV. &c , &c [L=164836]

funderburkjim commented 8 years ago

Re: 71. आज्ञाप्ति ― आज्ञप्ति No change. OCR error.

I think this should be changed, as an "OCR error" (typo), since MW has AjYapti but not AjYApti. @zaaf2 Agree?

funderburkjim commented 8 years ago

Corrections now installed. pwg_printchange.txt also made part of this CORRECTIONS repository.