sanskrit-lexicon / SKD

Discussion of corrections and other issues pertaining to Sabdakalpadruma dictionary at Sanskrit-Lexicon

Correction 'dbika' to 'dvika' #1

Open funderburkjim opened 10 years ago

funderburkjim commented 10 years ago

Shalu submitted, via the Sanskrit-Lexicon correction form, the correction 'dvikaM' for 'dbikaM', noticed under the headword 'ci'.

In processing this, I edited (in Emacs) the entire 22MB digitization file skd.txt, to find the line (line # 117174 in skd.txt) that needs correction:

117174 old <>(bhvAM-ubhaM-dbikaM-aniT |) Ja, cayati cayate
117174 new <>(bhvAM-ubhaM-dvikaM-aniT |) Ja, cayati cayate

Since the conjuncts 'db' and 'dv' are hard to distinguish visually in Devanagari, it seems likely that there are other cases that need correction. A search for 'dbika' in skd.txt finds 37 cases (including the one above):

Note: these lines are in Harvard-Kyoto transliteration.

37 matches for "dbika" in buffer: skd.txt
 113700:<>ubhaM-dbikaM-seT |) e, acadIt | Ja, cadati
 117174:<>(bhvAM-ubhaM-dbikaM-aniT |) Ja, cayati cayate
 132564:<>mRdbIkAyA balAyAzca siddhAH snehAH jvara-
 136314:<>dbikaM-aniT |) NopadezapATho nAditvena
 167327:<>yavabrIhivadbikalpa iti granthagauravAtte na
 173867:<>trikarAzcoparathyAstu dbikarApyuparathyikA |
 173921:<>“dbikacchaH kacchazeSazca muktakacchastathaiva ca |
 183633:<>samAsvAdi sa mRdbIkArase rasavizAradaiH ||’
 183788:<>jAgratastadbikazati svapatazca nimIlati ||
 197474:<>vam | paTolam | ruvutailam | mRdbIkA | zveta-
 197570:<>gavAjayoH payaH | mRdbIkA | parUSakam | kharjjU-
 198208:<>phalam | vetrAgram | ASAD2hakam | mRdbIkA |
 206600:<>nimbaM paTolaM triphalAM mRdbIkAM mustavatsakau ||
 206730:<>“jalaM kharjjUramRdbIkAmadhukaiH saparUSakaiH |
 206977:<>pASANabhidbikaTamUlakRtaH kaSAyaH |
 210184:<>tadeva khaNDamRdbIkAzarkarAsahitaM punaH ||
 238374:<>mRdbIkAyAH palAnyatra catvAri kathitAni hi |
 244794:<>saMvatsarAbhyantare mAsaikadbikAdau cAndrA-
 249904:<>(tanA0-Atma0-dbika0-seT | ktvAveT |) da Ga,
 250194:<>AdhiH sa ca dbividhaH savRdbikamUlyApAkaraNArtho
 268304:<>kalpadrumaH || (bhvA0-Atma0-yAcane dbika0-lAbhe
 285775:<>dbikarSamAtrANyetAni pratyekaM kArayed-
 309434:<HI>mRdbIkA, strI, (mRdu + bAhulakAt Ikan TAp
 310126:<>zcidbikArAn vAtapittakarAn ghorAn aznute
 316898:<>(bhvA0-ubha0-dbika0-seT |) yAcanamAtmane
 319964:<>sa tu dAD2imamRdbIkAyuktaH syAdrAgaSAD2avaH ||”
 323490:<>aTarUSakamRdbIkApathyAkvAthaH sazarkaraH |
 323937:<>dbikarSahatalohaM syAccatuSkarSA sitA bhavet ||
 327618:<>tiktaH kaSAyamanveti te dbikA daza paJca ca ||
 339683:<>(bhvA0-ubha0-dbika0-seT |) R, arireTat |
 367181:<>dbikaM zataM vA gRhNIyAt satAM dharmmamanusmaran ||
 381893:<>nimbaH paTolastriphalA mRdbIkA mustavatsakau |
 387319:<>bhavati | brAhmaNe'dhamarNe dbikaM zatam | kSattriye
 394351:<>dbikAlaJca namet sandhyAmagnInupacarettathA ||
 405129:<>jAgratastadbikasati svapatastu nimIlati ||
 407240:<>trikarAzcoparathyAstu dbikarApyuparakSakAH ||”
 409269:<>sarpiH svapANitalasammitaM dbikAlaM pAyayet || * ||

I suspect that in all of these, 'dbika' should be 'dvika'.

Would others take a look at these, and mention any that should NOT be so changed? or confirm that ALL should be changed?
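For anyone who wants to reproduce the search outside Emacs, here is a minimal Python sketch. The function name `find_matches` and the sample lines are illustrative, not part of the Sanskrit-Lexicon tooling. Since the list above includes 'mRdbIkA', the Emacs search evidently ignored case, so the sketch does the same by default.

```python
def find_matches(lines, needle, ignore_case=True):
    """Return (line_number, text) pairs for lines containing `needle`.

    Line numbers are 1-based, as in the Emacs listing. `lines` can be any
    iterable of strings, e.g. an open file such as open("skd.txt", encoding="utf-8").
    """
    key = needle.lower() if ignore_case else needle
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        haystack = text.lower() if ignore_case else text
        if key in haystack:
            hits.append((number, text))
    return hits


# Tiny in-memory sample; the line numbers here are NOT the skd.txt numbers.
sample = [
    "<>ubhaM-dbikaM-seT |) e, acadIt | Ja, cadati\n",
    "<>dvikaM zataM vA gRhNIyAt\n",
    "<>mRdbIkAyA balAyAzca siddhAH snehAH jvara-\n",
]
for number, text in find_matches(sample, "dbika"):
    print(f"{number:7d}:{text}")
```

Passing `ignore_case=False` would restrict the hits to the literal Harvard-Kyoto string 'dbika' and leave out 'dbIkA'.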

drdhaval2785 commented 10 years ago

There will be many such instances, considering the fact that the SKD author was a native of Eastern India. In Bengali and Oriya, it is usual to write 'b' instead of 'v'. Therefore, if we want to keep the integrity of the original text intact, it is better not to make changes outright, but only after consulting the original book.

drdhaval2785 commented 10 years ago

If, as a moderator, you choose to change b-v / v-b, please post it in the FAQs so that people know it well.

drdhaval2785 commented 10 years ago

My input here is that Bengal is a small part of India. In most of the Indian subcontinent, 'dvika' is preferred over 'dbika'.

gasyoun commented 10 years ago

I would not change anything (and believe me, I've done it wrong so many times before that I now suppose it's better to leave it Bengali than to have it "right"). I would add it to the FAQ and make the search algorithm know that the two can be the same. In Ayurvedic texts we see basti and vasti; both are used, and so we should not break the Bengal tradition just because we know why it's there. What you speak about is a badly needed addition, but not a change to the original book.

Shalu411 commented 10 years ago

The issue here is not about single v>b or b>v. It's about db>dv; it is a specific case, and a relatively smaller issue. Such words, unchanged, will be lost for all non-Bengalis. No Sanskrit user searches for "dbika" in a digital Sanskrit dictionary, even if one is a Bengali. Every Indian Sanskritist is aware of the differences between their mother tongue and Sanskrit. I do not know why the SKD author (and also VCP) preferred those regional spellings! Now, in this age of digitization, we badly need uniformity. The end users have to be kept in view before taking any decision, not the digitizers' choices. And this is about one of "the only two" Sanskrit-Sanskrit dictionaries in existence. It is time to rectify this for good. Digitizing problems can be met with, not standards. One "mistake of standard" made by one regional author should not be made eternal, especially when we have a choice. Now it's for the digitizers to decide. Moreover, there are no copyright issues! Why worry? One can always justify this act of changing for good to all SKD admirers. I prefer dv to db. Thanks

drdhaval2785 commented 10 years ago

For the dv -> db change, I agree that we should reverse it to 'dv'. Now let us explore the safe ways to do so.

Jim, can you create four lists -

  1. 'd-b-x' occurrences in SKD, where x is a vowel.
  2. 'd-b-y' occurrences in SKD, where y is a consonant. (With lists 1 and 2 I want to check whether 'd-b-consonant' occurs anywhere in the dictionary.)
  3. 'd-b' occurrences in Apte / MW. (This will serve as a list which we will not violate while changing to 'dv', because the word may have 'db' in proper Sanskrit also.)
  4. 'd-b' occurrences in SKD.

The next step would be:

  1. Take list 4 and compare it with list 3.
  2. Keep the words in list 4 which occur in list 3 intact.
  3. Change the rest of the words in list 4 ('db' -> 'dv').

This way we can avoid inadvertently changing some true 'db' letters to 'dv'.
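The three steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `plan_db_changes` is a hypothetical name, the word lists are tiny made-up samples, and the comparison is by exact whole-word match, which would miss compounds and inflected forms.

```python
def plan_db_changes(skd_words, attested_db_words):
    """Map each SKD word containing 'db' to its proposed spelling.

    skd_words         -- words from SKD (list 4); words without 'db' are ignored
    attested_db_words -- words with a genuine 'db' found in Apte / MW (list 3)

    Words attested in the reference list are kept intact; in all other
    words every 'db' is proposed to become 'dv'.
    """
    plan = {}
    for word in skd_words:
        if "db" not in word:
            continue
        if word in attested_db_words:
            plan[word] = word  # genuine 'db', leave untouched
        else:
            plan[word] = word.replace("db", "dv")
    return plan


# Illustrative data only; 'udbandha' stands in for a genuine 'db' word.
skd_sample = ["mRdbIkA", "udbandha", "dbandbam", "dvika"]
reference_sample = {"udbandha"}
for old, new in plan_db_changes(skd_sample, reference_sample).items():
    print(old, "->", new)
```

The output of such a plan would still need review by a person before anything is committed, since the reference lists cannot cover every genuine 'db' form.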

drdhaval2785 commented 10 years ago

Just a caution

MW on my local machine gave many instances of 'db' occurrences, which are correct from the Sanskrit point of view and need not be corrected.

A few examples:

[1] असद्बुद्धि / असद्--बुद्धि mfn. foolish BhP. [#21164] [Img:118,3]

[2] ईषद्बीजा / ईषद्--बीजा f. a species of grape (having no kernel) Nir. [#30805] [Img:171,2]

[3] उद्बन्ध् / उद्- A1. ( Pot. -बध्नीत )to tie up , hang one's self S3Br. xi , 5 , 1 , 8. [#34236] [Img:189,3]

funderburkjim commented 10 years ago

I agree there are many pitfalls in making 'global' changes.

One principle that might be used is: Assume the text is internally consistent.

For instance, there are 124 skd.txt lines spelled 'dvika' (or 'dvIka'). (see below) If we assume that the text is internally consistent, then perhaps we can infer that the 'dbika' spellings are all digitization misreadings; thus, we would be justified in changing all the 36 'dbika' instances to 'dvika'.
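A rough way to gather this kind of evidence is to count, per spelling, how many lines contain it. The sketch below is an assumption-laden illustration, not the procedure actually used: the pattern accepts both 'ika' and 'IkA' forms in Harvard-Kyoto, and it counts substrings, so a line such as 'tadvikAra' (tad + vikAra) is counted even though it is not the word 'dvika'.

```python
import re
from collections import Counter

# 'd', then 'b' or 'v', then i/I, then 'k', then a/A (Harvard-Kyoto).
VARIANT = re.compile(r"d([bv])[iI]k[aA]")


def count_spelling_lines(lines):
    """Count the lines containing each spelling ('db' or 'dv').

    A line with several occurrences of the same spelling counts once,
    matching the way an Emacs 'occur' listing counts lines.
    """
    counts = Counter()
    for line in lines:
        for letter in set(VARIANT.findall(line)):
            counts["d" + letter] += 1
    return counts


sample = [
    "<>mukhAkSikarNazIrSeSu zRGge skandhe dvikaM dvikam |",
    "<>mRdbIkAyAH palAnyatra catvAri kathitAni hi |",
    "<>kuzasyedaM tadvikAro vA aN | kuzamayam | kuza-",
]
print(count_spelling_lines(sample))
```

A large imbalance between the two counts supports the internal-consistency argument, but it does not by itself show which individual lines are digitization errors.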

Incidentally, the 'db' and 'dv' conjuncts do seem to be visually distinguishable in skd, based on a limited sample.

'db': [image]

'dv': [image] (But this was coded as 'db'; see the first item in the 'dbika' list above. So this one is a digitization error, I think.)

Here are the 'dvika' (or dvIka) instances:

124 matches for "dvika" in buffer: skd.txt
   3104:<>dvikarSamAtrANyetAni pratyekaM kArayedvudhaH |
  20102:<>avikAri bhavenmedhyamabhakSyaM tadvikArakRt” ||
  21010:<>dvikaM zataM vA gRhNIyAt
  21141:<>yaSTyAhvAzokamUlaJca mRdvIkA ca zatAvarI |
  23122:<>rmukhasya brahmaNaH pratimukhaM dvikarNatayA tathAtvam |)
  32654:<>yaSTIndIvaramRdvIkAtailAjyakSIralepanaiH |
  36551:<>“dvikarSaM lauhacUrNasya cAbhraJcApi palArddhakam |
  37940:<>pramohayan saMjanayedvikAram” ||
  42790:<>kuzanAmA 36 marupriyaH 37 dvikakut 38 durga-
  52567:<>tatparyyAyaH | mRdvIkA 2 gostanI 3 kapilaphalA 4
  52574:<>zitvaJca | iti rAjanirghaNTaH || (mRdvIkAzabde-
  54844:<>bhavedvikAraH ziraso'rddhabhedakRt” ||
  54894:<>karNapAlyAH sakarNAvayavatvAt tadvikAramapyatraivAha |
  59777:<>kaTakhAdakaH 18 dvikaH 19 kAgaH 20 | iti [Page2-072-a+ 52]
  61071:<>pauNDarIkarddhivRddhimRdvIkAjIvantyobhadhukaJceti” |
  62803:<>tAlIzapatrANyaparaM dvikarSam ||
  64236:<>svakAryyasAdhakatayA azItitamabhAgadvikazatAdi-
  67033:<>yathAvat sidvikAptAnAM satyavatyAH sutottama ! ||
  67600:<>mRdvIkArddhazataM triMzatpippalIH zarkarApalam |
  67602:<>tvagelAvyoSamRdvIkApippalImUlapauSkaraiH |
  67640:<>zasyAkena trivRtayA mRdvIkArasayuktayA |
  67795:<>mRdvIkArddhazataM triMzat pippalIzarkarApalam |
  70099:<>kmalac |) vikAsonmukhaprauD2hakalikA | ISadvika-
  74603:<>dvikaM zataM vA gRhNIyAt satAM dharmmamanusmaran ||
  74604:<>dvikaM zataJca gRhNAno na bhavatyarthakilviSI |
  74606:<>dvikaM purANadvayam | evaMvidhaniyamamatikramya
  77858:<>evetyarthaH | sarvvathA kimanenAsmAkamasadvikalpena |
  81351:<>kuzasyedaM tadvikAro vA aN | kuzamayam | kuza-
  83674:<>RddhiM parUSakaM bhArgIM mRdvIkAM vRhatIntathA ||
  83681:<>dvikArSikANi patrailA hematvaGmaricAni ca |
  96659:<>varSAdvikArakArI syAt kukSau vAtena dhAritaH ||
  96744:<>tadA madhukamRdvIkA candanaM raktacandanam ||
 105266:<>mRdvIkA hArahUrA ca gostanI cApi kIrttitA ||”
 106851:<>jJAnam | paJcatriMzadvikalAdhikanavatyuttarasaptazata-
 107232:<>mRdvIkAyAH palAnyatra catvAri kathitAni hi |
 113043:<>dvikaM-seT |) e, acatIt | Ja, catati catate |
 117177:<>ubhaM-dvikaM-aniT |) citizcayanaM rAzIkaraNa-
 117181:<>paraM-dvikaM-seT |) ka mi, cayayati cAya-
 122602:<>gale dvikaNTaH kila tasya pRSThe
 134901:<>bubhukSitaH kiM dvikareNa bhuGkte ||”)
 141772:<>eko ghAtaH sazabdo dvikala iha gurau zabdahIna-
 144762:<>daivAt pazyati so'pi vA zubhakarAnekaM dvikaM vA
 148475:<>“dvikaM trikaM catuSkaJca pakSakaJca zataM samam |
 164802:<>tatparyyAyaH | mRdvIkA 2 gostanI 3 svAdvI 4
 166077:<HI>dvikaM, klI, (dvAbhyAM kAyatIti | kai + kaH |)
 166080:<>dvikaM zataM vA gRhNAno na bhavedarthakilviSI ||”
 166086:<HI>dvikaH, puM, (dvau kau kakAravarNau yatra |) kAkaH |
 166088:<HI>dvikakAraH, puM, (dvau kakArau yatra |) kAkaH | iti
 166090:<HI>dvikakut, [d] puM, (dve kakudau yasya |) uSTraH |
 171428:<>paraM dvikAlapAyI syAdahnaH kAleSu buddhimAn |
 193107:<>(bhvAM-ubhaM-dvikaM-aniT |) pAko viklityanu-
 198950:<>tAmbUlaphalamAno yazcatustridvikatolakaH ||
 199378:<>trikarAzcoparathyAstu dvikarApyuparakSakA ||
 206356:<>“mRdvIkA maghukaM nimbaM kaTukArohiNIsamAH |
 206358:<>iti mRdvIkAdi || 17 ||
 207425:<>“kvathitAstriphalApAThAmRdvIkAjAtipallavAH |
 207439:<>“candanaM zArivAlodhramRdvIkAzarkarAnvitam |
 219204:<>parNAzanAkamRdvIkA phalgukharjjUrayaSTikA ||
 231017:<>(tRdA0-para0-dvika0-aniT |) jJIpsA jJAtu-
 236976:<>dvikarSahatalohaM syAccatuSkarSA sitA bhavet ||
 242138:<>‘vA syAdvikalpopamayorevArthe ca samuccaye |’
 244363:<C1> <C2>dvikAlabhojanocitabhakSyabhojyAnnaharaNe
 247523:<>“dvikarSaM lauhabhasmApi karSaM tAmraM pradApayet |
 253858:<>lAbhastenodayena sahitaM dvikaM trikamityAdi
 260617:<>ubha0-dvika0-seT |) la Ja bravIti | brUte |
 263256:<>para0-dvika0-seT |) R abIbhaNat ababhANat [Page3-478-b+ 52]
 272086:<>zcatuSkaM paJcakaM vA | duSTAyAH dvikaM trikaM vA |
 275346:<>(bhUtAni kSityAdIni tadvikArazca goghaTavRkSA-
 280642:<>madyaM kharjjUramRdvIkAparUSakarasairyutam |
 280694:<>mastukhaNDaM sakharjjUraM mRdvIkA dAD2imAmlikA |
 280771:<>hUram 56 mArdvIkam 57 madanA 58 devasRSTA
 282455:<>mRdvIkA hArahUrA ca gostanI cApi
 285257:<>dvikarNasya tu mantrasya brahmApyeko na budhyate ||”
 285753:<>“mando'gnirddehinAM kuryyAdvikArAn kaphasambha-
 296244:<>tAmbUlaphalamAno yazcatustridvikatolakaH |
 299052:<>trikarAzcoparathyAstu dvikarApyuparakSakA ||
 300423:<>mAlatIkalikAmAlAmISadvikasitAM hareH | [Page3-717-b+ 52]
 303851:<>mRdvIkAkaTukAvyoSadArvvItvaktriphalAghanam |
 303981:<>sakSaudrAstriphalApAThA mRdvIkAjAtipallavAH |
 309438:<>bIjapUrakamRdvIkAlakucAzca sadAD2imAH ||”)
 312777:<>sitAmadhukakharjjUramRdvIkAzca palonmitAH ||
 312887:<>UrddhvaM tiryyagadhaH kuryyAdvikArAn kupito'nilaH ||
 319898:<>sa tu dAD2imamRdvIkAyuktaH syAdrAgaSAD2avaH |
 325004:<>raGgAvatAripASaNDikUTakRdvikalendriyAH ||”)
 327627:<>evAnukrAntastiktena | ete paJcadazadvikasaMyogA
 327771:<>zcaturdvikau paJcadazaprakArau |
 327955:<>ISadvikAsinayanaM smitaM syAt spanditAdharam |
 329724:<>sa tu dAD2imamRdvIkAyuktaH syAdrAgaSAD2avaH |
 331094:<>trikarAzcoparathyAstu dvikarApyuparathyakA ||
 345517:<>yoktrayozca trikaJcaiva madhye paJcAgrake dvikam || [Page4-214-b+ 42]
 345778:<>dvikarSahatalohaM syAccatuH karSA sitA bhavet ||
 350587:<>para0-dvika0-seT |) la, vakti | au, vaktA |
 352800:<>Atma0-dvika0-seT | udittvAt ktvAveT |) da Ga,
 352903:<>tadvikAre camase nipUtaM dazApavitreNa zodhitaM
 358747:<>mRdvIkA kaTukA vyoSA dArvvI tvak triphalA
 359596:<>nirgadAnAsavAriSTasIdhumArdvIkamAdhavAn ||
 360514:<>paJcaviMzativarSANAmadhomAtrA dvikArSikA |
 360531:<>mUtramArge palonmAnA bAlAnAJca dvikArSikI |
 361163:<>(bhvA0-ubha0-dvika0-aniT |) prApaNamiha
 363375:<>elAguru ca mRdvIkA mAMsI vyAghranakho nakhI ||
 363486:<>paTolaM ruvutailaJca mRdvIkAzvetazarkarA |
 367182:<>dvikaM zataJca gRhNAno na bhavatyarthakilviSI |
 367184:<>dvikaM purANadvayam | evaMvidhaM niyamamatikramya
 370351:<>vikArAH evaM tadvikArabhedAnAM dadhyaGkurAdaya-
 372354:<>Atma0-dvika0-seT |) vethate | iti durgAdAsaH ||
 377836:<>samRdvIkArasaM kSaudraM varSAkAle virecanam ||
 380659:<>“vizvabheSajamRdvIkAcitrakairmUtrabhAvitaiH ||”
 381109:<>zarIrAvayavAn saukSmAt pravizedvikaroti ca |
 388256:<>mukhAkSikarNazIrSeSu zRGge skandhe dvikaM dvikam |
 389616:<>Atma0-dvika0-seT |) Ga, vethate | iti durgA-
 405389:<>mudrAGkitamedinIhemacandrayoH rephazUnyadvikakAra-
 412682:<>pAdonadvikaro'pi kiSkuruditazcApazcaturbhiH
 413215:<>prazItihastaM dvikareNa hInaM
 434415:<>mRdvIkAyAH palAnyatra catvAri kathitAni hi |
 447961:<>“vizuddhaM gaganaM grAhyaM dvikarSaM zuddhagandhakam |
 456003:<>rthAdimAsAnAM tulyavadvikalpaH | kintu pUrvvapUrvva-
 470152:<>zarIrAvayavAn saukSmyAt pravizedvikaroti ca ||
 476810:<>mRdvIkA hArahUrA ca gostanI cApi
 481596:<>navAdiguNayuktatvaM tathaikatra dvikarSatA |
 483213:<>mRdvIkA hArahUrA ca gostanI cApi
 483466:<>ISadvikAsi kathanaM smitaM syAt syanditAdharam |
 485105:<>“ekadvikadvAdazabhAgayuktaM
 485197:<>ubha0-dvika0-aniT |) hRtirdezAddezAntara-
 485320:<>jAgratastadvikazati svapatazca nimIlati ||”
funderburkjim commented 10 years ago

Re: 'Can you create four lists...'

Can we come to a consensus that 'dbika' is always an error (a digitization error which should be corrected)? I suspect it is a digitization error in this very special case.
Then, I can make some other lists of likely errors, along the lines you suggest.

By the way, Dhaval, do you do any programming?

gasyoun commented 10 years ago

Dhaval is a PHP coder, far more than I am; http://www.sanskritworld.in/ is his child. I would love for him to show you his sandhi machine. He knows so many issues in the Sanskrit NLP world; a small proof of it: http://is.gd/Wc9xPX

drdhaval2785 commented 10 years ago

QUOTE: "Can we come to a consensus that 'dbika' is always an error (a digitization error which should be corrected)? I suspect it is a digitization error in this very special case." UNQUOTE

In 'dbika' it is always an error.

QUOTE "Then, I can make some other lists of likely errors, along the lines you suggest." UNQUOTE

I am not talking about 'dbika' incident only. There are many cases in SKD, where 'b' has replaced 'v'. Shalu's concern is that the word let's say 'mRdvIkA' is lost to the reader because SKD uses 'mRdbIkA'. So a person who writes correct sanskrit won't be able to reach 'mRdvIkA'. What I would suggest is that we DO NOT replace the headword mRdbIkA to mRdvIkA. We should ADD a new headword mRdvIkA IN ADDITION to the existing mRdbIkA, so that right or wrong - both reach their dictionary entry. (The method to do so is explained in post 7 in this thread.)
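The "add, do not replace" idea can be pictured as a lookup index in which a variant spelling points to the same entry as the original headword. This is a hypothetical sketch only; `build_index`, the alias table and the abbreviated entry text are illustrative and do not describe how the Sanskrit-Lexicon displays are actually built.

```python
def build_index(entries, aliases):
    """Build a headword -> entry lookup that also answers to alias spellings.

    entries -- iterable of (headword, entry_text) pairs as digitized
    aliases -- dict mapping an additional spelling to an existing headword

    The original headword is never removed or altered; an alias is added
    only if it does not collide with an existing headword.
    """
    index = {}
    for headword, body in entries:
        index[headword] = body
    for alias, headword in aliases.items():
        if headword in index and alias not in index:
            index[alias] = index[headword]
    return index


entries = [("mRdbIkA", "mRdbIkA, strI, ... (entry text abbreviated)")]
aliases = {"mRdvIkA": "mRdbIkA"}
index = build_index(entries, aliases)
print(index["mRdvIkA"] == index["mRdbIkA"])
```

With this design a reader typing either spelling reaches the same entry, and the digitized text itself stays as it was.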

I am not sure about the results - TAKE BACKUP, which I am sure you will. Best wishes.

QUOTE: "By the way, Dhaval, do you do any programming?" UNQUOTE

I do PHP coding for Sanskrit NLP. The GitHub page is https://github.com/drdhaval2785/sanskrit

The frontends for the code are:
Sandhi machine - http://lanover.com/lan/sanskrit/sandhi.html
Subanta generation machine - http://lanover.com/lan/sanskrit/subanta.html

Both of these machines display a step-by-step derivation by Panini's grammar rules. Testing and corrections are ongoing.

funderburkjim commented 10 years ago
  1. Corrections for all the 37 'dbika' (or dbIkA) in the list above have been made; changing to dvika (or dvIkA).
  2. This includes the headword mRdbIkA being changed to mRdvIkA. Here's why I made this change. I view it as a correction, in the sense that the scan actually has 'mRdvIkA' (albeit the Devanagari is very smudged in the scan), so the digitization is viewed as inconsistent with the scan, i.e., the digitization is viewed as being in error at this point. Ancillary evidence is that skd.txt actually has 'mRdvIkA' in 36 cases.
  3. I want to allay your concern that corrections might make some ghastly irreversible mistake to the digitization. When I say that "skd.txt has been corrected", what, technically, does that mean? The digitization is actually represented by a sequence of files.
    • The original digitization from Thomas Malten is skd_orig.txt. This file is never altered.
    • A conversion to use utf-8 encoding (rather than the now obscure cp1252 encoding of skd_orig.txt) is made, this file is skd_orig_utf8.txt. This file is never altered.
    • skd_v0.txt incorporates a small number of global changes to skd_orig_utf8.txt. These changes are based on a change_01.txt file (part of the skdxml downloadable).
    • Since there is a large section of about 1000 'preface' lines in the digitization of skd, I chose to split skd_v0.txt into two parts, skd_v1.txt and skd_preface.txt.
  4. skd.txt is made by applying further corrections to skd_v1.txt. These corrections are applied from lists of 'old'/'new' pairs in a file, currently named manualByLine.txt. For instance, when Shalu submits a correction to skd via the Sanskrit Lexicon Correction Form, I add a pair of old/new lines to manualByLine.txt. And to correct dbika to dvika, I added 37 pairs of old/new lines to manualByLine.txt. To effectuate the changes, I rerun a program that changes skd_v1.txt to skd.txt based on the data in manualByLine.txt. (Then, several other programs are run so the changes will be visible in the displays.)

Since all of these steps are governed by programs, each step is reproducible and may be retroactively altered if required. So, for instance, if you later convinced me that the 'mRdbIkA' correction was wrong, I could revise manualByLine.txt and redo the update.
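To make the old/new mechanism concrete, here is a minimal Python sketch of such an update step. It is not the actual program: the function names are made up, and the record layout ("<line number> old <text>" followed by "<line number> new <text>") is an assumption modelled on the 117174 example at the top of this issue; the real manualByLine.txt format may differ.

```python
def parse_corrections(correction_lines):
    """Parse '<n> old <text>' / '<n> new <text>' pairs into {n: (old, new)}."""
    pending = {}
    corrections = {}
    for raw in correction_lines:
        raw = raw.rstrip("\n")
        if not raw.strip():
            continue
        number, kind, text = raw.split(" ", 2)
        if kind == "old":
            pending[int(number)] = text
        elif kind == "new":
            # Raises KeyError if a 'new' record has no preceding 'old' record.
            corrections[int(number)] = (pending.pop(int(number)), text)
    return corrections


def apply_corrections(source_lines, corrections):
    """Return a corrected copy of source_lines (strings without newlines).

    The input is never modified. Each correction is applied only if the
    current line equals the recorded 'old' text, which guards against
    applying a correction to the wrong line.
    """
    result = []
    for number, line in enumerate(source_lines, start=1):
        if number in corrections:
            old, new = corrections[number]
            if line != old:
                raise ValueError(f"line {number} does not match its 'old' text")
            line = new
        result.append(line)
    return result


# Toy two-line 'file'; the real correction is at line 117174 of skd.txt.
source = ["<>some earlier line", "<>(bhvAM-ubhaM-dbikaM-aniT |) Ja, cayati cayate"]
records = [
    "2 old <>(bhvAM-ubhaM-dbikaM-aniT |) Ja, cayati cayate",
    "2 new <>(bhvAM-ubhaM-dvikaM-aniT |) Ja, cayati cayate",
]
print(apply_corrections(source, parse_corrections(records))[1])
```

Because the source lines are only read and a fresh list is produced, rerunning with a revised set of records regenerates the output from scratch, which is what makes a correction reversible.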

All further steps - notably creation of headword lists, creation of xml file, creation of sqlite database from xml file - are based on skd.txt.

gasyoun commented 10 years ago

I love to read your step-by-step guides. Mostly they do not leave anything unanswered in advance. Ancillary evidence is a good thing. But the Devanagari is not actually that bad at that exact place. Can we see the tools you run and rerun on GitHub, please?

funderburkjim commented 10 years ago

Six lists of 'bd' words are now in this skd repository as requested by Dhaval several days ago.

  1. 25 bd-01-skdkeys.txt from skdhw2.txt, slp1. Headwords of skd whose spelling contains 'bd'
  2. 23 bd-01-ap90keys.txt from ap90hw2.txt, slp1 Headwords of ap90 whose spelling contains 'bd'
  3. 88 bd-01-mwkeys.txt from extract-keys_b.txt, slp1 Headwords of mw whose spelling contains 'bd'
  4. 1 bd-01-cons.txt Emacs filter of skd.txt. HK. text lines with 'bd-X', where X is a consonant other than 'r' or 'y'
  5. 466 bd-01-ry.txt Emacs filter of skd.txt. HK. text lines with 'bd-X', where X is 'r' or 'y'
  6. 5947 bd-01-vowel.txt Emacs filter of skd.txt. HK. text lines with 'bd-X', where X is a vowel

The next steps might be:

  1. compare file1 (skd headwords) to files 2 and 3, looking for headword errors in skd.
  2. Confirm suspected error in file 4.
  3. Look for errors in file 5.
  4. Separate file 6 into cases where there is or is not an error. In either case, group similar cases together to facilitate review by others.
    • When the two files (likely errors in file6, non-errors in file6) are stable, we can use some formal criteria to justify committing the lines in the 'error' file as corrections to skd .
drdhaval2785 commented 10 years ago

Jim, 'bd' combination is very easily possible in Sanskrit. SKD headwords in bd-01-skdkeys.txt are all correct.

We are looking for 'db' and not 'bd'. So you should replace 'bd' files with 'db' files to make them bear some fruit.

funderburkjim commented 10 years ago

Thanks, Dhaval !

The error has been corrected: all the files in repository are now 'db...' files, and contain 'db' content.
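For reference, the three text filters (consonant, 'r'/'y', vowel) could be approximated in Python as below. This is a sketch under assumptions, not the Emacs filters themselves: the vowel set is a simplified Harvard-Kyoto list, and 'db' followed by 'h' is skipped on the assumption that it represents 'd' + 'bh' (as in 'adbhuta') rather than a 'db' conjunct.

```python
import re

# First letters of Harvard-Kyoto vowels (simplified; an assumption).
HK_VOWEL_STARTS = set("aAiIuUReo")


def classify_db_lines(lines):
    """Sort lines into buckets by the character that follows 'db'.

    A line is placed once in every bucket for which it has at least one
    matching 'db' occurrence.
    """
    buckets = {"vowel": [], "ry": [], "cons": []}
    for line in lines:
        kinds = set()
        for match in re.finditer(r"db(.)", line):
            following = match.group(1)
            if following == "h":
                continue  # assumed to be d + bh, not the conjunct db
            if following in HK_VOWEL_STARTS:
                kinds.add("vowel")
            elif following in "ry":
                kinds.add("ry")
            elif following.isalpha():
                kinds.add("cons")
        for kind in kinds:
            buckets[kind].append(line)
    return buckets


# 'xadbka' is a made-up string to exercise the consonant bucket.
sample = ["<>yavabrIhivadbikalpa iti", "tadbrahma", "adbhuta", "xadbka"]
for kind, found in classify_db_lines(sample).items():
    print(kind, found)
```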

drdhaval2785 commented 10 years ago

Checked and corrected Headwords file. The corrected version can be seen at https://www.dropbox.com/s/je84cisrnf5qw86/db-01-skdkey_corrected.txt

For change comparison, please use https://www.dropbox.com/sh/y9im3cq4u515zhp/AAB4dzi7vgbLgaYOor3GRh6Na.

drdhaval2785 commented 10 years ago

Here is one more way we can find data entry errors.

e.g. draupadI, dbandam, dbandbam -> when arranged in devanagari sorting, they will give dbandam, dbandbam, draupadI.

Meaning thereby -> we have to identify the differences between (1) the dictionary data entry order and (2) the proper sorting order. Whatever is out of place, we will scrutinize for correctness.

e.g. In dictionary, draupadI, dbandam, dbandbam is data entry. But the text must be dvandam, dvandvam. Otherwise 'db' should have preceded 'dr' of draupadI, which is not the case here. So the writer meant dvandam, dvandvam only and not the 'db' version.
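A small Python sketch of this order check follows. It assumes the headwords are in SLP1 (so 'draupadI' is written 'drOpadI') and uses one common SLP1 alphabet string as the "proper" order; both the alphabet string and the function names are assumptions of the sketch, and SKD's own ordering conventions may differ from it.

```python
# Vowels, anusvara (M), visarga (H), then consonants, in SLP1.
SLP1_ORDER = "aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"
RANK = {ch: i for i, ch in enumerate(SLP1_ORDER)}


def sort_key(word):
    """Key that orders SLP1 words by the alphabet above.

    Characters outside the alphabet sort after all known letters.
    """
    return [RANK.get(ch, len(RANK)) for ch in word]


def find_deviations(headwords):
    """Return adjacent (previous, current) pairs that are out of order."""
    return [
        (prev, cur)
        for prev, cur in zip(headwords, headwords[1:])
        if sort_key(cur) < sort_key(prev)
    ]


as_digitized = ["drOpadI", "dbandam", "dbandbam"]
as_intended = ["drOpadI", "dvandam", "dvandvam"]
print(find_deviations(as_digitized))
print(find_deviations(as_intended))
```

With the 'db' spellings, 'dbandam' is flagged because 'b' sorts before 'r'; with the 'dv' spellings the same sequence is in order, which is the reasoning used in the example above.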

funderburkjim commented 10 years ago
  1. Finally implemented Dhaval's corrections to db-01-skdkeys. The correction transactions are in db-01-skdkeys-updates.txt in this repository. These transactions are changes to the lines of skd.txt containing the headword. Within these lines, some other 'db'->'dv' changes were also made.

    I wasn't sure what the 'change-comparison' program TextDiff was for. Does it do something that Unix 'diff' or Windows 'fc' do not?

  2. All of skd.txt is now in slp1 transliteration. Formerly, skd.txt was in HK, but the keys were in slp1. Now everything is slp1, which simplifies things.

    The other filter files also were remade to be all slp1.

  3. I like the idea of using variations from alphabetical order. The skdhw2_chksort.txt file contains deviations from alphabetical order in skd headwords. No doubt resolving these deviations would generate further corrections (there are 815 deviations in this file).
  4. My next task will be to develop further corrections from the db-01-vowel.txt file (5000+ lines).

    And one subtask will be to identify cases where the 'stem' form of one of the skdkeys corrections appears misspelled in some line of the file.

For instance, from misspelled headword 'dbandaM', take as stem 'dbanda'. Find all lines in db-01-vowel with 'dbanda'(there is just 1); and changing that line will generate a correction. Do this for all the corrected headwords.
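A sketch of this subtask in Python might look as follows. The names are hypothetical, the stem rule (dropping a final 'M' or 'H') is a deliberately crude assumption, and the text lines are made up for illustration.

```python
def stem_of(headword):
    """Drop a final nominative ending ('M' or 'H' in SLP1) to get a rough stem."""
    return headword[:-1] if headword.endswith(("M", "H")) else headword


def propose_stem_fixes(corrected_headwords, numbered_lines):
    """Propose line corrections from already-corrected headwords.

    corrected_headwords -- (misspelled, corrected) headword pairs
    numbered_lines      -- (line_number, text) pairs from the filter file

    Returns (line_number, old_text, new_text) proposals for review;
    nothing is changed in place.
    """
    proposals = []
    for old_hw, new_hw in corrected_headwords:
        old_stem, new_stem = stem_of(old_hw), stem_of(new_hw)
        for number, line in numbered_lines:
            if old_stem in line:
                proposals.append((number, line, line.replace(old_stem, new_stem)))
    return proposals


# Made-up lines and line numbers, for illustration only.
corrected = [("dbandaM", "dvandaM")]
lines = [(10, "<>dbandaH kalahaH"), (11, "<>dvandvaM yugmam")]
for proposal in propose_stem_fixes(corrected, lines):
    print(proposal)
```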

Another subtask will be to remove some non-errors. The first one I notice is [yt]adb, which has 977 cases, I'll look at each. (for instance 'yadbaDnanti' is not an error).
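Setting the [yt]adb lines aside for individual review could be done with a regular expression, roughly as in this sketch (the function name and sample words other than 'yadbaDnanti' are illustrative):

```python
import re

# 'yad' or 'tad' followed by a word starting with 'b': often a genuine 'db'.
SANDHI_CANDIDATE = re.compile(r"[yt]adb")


def split_sandhi_candidates(lines):
    """Split lines into (needs_individual_review, rest)."""
    review, rest = [], []
    for line in lines:
        (review if SANDHI_CANDIDATE.search(line) else rest).append(line)
    return review, rest


review, rest = split_sandhi_candidates(["yadbaDnanti", "mfdbIkA", "tadbIjam"])
print(review)
print(rest)
```

The pattern only separates the lines; whether a given 'yadb...' or 'tadb...' is a non-error still has to be decided case by case.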

gasyoun commented 10 years ago

I've never used Windows "fc", but TextDiff is great for comparing dictionary word lists, with a UI that I can understand, and it exports results in a way I can read. 815 deviations is a big number indeed, great news. Otherwise, awaiting Dhaval's comments.

drdhaval2785 commented 10 years ago

Correcting the wrong sorted words. Leaving the correct spellings as such. Correcting the incorrect ones. Have reached till 1-100 pages. Sample corrected version is as below:

1-001:aH:41,47 !< 1-001:afRI:48,55
1-001:aMSumatPalA:112,114 !< 1-001:aMSumatI:115,118
1-002:akalkanaH:279,282 !< 1-003:akalkA:283,286
1-004:akravyAdaH:485,489 !< 1-004:akramaH:490,497
1-005:akzaraH:725,730 !< 1-005:akzaraM:731,750
1-006:akzaravinyAsaH:791,792 !< 1-006:akzaramuKaH:793,794
1-006:akzIbaH:856,857 !< 1-006:akzIbaH:858,858
1-007:agarhitaH:1026,1030 !< 1-007:agaru:1031,1032
1-008:agADaH:1088,1091 !< 1-008:agADaM:1092,1092
1-009:agnicit:1291,1296 !< 1-009:agnijaH:1297,1298
1-009:agnijvAlA:1313,1316 !< 1-009:agnijihvA:1317,1325
1-009:agnivardDanaM:1363,1367 !< 1-009:agniBaM:1368,1369
1-010:agnivAhaH:1418,1419 !< 1-010:agnibAhuH:1420,1421
1-012:agresarikaH:1798,1800 !< 1-012:agyraH:1801,1803
1-013:aGnyaH:1865,1867 !< 1-013:aGnyA:1868,1871
1-014:aNgAt:2077,2082 !< 1-014:aNganApriyaH:2083,2087 (Here the definition of aNganA has been erroneously split. aNgAt is not a separate word.)
1-015:aNgAratElaM:2192,2200 !< 1-015:aNgArakamaRiH:2201,2206
1-016:aNgulIpaYcakaM:2338,2342 !< 1-016:aNgulIyaH:2343,2344
1-019:ajInapatrI:2854,2855 !< 1-019:ajinayoniH:2856,2858
1-022:ajyezWavfttiH:3397,3407 !< 1-023:ajEkapAt:3408,3413
1-023:ajJalaM:3436,3439 !< 1-023:ajYaH:3440,3447
1-023:iti:3495,3503 !< 1-023:aYjanaH:3504,3510 (Definition of aYjanaM has been erroneously split. iti is not a separate word.)
1-023:anja:3532,3535 !< 1-023:aYjaliH:3536,3541 (Definition of aYjanI has been erroneously split.)
1-025:ataH:3786,3788 !< 1-025:ataeva:3789,3794
1-026:atipanTAH:4014,4016 !< 1-026:atipatraH:4017,4018
1-027:atiBAragaH:4043,4047 !< 1-027:atiBIH:4048,4049 (Here the original entry atitaBIH is wrong. it should be atiBI)
1-029:ayamarTaH:4481,4643 !< 1-030:atisArakI:4644,4646 (Wrong split in definition of ati. ayamarTaH is not a separate word.)
1-030:atisArakI:4644,4646 !< 1-030:atisAmyA:4647,4650
1-031:atyantagAmI:4749,4752 !< 1-031:atyantaH:4753,4754
1-032:atrinetrajaH:4867,4867 !< 1-032:atrinetraprasUtaH:4868,4869
1-035:adButasvanaH:5352,5354 !< 1-035:adButasAraH:5355,5356
1-035:adButasAraH:5355,5356 !< 1-035:admaniH:5357,5358
1-036:aDaHpuzpI:5475,5480 !< 1-036:aDaHkziptaH:5481,5482
1-038:aDimAsaH:5756,5762 !< 1-038:aDimAMsakaH:5763,5770
1-039:aDizWAnaM:5991,5996 !< 1-039:aDikziptaH:5997,6000
1-040:aDo'MSukaM:6059,6060 !< 1-040:aDoGaRwA:6061,6062
1-040:aDovAyuH:6082,6085 !< 1-040:aDokzajaH:6086,6090
1-040:aDyaSanaM:6108,6110 !< 1-040:aDyakzaH:6111,6114
1-040:aDyuzwraH:6169,6170 !< 1-040:aDyuQaH:6171,6173
1-040:aDvajA:6194,6196 !< 1-040:aDvanInaH:6197,6198
1-045:anahaNkftiH:6918,6921 !< 1-045:anakzaH:6922,6923
1-046:animezaH:7140,7141 !< 1-046:aniyataM:7142,7146
1-047:anirvviRRaH:7207,7213 !< 1-047:anirvftiH:7214,7217
1-047:anilAmayaH:7230,7231 !< 1-047:anirlocitaH:7232,7236
1-047:anizpannaH:7258,7262 !< 1-047:anikzuH:7263,7269
1-048:anugavInaH:7407,7409 !< 1-048:anugamaH:7410,7416
1-050:anuBUtiH:7687,7689 !< 1-050:anuBUtAdyavismftiH:7690,7694
1-052:anuhAra:7979,7983 !< 1-052:anukzaRaM:7984,7986
1-054:antaritaM:8257,8262 !< 1-054:antarikzaM:8263,8269
1-054:antarIyaM:8273,8278 !< 1-054:antarIkzaM:8279,8282
1-061:anvazwakA:9437,9456 !< 1-061:anvakzaH:9457,9459
1-064:aparatvaM:9810,9817 !< 1-064:aparatiH:9818,9821
1-065:aparyyuzitaM:10052,10056 !< 1-065:aparvvadaRqaH:10057,10059
1-068:apAstaM:10514,10517 !< 1-068:apAkzaM:10518,10523
1-069:apeyaM:10611,10617 !< 1-069:apekzaRIyaM:10618,10621
1-070:apratyayaH:10760,10766 !< 1-070:apratyakzaH:10767,10772
1-071:abjayoniH:10987,10989 !< 1-071:abjavAhanaH:10990,10991
1-072:aBayA:11086,11088 !< 1-072:aBakzyaM:11089,11102
1-075:aBizavaH:11640,11646 !< 1-075:aBizavaM:11647,11648
1-078:aBIzwA:12041,12046 !< 1-078:aBIkzRaM:12047,12055
1-078:aByaNgaH:12092,12109 !< 1-078:aByaNkzaH:12110,12111
1-079:aByuditaH:12263,12271 !< 1-079:aByupagataH:12272,12278
1-081:ama:12495,12496 !< 1-081:am:12497,12498
1-083:amAyikaH:12801,12803 !< 1-083:amAvasI:12804,12806
1-085:amUrttaH:13203,13218 !< 1-085:amUdfSaH:13219,13223
1-086:amftavallI:13280,13281 !< 1-086:amftarasA:13282,13291
1-091:ayAnayInaH:14136,14139 !< 1-091:ayantritaH:14140,14145
1-092:ayuktaM:14194,14206 !< 1-092:ayugmacCadaH:14207,14214
1-093:Gawwa:14418,14422 !< 1-093:araGawwakaH:14423,14424 (Wrong split of araGawwaH. Gawwa is not a separate word here.)
1-093:arawuH:14429,14430 !< 1-093:araRiH:14431,14432

drdhaval2785 commented 10 years ago

A few observations on the wrong sorting. There are some conventions which SKD followed that are not proper according to Sanskrit sorting conventions.

  1. SKD places visarga before anusvAra. e.g. 1-005:akzaraH:725,730 !< 1-005:akzaraM:731,750
  2. SKD sorts 'v' at places where 'b' should be there. e.g. 1-009:agnivardDanaM:1363,1367 !< 1-009:agniBaM:1368,1369
  3. SKD treats 'kz' as a separate consonant, and sorts it after 'l'. e.g. 1-045:anahaNkftiH:6918,6921 !< 1-045:anakzaH:6922,6923
drdhaval2785 commented 10 years ago

Just an interim reply - Correcting reached to page 2-128 of scans. 25% error list checked. Will take some time before this can be corrected fully.

gasyoun commented 10 years ago

So it's possible until end of 2014, great news.

funderburkjim commented 10 years ago

Just an interim reply from me, too, on the 'db' cases within the text of SKD. Of the 5906 cases in db-01-vowel.txt (this repository), I've classified 669 as requiring no change and 4455 as requiring change (from 'db' to 'dv'); 782 are still unclassified.

funderburkjim commented 9 years ago

I decided to upload the interim work done on identification of potential 'db' spelling errors in the SKD text; see https://github.com/sanskrit-lexicon/SKD/blob/master/vowelwork-confident-summary.org and https://github.com/sanskrit-lexicon/SKD/blob/master/step1_4c_upd.

This deals with potential 'db' problems within the text of SKD. The file contains suggested changes that ejf felt confident about at the time.
The work is incomplete, and the suggested changes have not been made. The work may be helpful if this 'db' text issue is taken up again at some time.

The step1_4c_upd file contains detailed old/new records summarized in vowelwork-confident-summary file.

It was probably premature to spend so much time on this now.

I suggest we close this issue now.