alphabetizing errors in skd headwords

funderburkjim commented 9 years ago

Shalu posted a correction where two headwords 'nu' should be 'tanu'. The list display (before correction) showed tatu, nu, nu, tanukṣīra. These two have been corrected (to tanu).

Clearly, the 'nu' headwords here were out of alphabetical order.

At some point in working with SKD, I wrote a program to check for errors in the alphabetical ordering of headwords in SKD. The file 'skdhw2_chksort.txt' (uploaded to this repository) identifies 795 cases of headwords out of alphabetical order. Probably most of these are due to digitization errors. There remains the task :sweat: of finding corrections for these.
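For readers who want to try this kind of check themselves, here is a minimal sketch. It is not the program that produced skdhw2_chksort.txt. It assumes headwords are in SLP1 and that the collation is vowels, then anusvara (M), then visarga (H), then consonants; the names `SLP1_ORDER`, `slp1_key` and `find_misordered` are made up for the illustration.

```python
# Assumed SLP1 collation: vowels, anusvara (M), visarga (H), then consonants.
SLP1_ORDER = "aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"
RANK = {ch: i for i, ch in enumerate(SLP1_ORDER)}


def slp1_key(word):
    """Map an SLP1 string to a list of ranks; unknown characters sort last."""
    return [RANK.get(ch, len(SLP1_ORDER)) for ch in word]


def find_misordered(headwords):
    """Return (previous, current) pairs where current sorts before previous."""
    problems = []
    for prev, cur in zip(headwords, headwords[1:]):
        if slp1_key(cur) < slp1_key(prev):
            problems.append((prev, cur))
    return problems


# The sequence from the list display before the correction (in SLP1).
print(find_misordered(["tatu", "nu", "nu", "tanukzIra"]))
```

On the sequence above the sketch flags the step from 'nu' to 'tanukzIra'; once both 'nu' are changed to 'tanu', nothing is flagged. A flagged pair only says that something near it is wrong, not which headword needs the correction.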

gasyoun commented 9 years ago

795 is not a small number. Could you provide the list of them, or are you sure they are exactly digitization errors? I would implement the change for sure, but would love to see the changelog.txt file on github here as well, thanks.

Shalu411 commented 9 years ago

795 might be a big number, but finding corrections for them is not a big deal once the list is made. I myself have seen many such instances in the list display, but never noticed why they happened. This time it was that very word, so it got caught. I always use the list display, for the sake of Devanagari. Please provide me the list.

funderburkjim commented 9 years ago

Re: "Could you provide the list of them?" This is the file 'skdhw2_chksort.txt' in this SKD repository.

To find the corrections will likely require consulting the scans. That's why 795 is not a small number. If you imagine 1 minute per instance, that would be 13 hours.

If anyone wants to volunteer to find these corrections, I might be able to assist by providing some custom displays to make the workflow for the task efficient.

gasyoun commented 9 years ago

Shalu is a PhD student as I am. My PhD should be over on 13 Nov 2014. As Shalu is even less github savvy, I guess a link https://github.com/sanskrit-lexicon/SKD/blob/master/skdhw2_chksort.txt is better, than just the name of it :)

Shalu411 commented 9 years ago

"Shalu is even less github savvy.." Oh.. right- I am. "https://github.com/sanskrit-lexicon/SKD/blob/master/skdhw2_chksort.txt is better, than just the name of it" True. Thanks.. for the link. Will look at the file -sure.. as time permits.

funderburkjim commented 9 years ago

Glad to see Dhaval's useful work on skd alphabetizing errors. I'll probably wait until Dhaval is finished before implementing changes on the Cologne site.

Let's put that work under this 'alphabetizing errors in skd headwords' issue. I've copied Dhaval's last two comments (posted under issue #3) to here.

funderburkjim commented 9 years ago

This is a copy of Dhaval's first alphabetization errors list: (Dhaval's convention for corrections is to put a correction note in parentheses at end of record. So, there are 6 corrections mentioned in the list below.)

Correcting the wrongly sorted words: correct spellings are left as they are, and incorrect ones are corrected. I have reached pages 1-100 so far. A sample of the corrected version is below:

1-001:aH:41,47 !< 1-001:afRI:48,55
1-001:aMSumatPalA:112,114 !< 1-001:aMSumatI:115,118
1-002:akalkanaH:279,282 !< 1-003:akalkA:283,286
1-004:akravyAdaH:485,489 !< 1-004:akramaH:490,497
1-005:akzaraH:725,730 !< 1-005:akzaraM:731,750
1-006:akzaravinyAsaH:791,792 !< 1-006:akzaramuKaH:793,794
1-006:akzIbaH:856,857 !< 1-006:akzIbaH:858,858
1-007:agarhitaH:1026,1030 !< 1-007:agaru:1031,1032
1-008:agADaH:1088,1091 !< 1-008:agADaM:1092,1092
1-009:agnicit:1291,1296 !< 1-009:agnijaH:1297,1298
1-009:agnijvAlA:1313,1316 !< 1-009:agnijihvA:1317,1325
1-009:agnivardDanaM:1363,1367 !< 1-009:agniBaM:1368,1369
1-010:agnivAhaH:1418,1419 !< 1-010:agnibAhuH:1420,1421
1-012:agresarikaH:1798,1800 !< 1-012:agyraH:1801,1803
1-013:aGnyaH:1865,1867 !< 1-013:aGnyA:1868,1871
1-014:aNgAt:2077,2082 !< 1-014:aNganApriyaH:2083,2087 (Here the definition of aNganA has been erroneously split. aNgAt is not a separate word.)
1-015:aNgAratElaM:2192,2200 !< 1-015:aNgArakamaRiH:2201,2206
1-016:aNgulIpaYcakaM:2338,2342 !< 1-016:aNgulIyaH:2343,2344
1-019:ajInapatrI:2854,2855 !< 1-019:ajinayoniH:2856,2858
1-022:ajyezWavfttiH:3397,3407 !< 1-023:ajEkapAt:3408,3413
1-023:ajJalaM:3436,3439 !< 1-023:ajYaH:3440,3447
1-023:iti:3495,3503 !< 1-023:aYjanaH:3504,3510 (Definition of aYjanaM has been erroneously split. iti is not a separate word.)
1-023:anja:3532,3535 !< 1-023:aYjaliH:3536,3541 (Definition of aYjanI has been erroneously split.)
1-025:ataH:3786,3788 !< 1-025:ataeva:3789,3794
1-026:atipanTAH:4014,4016 !< 1-026:atipatraH:4017,4018
1-027:atiBAragaH:4043,4047 !< 1-027:atiBIH:4048,4049 (Here the original entry atitaBIH is wrong. it should be atiBI)
1-029:ayamarTaH:4481,4643 !< 1-030:atisArakI:4644,4646 (Wrong split in definition of ati. ayamarTaH is not a separate word.)
1-030:atisArakI:4644,4646 !< 1-030:atisAmyA:4647,4650
1-031:atyantagAmI:4749,4752 !< 1-031:atyantaH:4753,4754
1-032:atrinetrajaH:4867,4867 !< 1-032:atrinetraprasUtaH:4868,4869
1-035:adButasvanaH:5352,5354 !< 1-035:adButasAraH:5355,5356
1-035:adButasAraH:5355,5356 !< 1-035:admaniH:5357,5358
1-036:aDaHpuzpI:5475,5480 !< 1-036:aDaHkziptaH:5481,5482
1-038:aDimAsaH:5756,5762 !< 1-038:aDimAMsakaH:5763,5770
1-039:aDizWAnaM:5991,5996 !< 1-039:aDikziptaH:5997,6000
1-040:aDo'MSukaM:6059,6060 !< 1-040:aDoGaRwA:6061,6062
1-040:aDovAyuH:6082,6085 !< 1-040:aDokzajaH:6086,6090
1-040:aDyaSanaM:6108,6110 !< 1-040:aDyakzaH:6111,6114
1-040:aDyuzwraH:6169,6170 !< 1-040:aDyuQaH:6171,6173
1-040:aDvajA:6194,6196 !< 1-040:aDvanInaH:6197,6198
1-045:anahaNkftiH:6918,6921 !< 1-045:anakzaH:6922,6923
1-046:animezaH:7140,7141 !< 1-046:aniyataM:7142,7146
1-047:anirvviRRaH:7207,7213 !< 1-047:anirvftiH:7214,7217
1-047:anilAmayaH:7230,7231 !< 1-047:anirlocitaH:7232,7236
1-047:anizpannaH:7258,7262 !< 1-047:anikzuH:7263,7269
1-048:anugavInaH:7407,7409 !< 1-048:anugamaH:7410,7416
1-050:anuBUtiH:7687,7689 !< 1-050:anuBUtAdyavismftiH:7690,7694
1-052:anuhAra:7979,7983 !< 1-052:anukzaRaM:7984,7986
1-054:antaritaM:8257,8262 !< 1-054:antarikzaM:8263,8269
1-054:antarIyaM:8273,8278 !< 1-054:antarIkzaM:8279,8282
1-061:anvazwakA:9437,9456 !< 1-061:anvakzaH:9457,9459
1-064:aparatvaM:9810,9817 !< 1-064:aparatiH:9818,9821
1-065:aparyyuzitaM:10052,10056 !< 1-065:aparvvadaRqaH:10057,10059
1-068:apAstaM:10514,10517 !< 1-068:apAkzaM:10518,10523
1-069:apeyaM:10611,10617 !< 1-069:apekzaRIyaM:10618,10621
1-070:apratyayaH:10760,10766 !< 1-070:apratyakzaH:10767,10772
1-071:abjayoniH:10987,10989 !< 1-071:abjavAhanaH:10990,10991
1-072:aBayA:11086,11088 !< 1-072:aBakzyaM:11089,11102
1-075:aBizavaH:11640,11646 !< 1-075:aBizavaM:11647,11648
1-078:aBIzwA:12041,12046 !< 1-078:aBIkzRaM:12047,12055
1-078:aByaNgaH:12092,12109 !< 1-078:aByaNkzaH:12110,12111
1-079:aByuditaH:12263,12271 !< 1-079:aByupagataH:12272,12278
1-081:ama:12495,12496 !< 1-081:am:12497,12498
1-083:amAyikaH:12801,12803 !< 1-083:amAvasI:12804,12806
1-085:amUrttaH:13203,13218 !< 1-085:amUdfSaH:13219,13223
1-086:amftavallI:13280,13281 !< 1-086:amftarasA:13282,13291
1-091:ayAnayInaH:14136,14139 !< 1-091:ayantritaH:14140,14145
1-092:ayuktaM:14194,14206 !< 1-092:ayugmacCadaH:14207,14214
1-093:Gawwa:14418,14422 !< 1-093:araGawwakaH:14423,14424 (Wrong split of araGawwaH. Gawwa is not a separate word here.)
1-093:arawuH:14429,14430 !< 1-093:araRiH:14431,14432
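The records in this list have a regular shape, so they can be split into fields mechanically. The sketch below is only an illustration: the reading of the first field as volume-page and of the two numbers as a line range in the digitization is an assumption, and `Entry`, `RECORD` and `parse_record` are hypothetical names, not part of any existing tool.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    page: str      # e.g. "1-001" (assumed: volume-page)
    headword: str  # headword in SLP1
    first: int     # assumed: first line of the entry in the digitization
    last: int      # assumed: last line of the entry


# left entry, "!<", right entry, then an optional note in parentheses
RECORD = re.compile(
    r"^(\d+-\d+):([^:]+):(\d+),(\d+)\s*!<\s*"
    r"(\d+-\d+):([^:]+):(\d+),(\d+)\s*(?:\((.*)\))?\s*$"
)


def parse_record(line):
    """Return (left, right, note); note is None when no correction note is given."""
    m = RECORD.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized record: {line!r}")
    g = m.groups()
    left = Entry(g[0], g[1], int(g[2]), int(g[3]))
    right = Entry(g[4], g[5], int(g[6]), int(g[7]))
    return left, right, g[8]


left, right, note = parse_record(
    "1-023:anja:3532,3535 !< 1-023:aYjaliH:3536,3541 "
    "(Definition of aYjanI has been erroneously split.)"
)
print(left.headword, right.headword, note)
```

With the records parsed this way, the ones carrying a parenthetical note can be separated from the rest by testing whether `note` is `None`.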

funderburkjim commented 9 years ago

This is a copy of Dhaval's first comment on alphabetization conventions in SKD:

A few observations on the wrong sorting. SKD follows some conventions that are not proper according to Sanskrit sorting conventions.

SKD places visarga before anusvAra, e.g. 1-005:akzaraH:725,730 !< 1-005:akzaraM:731,750

SKD sorts 'v' at places where 'b' should be there, e.g. 1-009:agnivardDanaM:1363,1367 !< 1-009:agniBaM:1368,1369

SKD treats 'kz' as a separate consonant, and sorts it after 'l', e.g. 1-045:anahaNkftiH:6918,6921 !< 1-045:anakzaH:6922,6923
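To see how far these three observations go in explaining the flagged pairs, one could compare a standard sort key with an "SKD-like" key. The sketch below is an assumption-laden illustration, not a description of SKD's actual rules: it puts visarga before anusvara, sorts 'v' as if it were 'b', and treats 'kz' as one letter. It places 'kz' last, after 'h', because that is what the quoted example (anahaNkftiH before anakzaH) requires; the note above says "after 'l'", so this placement is a guess. All names in the code are hypothetical.

```python
VOWELS = list("aAiIuUfFxXeEoO")
CONSONANTS = list("kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh")

# Standard order (assumed): anusvara M before visarga H.
STANDARD = VOWELS + ["M", "H"] + CONSONANTS
# SKD-like order (assumed): visarga before anusvara, 'kz' as a letter placed last.
SKD_LIKE = VOWELS + ["H", "M"] + CONSONANTS + ["kz"]


def tokenize(word, use_kz):
    """Split a word into letters, optionally keeping 'kz' together as one letter."""
    tokens, i = [], 0
    while i < len(word):
        if use_kz and word.startswith("kz", i):
            tokens.append("kz")
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens


def make_key(order, use_kz=False, fold_v_to_b=False):
    """Build a sort-key function for the given letter order."""
    rank = {t: n for n, t in enumerate(order)}

    def key(word):
        if fold_v_to_b:
            word = word.replace("v", "b")  # sort 'v' as if it were 'b'
        return [rank.get(t, len(order)) for t in tokenize(word, use_kz)]

    return key


standard_key = make_key(STANDARD)
skd_key = make_key(SKD_LIKE, use_kz=True, fold_v_to_b=True)

# The three example pairs quoted above.
for a, b in [("akzaraH", "akzaraM"),
             ("agnivardDanaM", "agniBaM"),
             ("anahaNkftiH", "anakzaH")]:
    print(a, b, standard_key(a) < standard_key(b), skd_key(a) < skd_key(b))
```

Each of the three pairs is out of order under the standard key and in order under the SKD-like key. Pairs that stay misordered under both keys would be the better candidates for real spelling or digitization errors.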

funderburkjim commented 9 years ago

These SKD conventions Dhaval is observing are good. When the task is finished, it should be possible to have a complete explanation of all the skd alphabetization exceptions (perhaps many will be anomalous, i.e., have no obvious explanation).

Then, this explanation might be added as a document in the Developer documentation for SKD, since it would be useful for other researchers to know what Dhaval is discovering.

Here is a link to the full list of keys for skd (as of today) in case it should be needed https://dl.dropboxusercontent.com/u/29859999/skdkeys-20140815.zip. This list is normally part of the skdxml download on the skd downloads page of Cologne web site.

gasyoun commented 9 years ago

It's great to see, that lessons learned at https://groups.google.com/forum/#!topic/sanskrit-programmers/HTyINaNbvUQ are not lost and php sorting software can be used in several ways. Dhaval's work is of huge interest and importance.

funderburkjim commented 9 years ago

Dhaval - could you remind me where your work stands on this issue of alphabetization errors in SKD?

As I understand it, there is still work to be done here?

Do you think that this approach will find some headword spelling errors in SKD?

The reason for the question is, that Sampada has a paper SKD but in a month or so she is moving and may not have the paper SKD. So, this might be a good project for her now. What do you think?

funderburkjim commented 9 years ago

re 1-029:ayamarTaH:4481,4643 !< 1-030:atisArakI:4644,4646 (Wrong split in definition of ati. ayamarTaH is not a separate word.)

I can't confirm that. So, for now am not changing.

The Sanskrit is too difficult for me to understand the context. But, assuming 'ayamarTaH' means "This is the meaning", perhaps what follows ayamarTaH is an explanation of the quote(s) preceding ayamarTaH. If such is the case, then I'll make the correction.

The other 5 corrections in the list above (the list starting with '1-001:aH:41,47 !< 1-001:afRI:48,55' ) have been made today.

gasyoun commented 9 years ago

@funderburkjim I could try to make a fuzzy list of possible list of SKD in a week. Could compare it to VCP or MW. If Sampada is up to SKD - it's great, because otherwise the Indian origin dictionaries are even in a worse condition, than the others.

funderburkjim commented 9 years ago

Sampada may be working on something else for Peter, as I haven't heard from her in several days.

If she finishes the mis-alphabetization cases in SKD, I was thinking about asking her to check my work on a rather long list of textual 'db' corrections, so that the task begun in July could be brought to a close.

I presume your 'fuzzy list of possible list of SKD' means generating a list of possible spelling errors in headwords in SKD? Perhaps it would be better to focus on headword corrections before getting into the 'db' textual corrections. What do others think regarding priority (headwords v. 'db')?

gasyoun commented 9 years ago

@funderburkjim I'm an advocate of headword priority. Am I the only one?

funderburkjim commented 9 years ago

Headwords are the main gateway to users of a traditional paper dictionary. Similarly, to users of the Cologne displays (with the exception of the Advanced Search 'Text' searches). This observation makes a strong case for the priority of getting the headwords right.

However, once the 'major' headwords are right (major in terms of estimated frequency of user inquiry), then further correction of 'obscure' headwords probably has no higher priority than corrections of egregious spelling errors in the text. For example, if a user of CCS looks up vidyAvid and sees as definition the misspelled 'wssenskundig', surely that experience is at least slightly unpleasant. So, text errors are important, too.

So, that's my little two-step dance (one step forward, one step back) on the issue.

gasyoun commented 9 years ago

I would like to clean up CCS only when 98% of the headwords are right. When do I know they are right? When I do not find new issues for a few weeks. So I would not worry much about wssenskundig, because people who use CCS understand German well enough. It's a small dictionary and only of limited use in the digital world. It was popular as a printed book. And it will never be popular as a digital source because of its limitations.

funderburkjim commented 9 years ago

@drdhaval2785

Dhaval - I think I misinterpreted your alphabetization list of corrections (see my comment above of Aug. 15) .

I had thought that only the words with parentheses needed correction.

HOWEVER, I now think that the list above contains all records from skdhw2_chksort.txt up through arawuH, which is the last record with page number less than 100. And, that the corrections have been made 'silently', in 15 lines (in addition to the 6 lines with parentheses).

Does this sound right?

funderburkjim commented 9 years ago

Sampada is now working on the alphabetization errors in SKD. The corrections in Dhaval's list above with parenthetical comment have already been entered by me. Not counting these, there are 812 cases to consider.

The first 65 of these cases correspond to ones Dhaval has already checked. A comparison of Sampada's and Dhaval's solutions for these 65 resulted in agreement except for 2 cases:

case 0007: akzIvaH !< akzIyaH   
  (dhaval akzIbaH,akzIbaH)
  (sampada, akzIvaH, akzIvaH)

ejf:  Currently accept change as akzIvaH, akzIvaH

case 0049: aparyyuzitaM !< aparbbadaRqaH
  dhaval: aparvvadaRqaH
 sampada: no change

ejf: Currently accept change as aparvvadaRqaH

One meta observation is that it was good to have two sources of correction, as in 5 cases Sampada revised her corrections based on Dhaval's.
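The comparison of two independent sets of corrections can be sketched as follows. This is only an illustration of the idea, not the procedure actually used; the dictionaries hold just the two disputed cases from this comment, Sampada's "no change" is recorded as the unchanged spelling aparbbadaRqaH (an interpretation), and `compare_corrections` is a hypothetical helper.

```python
def compare_corrections(first, second):
    """Return (agree, differ): lists of (case, first_value, second_value)."""
    agree, differ = [], []
    for case in sorted(set(first) | set(second)):
        a, b = first.get(case), second.get(case)
        if a == b:
            agree.append((case, a, b))
        else:
            differ.append((case, a, b))
    return agree, differ


# The two disputed cases from this comment, keyed by case number.
dhaval = {"0007": "akzIbaH", "0049": "aparvvadaRqaH"}
sampada = {"0007": "akzIvaH", "0049": "aparbbadaRqaH"}

agree, differ = compare_corrections(dhaval, sampada)
for case, a, b in differ:
    print(f"case {case}: {a} vs {b}")
```

A case that appears in only one of the two sources shows up in `differ` with `None` on the other side, which makes gaps in either list easy to spot.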

Next are some observations made by Sampada, Peter, and me regarding these two cases.

case 0007: akzIvaH !< akzIyaH   . Everyone agrees that akzIyaH   is wrong.  But should it (and the
prior word) have a 'v' akzIvaH or a 'b' akzIbaH?

Sampada: I have changed the word to <akzIvaH> because according to me entry <akzIyaH> is wrong. 
Scan shows <akzIvaH> clearly. But meaning tallies with <akzIbaH>. Letter 'va' very clear in scan and 
often 'va' and 'ba' replace each other. But Dhaval's <akzIbaH> entry seems right. Obviously, if <ba> is 
right then alphabetical order in scan is definitely wrong.
Peter: akzIbaH akzIbaH
This one is hard to tell from the dict. itself.  The Madhaviya Dhatuvrtti has both roots kzIb and kziv.  
But the first is given with the meaning 'made' (hence SKD mAdyati as a gloss), and has the marker 'f'.  
By this the previous entry too should be akzIbaM not akzIvaM

Jim:
      a.  The scan is clearly a 'v' rather than a 'b'
      b.  Either 'v' or 'b' fits into the alphabetical ordering of neighboring words.
      c.  SKD has a kzIba as headword, with alternate 'kzIva' implied; so we could
           conclude that the author was using the alternate 'v' in akzIvaH, as in the scan.
<HI>kzIba, (va) f Na made . iti kavikalpadrumaH . (BvAM-
<>AtmaM-akaM-sew .) made mattIBAve . f, acikzI-
<>bat . Na, kzIbate madyapaH . iti durgAdAsaH ..
     d. There is a headword AkzIva:
<HI>AkzIva, puM, (AN + kzIva + ac .) akzIvaH .
<>SoBAYjanavfkzaH . iti rAyamukuwaH ..

case 0049: aparyyuzitaM !< aparbbadaRqaH

Sampada: I didn't see a reason to change the word 'aparbbadaRqaH'. It is very clear in the scan. 
Also, the word 'aparbba' is used like this twice within the explanation.
Peter: aparvvadaRqaH - alphabetic order tells here that it is 'v' not 'b' because it is after 'y'.  Also the 
compound element parvan is a well-known word (it is not 'parban')
Jim: 
      a. I agree with Sampada that the scan clearly shows 'b', but
      b. Alphabetical ordering of nearby headwords clearly favors 'v'
      c. Another factoid:  Searching through the entire digitization skd.txt (not only headwords),
          'parbb' occurs 280 times and 'parvv' occurs 2230 times.
          - It is a separate question whether we should change all those 'parbb' to 'parvv', even if
            the text is not consistent.
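A count like the one in point c can be reproduced with a few lines of code. The sketch below is an assumption about how one might do it, not the method actually used; it counts plain substring occurrences, and the file name skd.txt and the UTF-8 encoding in `count_in_file` are assumptions about the local copy of the digitization.

```python
def count_variants(lines, variants):
    """Count non-overlapping occurrences of each variant across all lines."""
    totals = {v: 0 for v in variants}
    for line in lines:
        for v in variants:
            totals[v] += line.count(v)
    return totals


def count_in_file(path, variants, encoding="utf-8"):
    """Same count, reading the digitization line by line from a file."""
    with open(path, encoding=encoding) as f:
        return count_variants(f, variants)


# Tiny in-memory sample; for the real file one would call
# count_in_file("skd.txt", ["parbb", "parvv"]).
sample = ["aparbbadaRqaH", "aparvvadaRqaH parvva"]
print(count_variants(sample, ["parbb", "parvv"]))
```

Because the count is over substrings, it includes occurrences inside longer words and inside headword lines as well as body text.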

gasyoun commented 9 years ago

Re "It is a separate question whether we should change all those 'parbb' to 'parvv', even if the text is not consistent": sure, I would not touch these, but some interlinking markup does make sense.

funderburkjim commented 9 years ago

But, haven't we already done some changes with the 'db' headwords that are analogous to changing 'parbb' to 'parvv' ? And, I have a large list of textual 'db' changes that I would like to make (like 'dbAdaSa' -> 'dvAdaSa'). As long as we document such changes, such as in the history file for SKD, it seems to me ok to do this.

However, I am also agreeable to deferring such non-headword changes while there is still obvious work to do on headword changes.

gasyoun commented 9 years ago

The fact is that, at least in the ayurvedic field, several common terms occur with both v and b even inside a single critical edition text, so killing one in our list would not help much; it would raise even more issues. Re "But, haven't we already done some changes with the 'db' headwords that are analogous to changing 'parbb' to 'parvv' ?": as I remember, yes, we have, and that's where the ground is not safe under our feet. I mean, I love to change and modify something that is not mine; still, there are practical reasons not always to do so, I guess.

funderburkjim commented 9 years ago

Peter reviewed my comments on akzIbaH v. akzIvaH, and reaffirmed akzIbaH:

Peter:
My conclusion that it should be akzIbaH akzIbaH and that the preceding entry should be akzIbaM, is 
not a light conclusion to be overridden by what the scan looks like.  The compiler of the SKD was a 
competent Sanskrit scholar who would have known which root occurs in the meaning made. That root 
kzIb and not the other kzIv is the one he intended by his use of the gloss mAdyati.  'b' and 'v' are 
confusable especially in Bengal and what the scan looks like between these two bears very little 
weight.

Since Peter is strongly in favor of akzIbaH and that is what Dhaval also suggested, I'm changing my mind and using akzIbaH.

gasyoun commented 9 years ago

So no more hard cases left out for now? Sounds great, let's see what Sampada will find next. Peter answers quickly indeed and in a way we can hardly argue with him. @funderburkjim do you understand how to use Dhaval's multisorter to see what additional entries might be at the (possibly) wrong place?

funderburkjim commented 9 years ago

Sampada (with some input from ejf and Peter) examined all the headword alphabetical misorderings and made 269 headword changes. For full details, see the files in the https://github.com/sanskrit-lexicon/CORRECTIONS/tree/master/dictionaries/SKD directory. This substantially extends the work begun by Dhaval.

funderburkjim commented 9 years ago

@gasyoun re 'So no more hard cases left out for now?' There are still alphabetical misorderings, but probably most of these are due to some intention of the author. There might be a few principles that would explain the misorderings. Someone fluent in Sanskrit might find it an interesting exercise to discover these principles.

re 'let's see what Sampada will find next' : Sampada's working on the alphabetical misorderings in VCP now.

re 'Dhaval's multisorter' Don't know what this is.

gasyoun commented 9 years ago

Dhaval's multisorter is a collection of PHP scripts at https://github.com/drdhaval2785/SanskritSorting, including multi13.php, which can sort the right way, so you can make a comparison. Input is SLP1; it works from the command line (CMD) as well now.

drdhaval2785 commented 9 years ago

@funderburkjim and @gasyoun Sorry to have missed so much of the discussion. I was not following this repository earlier, and I was not mentioned either, so I missed it. I will have a look at this later if time permits.

sanskrit-lexicon / SKD

alphabetizing errors in skd headwords #2