sanskrit-lexicon / CORRECTIONS

Correction history for Cologne Sanskrit Lexicon

Alphabetic misordering #293

Open drdhaval2785 opened 8 years ago

drdhaval2785 commented 8 years ago

While sending correction submissions, alphabetic misorderings pop up now and then. There has been no general examination of such cases in the recent past. This issue is devoted to creating common code that can be run over any dictionary to find alphabetically misordered headwords. @funderburkjim may like to share any such code, if it exists.

E.g. for the headword viSvaSfjas, the preceding and following headwords begin with 'viSvasfj'. So it is alphabetically misordered.

gasyoun commented 8 years ago

@drdhaval2785 as there can be different sorting modes, maybe your sorting code is the one which will solve the issue after @funderburkjim shares his default tester code?

drdhaval2785 commented 8 years ago

Here is my rough code for finding alphabetically mismatched headwords. The output can be seen in this directory.

The logic is -

  1. Extract headword:L-num pairs for each dictionary from sanhw2.txt and put them in the withlnum directory, e.g. the file MCIwithlnum.txt has aMSAvataraRa:629.
  2. Read these into an array.
  3. Find the entries whose L-number is lower than that of the previous entry. When such a pair is found, write it to the mismatch directory in the following format.
;previousword:previouslnum
word:lnum

e.g.

;aDfzyA:983
anaGa:7

Here, in alphabetical order aDfzyA comes before anaGa, yet its L-num (983) is higher than that of anaGa (7), so the dictionary places it later. There can be many reasons for such a mismatch. Each case needs to be examined.
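
The three steps above can be sketched in Python roughly as follows. This is only an illustration, not the actual script: the function names `find_mismatches` and `format_pair` are hypothetical, the input is assumed to be a list of `headword:lnum` strings already in the alphabetical order of sanhw2.txt, and comparing L-nums as numbers is an assumption.

```python
def find_mismatches(entries):
    """Return (prev_word, prev_lnum, word, lnum) where the L-num decreases.

    `entries` is a list of 'headword:lnum' strings in alphabetical order.
    """
    parsed = []
    for line in entries:
        word, lnum = line.strip().rsplit(":", 1)
        parsed.append((word, float(lnum)))  # float copes with L-nums like 12.1

    pairs = []
    for (prev_word, prev_lnum), (word, lnum) in zip(parsed, parsed[1:]):
        if lnum < prev_lnum:
            pairs.append((prev_word, prev_lnum, word, lnum))
    return pairs


def format_pair(pair):
    """Render one mismatch in the ';previous' / 'current' two-line format."""
    prev_word, prev_lnum, word, lnum = pair
    return f";{prev_word}:{prev_lnum:g}\n{word}:{lnum:g}"


sample = ["aDfzyA:983", "anaGa:7"]
for pair in find_mismatches(sample):
    print(format_pair(pair))
```

Run on the two sample entries, this prints the pair in the same two-line format as the example above.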

drdhaval2785 commented 8 years ago

@funderburkjim and @gasyoun Just to demonstrate the usefulness of the approach, I am doing a sample examination of the first 10 pairs of misordering of BEN dictionary.

;aMsa:8
aMsamaYja:1291
;aMsala:9
aMh:10
;aNgAraka:99
aNgin:100
;aBimevana:851
aBiyAcanA:793
;arogyatA:998
aronitA:997
;ark:999
arka:1000
;AMvantya:1726
AkatTana:1340
;Adraya:1691
AdrisAra:1512
;Avaha:1737
AvADa:1583
;AsyA:1818
AsvaRqala:1379

The results are as follows

aMsamaYja:asamaYja:t:
aBimevana:aBisevana:p:
aronitA:arogitA:t:
AMvantya:Avantya:p:
Adraya:Ardraya:t:
AvADa:AbADa:p:
AsvaRqala:AKaRqala:t:

Thus a total of 7 errors in 10 pairs (20 words). A 35% hit rate is a decent enough result for our purposes.

Hope this approach can be taken up after trigram examinations are over.

drdhaval2785 commented 8 years ago

The log file has the following details. We can take up the dictionaries with fewer mismatches first and then move on to those with more.

Found 6498 mismatched words in ACC dictionary
Found 468 mismatched words in CAE dictionary
Found 0 mismatched words in AE dictionary
Found 3028 mismatched words in AP90 dictionary
Found 3637 mismatched words in AP dictionary
Found 95 mismatched words in BEN dictionary
Found 79 mismatched words in BHS dictionary
Found 83 mismatched words in BOP dictionary
Found 0 mismatched words in BOR dictionary
Found 340 mismatched words in BUR dictionary
Found 312 mismatched words in CCS dictionary
Found 69 mismatched words in GRA dictionary
Found 48 mismatched words in GST dictionary
Found 2649 mismatched words in IEG dictionary
Found 3062 mismatched words in INM dictionary
Found 29 mismatched words in KRM dictionary
Found 860 mismatched words in MCI dictionary
Found 93 mismatched words in MD dictionary
Found 5825 mismatched words in MW72 dictionary
Found 23960 mismatched words in MW dictionary
Found 0 mismatched words in MWE dictionary
Found 293 mismatched words in PD dictionary
Found 2331 mismatched words in PE dictionary
Found 234 mismatched words in PGN dictionary
Found 520 mismatched words in PUI dictionary
Found 8038 mismatched words in PWG dictionary
Found 295 mismatched words in PW dictionary
Found 910 mismatched words in SCH dictionary
Found 1075 mismatched words in SHS dictionary
Found 572 mismatched words in SKD dictionary
Found 110 mismatched words in SNP dictionary
Found 93 mismatched words in STC dictionary
Found 801 mismatched words in VCP dictionary
Found 92 mismatched words in VEI dictionary
Found 758 mismatched words in WIL dictionary
Found 1411 mismatched words in YAT dictionary

drdhaval2785 commented 8 years ago

Some cursory analysis of various dictionaries (for future improvement)

  1. Remove addenda and corrigenda pages from all dictionaries. They produce a large number of false positives. @funderburkjim will need to list the dictionaries that have a sizeable number of addenda and corrigenda, and provide the L-num of the word just before the addenda pages for each of them. (This will take care of point 8 too.)
  2. ACC - There are three volumes of ACC. This causes aMSumattantra to have L-num 41686, even though it should come before aMSumadBedasaMgraha (L-num 4). If we find a way to segregate these three volumes into separate headword sets and rerun the code, we will get a more workable list than the present 12,996 entries.
  3. AP90, AP, WIL - These are highly erratic in headword ordering. See this AP90 page and this AP page. AP90 has aMSakaH, aMSala, aMSanaM, aMSayitf. The properly sorted order would be aMSakaH, aMSanaM, aMSayitf, aMSala. I am not sure why such gross misordering is there. One possible explanation may be - sort by verb first, then by pratyaya.
  4. BEN, BHS, BOP, BUR, CAE, CCS, GRA, GST, PD, PE, STC - Most are errors. No adjustments needed.
  5. IEG, INM, PUI - These dictionaries don't distinguish between letters with and without diacritics when sorting. E.g. see this IEG page and this INM page. In IEG, akAlika and aKaRqadIpa are placed among headwords starting with 'A'. This would need some refactoring to ignore such false positives.
  6. KRM - Here the headwords are verbs. Sorting is erratic, but there are not many errors. Can be skipped.
  7. MCI - This dictionary has separate chapters such as 1.1, 1.2, 1.3 etc. Each chapter starts with 'a'. This would need some adjustments.
  8. MD, MW72, PW, PWG, SCH, SNP, YAT - Exclude the addenda headwords of MD, MW72, PW, PWG from the list. (Programmatically this would require identifying the L-num of the last word of the main book. Any L-num greater than that number should be ignored.)
  9. MW - Headwords whose L-nums have .1, .2 etc. are usually alternate headwords or come from addenda pages. We should ignore these words when comparing.
  10. PGN, VEI - The headwords are not necessarily in alphabetic order but are classified thematically. So it is better to skip these dictionaries.
  11. SKD - This dictionary tends to put 'M' and 'H' after 'O'. See this page.
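
The exclusions proposed in points 1, 8 and 9 could be sketched roughly as below. The cutoff values and the helper name `keep_entry` are hypothetical placeholders; real cutoffs would have to be the L-num of the last headword of the main text of each dictionary. Point 9 is proposed only for MW, but the sketch applies it to every dictionary for simplicity.

```python
# Hypothetical cutoffs: L-num of the last headword before the addenda.
# The numbers here are placeholders, not real values.
ADDENDA_CUTOFF = {"MD": 1000, "PW": 2000}


def keep_entry(dictcode, lnum):
    """Decide whether a headword should take part in the ordering check.

    `lnum` is the L-number as a string, e.g. '629' or '629.1'.
    """
    # Point 9: L-nums with a fractional part (.1, .2, ...) are skipped.
    if "." in lnum:
        return False
    # Points 1 and 8: anything past the main text is treated as addenda.
    cutoff = ADDENDA_CUTOFF.get(dictcode)
    if cutoff is not None and int(lnum) > cutoff:
        return False
    return True


print(keep_entry("MD", "999"), keep_entry("MD", "1001"), keep_entry("MW", "12.1"))
```

Dictionaries with several volumes would need one cutoff range per volume instead of a single number.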

drdhaval2785 commented 8 years ago

Analysis completed. Will make the examination HTMLs soon.

gasyoun commented 8 years ago

There can be many reasons for such mismatch. Needs to be examined.

For example additions in PWG.

So

Remove addenda and corrigenda pages from all dictionaries.

Is the very first condition not to see lists of false positives.

Thus total of 7 errors in 10 pairs (20 words). 35% is decent enough result for our perspective.

I'm deeply impressed.

We can take up smaller number dicts first and then move to higher error ones.

Fully agree.

more workable list than the present 12,996 entries.

Right, otherwise a hell.

One possible explanation may be - Sort by verb, and second sort by pratyaya.

What do you mean sort by verb? Words from one verb - keep together?

doesn't distinguish between letter with diacritic and without diacritic

Just like sorted in MS Word :)

(Programmatically this would require identification of the L-num of last word of proper book. Any L-num which is more than that number should be ignored).

PW, PWK has 7 volumes. So it takes 7 L numbers to remember, not just one.

So better to skip these dictionaries.

Agree.

tendency to put 'M' and 'H' after 'O'.

Good catch. What consequences?

drdhaval2785 commented 8 years ago

https://github.com/sanskrit-lexicon/CORRECTIONS/commit/7948bef5cd2bfade1af0af0ecd77ffb7261c49c7 This commit accommodated the addenda / corrigenda / volume issues.

Now these false positives are removed. False positives caused by L-nums at 9, 99, 999, 9999 and 99999 are also removed.
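
One plausible source of the false positives at 9, 99, 999 and so on is comparing L-nums as text, where '10' sorts before '9'. This is an assumption; the thread does not state the cause. Pairs such as aMsala:9 / aMh:10 in the BEN sample above fit that pattern. A small Python illustration:

```python
pairs = [("9", "10"), ("99", "100"), ("999", "1000")]

for prev, cur in pairs:
    as_text = cur < prev              # lexicographic comparison of strings
    as_number = int(cur) < int(prev)  # numeric comparison
    # A True value means the pair would be reported as a mismatch.
    print(prev, cur, "text:", as_text, "number:", as_number)
```

With text comparison every pair is flagged; with numeric comparison none is.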

The latest log reads as follows. (30-Apr-2016)

Found 1265 mismatched words in ACC dictionary
Found 464 mismatched words in CAE dictionary
Found 0 mismatched words in AE dictionary
Found 3024 mismatched words in AP90 dictionary
Found 3633 mismatched words in AP dictionary
Found 91 mismatched words in BEN dictionary
Found 75 mismatched words in BHS dictionary
Found 80 mismatched words in BOP dictionary
Found 0 mismatched words in BOR dictionary
Found 336 mismatched words in BUR dictionary
Found 212 mismatched words in CCS dictionary
Found 65 mismatched words in GRA dictionary
Found 45 mismatched words in GST dictionary
Found 2646 mismatched words in IEG dictionary
Found 3058 mismatched words in INM dictionary
Found 28 mismatched words in KRM dictionary
Found 16 mismatched words in MCI dictionary
Found 67 mismatched words in MD dictionary
Found 5302 mismatched words in MW72 dictionary
Found 3 mismatched words in MW dictionary
Found 0 mismatched words in MWE dictionary
Found 290 mismatched words in PD dictionary
Found 2328 mismatched words in PE dictionary
Found 232 mismatched words in PGN dictionary
Found 517 mismatched words in PUI dictionary
Found 342 mismatched words in PWG dictionary
Found 290 mismatched words in PW dictionary
Found 105 mismatched words in SCH dictionary
Found 1071 mismatched words in SHS dictionary
Found 568 mismatched words in SKD dictionary
Found 108 mismatched words in SNP dictionary
Found 89 mismatched words in STC dictionary
Found 797 mismatched words in VCP dictionary
Found 89 mismatched words in VEI dictionary
Found 754 mismatched words in WIL dictionary
Found 793 mismatched words in YAT dictionary

Here each mismatch is a pair, so double the number to get the count of words to be examined. Most are now in a workable range. ACC, AP90, AP, IEG, INM, PUI, MW72, MW and PE may be skipped for now. The rest are doable.
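
To pick the dictionaries with the smallest workload first, the log can be parsed and sorted. A minimal sketch, assuming the log lines have exactly the form shown above; the function name `workload` is hypothetical.

```python
import re

LOG_LINE = re.compile(r"Found (\d+) mismatched words in (\w+) dictionary")


def workload(log_text):
    """Return (dictionary, words_to_examine) sorted by ascending workload."""
    rows = []
    for count, name in LOG_LINE.findall(log_text):
        # Each mismatch is a pair, so two words need examination.
        rows.append((name, 2 * int(count)))
    return sorted(rows, key=lambda row: row[1])


log = """Found 91 mismatched words in BEN dictionary
Found 28 mismatched words in KRM dictionary
Found 45 mismatched words in GST dictionary"""
print(workload(log))
# → [('KRM', 56), ('GST', 90), ('BEN', 182)]
```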

gasyoun commented 8 years ago

Found 5306 mismatched words in MW72 dictionary Found 20844 mismatched words in MW dictionary

Why soo many?

drdhaval2785 commented 8 years ago

Mostly because sub-headwords are given headword status. If @funderburkjim can give a list of only headwords (without alternate headwords or sub-headwords), I can try it and let you know if there is an improvement.

gasyoun commented 8 years ago

Guess it's doable for @funderburkjim to have milk and water in two different dishes.

drdhaval2785 commented 8 years ago

Knock knock Jim. I am stuck at a point where only you can help me. If this last wish is fulfilled (separate headword and sub-headword lists for MW, MW72) - I am done. I won't make any further programmatic improvement to this approach. This documentation thread would end.

funderburkjim commented 8 years ago

@drdhaval2785 I'm engrossed in working on the two lists (#291, #287); this will probably be finished next week.

I'm not exactly sure of what it is you need from me to proceed. For determining alphabetical misordering, as I think of it, all you need, for a given dictionary is the list of headwords for that dictionary, in the order presented by the dictionary (e.g., aphw2.txt, mw72hw2.txt, etc.) Such a list is present for all dictionaries except mw, but there's not much point in trying to search for misorderings in mw.

Perhaps your use of sanhw2 (which appears to have the lnums) is an alternate way to essentially regenerate the sequential ordering of, say, aphw2.txt. That's a good idea.
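
The direct check described here, walking the headword list in dictionary order and flagging adjacent pairs that break alphabetical order, might look like the sketch below. It assumes headwords are in SLP1 and uses the usual Sanskrit alphabetical order for SLP1 letters. The function names are hypothetical, and characters outside that alphabet are simply sorted last.

```python
SLP1_ORDER = "aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"
RANK = {ch: i for i, ch in enumerate(SLP1_ORDER)}


def slp1_key(word):
    """Sort key placing SLP1 letters in Sanskrit alphabetical order."""
    return [RANK.get(ch, len(SLP1_ORDER)) for ch in word]


def out_of_order(headwords):
    """Return adjacent (previous, current) pairs that break SLP1 order."""
    return [
        (prev, cur)
        for prev, cur in zip(headwords, headwords[1:])
        if slp1_key(cur) < slp1_key(prev)
    ]


# The AP90 sequence quoted earlier in this thread.
ap90 = ["aMSakaH", "aMSala", "aMSanaM", "aMSayitf"]
print(out_of_order(ap90))
# → [('aMSala', 'aMSanaM')]
print(sorted(ap90, key=slp1_key))
# → ['aMSakaH', 'aMSanaM', 'aMSayitf', 'aMSala']
```

Unlike the L-num comparison, this needs no sorted master list, but it reports only the point where the order breaks, not which of the two headwords is the wrong one.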

So, again, let me know more specifically what you need from me to proceed. I'll try to get it for you so you can continue.

Note: Please defer addressing BEN and SKD misorderings until I'm finished with #291, #287 - so we don't duplicate work.

drdhaval2785 commented 8 years ago

there's not much point in trying to search for misorderings in mw.

Can you be more clear on this statement?

let me know more specifically what you need from me to proceed.

I need a separate list of only headwords (excluding sub-headwords and alternate headwords) in MW.

Please defer addressing BEN and SKD misorderings until I'm finished with #291 (https://github.com/sanskrit-lexicon/CORRECTIONS/issues/291), #287 (https://github.com/sanskrit-lexicon/CORRECTIONS/issues/287)

Agree.

Dr. Dhaval Patel, I.A.S Collector and District Magistrate, Anand www.sanskritworld.in

funderburkjim commented 8 years ago

Re there's not much point in trying to search for misorderings in mw.

The file all.txt might be a good place to look. Although I'd have to look at the construction of this file to be sure of all the details, my memory is that this file shows all the headwords of MW EXCEPT the 'HxA' records (the 'A' suffix records are those which give alternate senses); it also does not include records with <lex type="inh"> (inherited gender) for B and C records.

This list of currently 220,262 headwords is my current best answer to the question 'What are the headwords of MW'. This list could easily be filtered to give distinct headwords of MW.

The records are in L-number order, and show (in 5 tab-delimited fields), the H-code, the L-number, key1, key2 (in a simplified form); the 5th field shows a 'type' code, which is usually a normalized gender, but also includes other category designations.

You'll see many duplicate headwords, sometimes with sequential L-numbers (like 'a'), sometimes with non-sequential L-numbers (like 'nis'). It might be interesting to try to classify all the reasons for duplicate headwords.
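
A rough sketch of reading the five tab-delimited fields and separating sequential from non-sequential duplicates, assuming each line is `H-code<TAB>L-number<TAB>key1<TAB>key2<TAB>type` as described above. The function names are hypothetical and the sample rows are made up for illustration; they are not taken from all.txt.

```python
from collections import defaultdict


def duplicate_headwords(lines):
    """Map each key1 occurring more than once to its record positions.

    Positions are 0-based indexes into `lines`, which is assumed to be in
    L-number order, so adjacent positions mean sequential records.
    """
    positions = defaultdict(list)
    for index, line in enumerate(lines):
        hcode, lnum, key1, key2, typ = line.rstrip("\n").split("\t")
        positions[key1].append(index)
    return {k: v for k, v in positions.items() if len(v) > 1}


def is_sequential(indexes):
    """True if all records of a headword sit next to each other."""
    return all(b - a == 1 for a, b in zip(indexes, indexes[1:]))


# Made-up rows for illustration only.
rows = [
    "1\t1\ta\ta\tm",
    "1\t2\ta\ta\tind",
    "1\t3\tnis\tnis\tind",
    "1\t4\taMSa\taMSa\tm",
    "1\t5\tnis\tnis\tm",
]
for key1, idx in duplicate_headwords(rows).items():
    print(key1, "sequential" if is_sequential(idx) else "non-sequential")
```

Classifying the reasons behind each group would still need manual inspection; the sketch only separates the two patterns mentioned above.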

So, looking for alphabetical misorderings would be hard; at least that's my intuition, and why I said there's not much point in trying to search for misorderings in MW.

gasyoun commented 8 years ago

It might be interesting to try to classify all the reasons for duplicate headwords.

Sure it is.

why I said there's not much point in trying to search for misorderings in MW.

I still can't grasp it. Hard it might be, but I still do not understand why you think it would not be as useful as it was for most of the others.

drdhaval2785 commented 8 years ago

I fiddled with all.txt and agree with Jim's logic that it is fruitless to apply the alphabetic misordering check to MW.

gasyoun commented 8 years ago

fruitless to apply alphabetic misordering on MW.

Oh.