sanskrit-lexicon / CORRECTIONS

Correction history for Cologne Sanskrit Lexicon

o_vs_O Sorting Features #136

Open gasyoun opened 9 years ago

gasyoun commented 9 years ago

https://github.com/sanskrit-lexicon/CORRECTIONS/issues/135 continued. Exploring http://drdhaval2785.github.io/o_vs_O/output1/MW.html - what I was hunting for years.

dhaval

  1. Instead of a fake ID, I would prefer to use the L ID that Jim invented years ago. Sure, it's nice to know whether all three tables contain 4500 or only 500 mistakes, but for that we can have a stats block. Otherwise the "I have not used the updated sanhw1.txt, just because it would alter the numbering both of you and zaff are referring to" argument is not strong enough. Even after sanhw1.txt is updated, we keep using the old one. That could lead to fixing issues that are already fixed in other files. So it is not a good idea at all.
  2. If I copy-paste the full line now, I get:
     240 dyAvAkzAmA dyAvAkzamA द्यावाक्षामा द्यावाक्षमा MW SHS,VCP,WIL,YAT
     I would like an option to click on the L ID (7357 instead of 240) and get:
     7357 dyAvAkzAmA (MW) => dyAvAkzamA (SHS,VCP,WIL,YAT)
     I guess some Ajax script could solve the task. It would need to copy the data so that I could paste it afterwards.
  3. When I click on 2nd column SLP1 record, bring me to the text entry, not the PDF scan. In 3rd column - open PWG,PW,MW72 or GRA. If none, open any other.
  4. Sort by (ordered by hypothetical productivity, after working with a similar AP .XLS file):
     1) Word length (count letters): sort longest words first. If it's not an anusvara orthography difference, chances are high that a word with 14 letters has a real OCR error and a 4-letter word has none.
     2) Number of dictionaries: if a variant is in 11 dictionaries, chances are higher that it's a real mistake than if it's in only 2 dictionaries, for example dURAsa (MW) vs. dURASa (AP,BUR,CAE,CCS,GRA,PW,PWG,SHS,VCP,WIL,YAT).
     3) Letter that differs (letter in MW, letter in non-MW), e.g. 74347 vadhūṭīśayana vadhūṭaśayana ī a
        -- m ṃ (boring, 5% interest)
        -- m k (different consonants, possibly interesting, but if in the prefix, a different word; 10% interest)
        -- ā a (long vowels vs. short vowels, 20% interest)
     4) Place of the letter that differs (prefix, infix, suffix).
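A minimal sketch of sort criteria 1) to 3) above, assuming each table row is available as an SLP1 word, its variant, and the list of dictionaries carrying the variant. The names `Candidate`, `differing_letters`, and `sort_key` are hypothetical and not part of the o_vs_O code. Because SLP1 uses one ASCII character per sound, `len()` can serve as the letter count.

```python
from typing import NamedTuple, Tuple


class Candidate(NamedTuple):
    word: str                 # SLP1 spelling in the dictionary under review
    variant: str              # SLP1 spelling found in the other dictionaries
    others: Tuple[str, ...]   # dictionary codes that carry the variant


def differing_letters(a, b):
    """Return (letter in a, letter in b) pairs where equal-length spellings differ."""
    if len(a) != len(b):
        return []  # insertions/deletions would need a real alignment; not handled here
    return [(x, y) for x, y in zip(a, b) if x != y]


def sort_key(c):
    # Criterion 1: longest words first. Criterion 2: most dictionaries first.
    return (-len(c.word), -len(c.others))


# Both rows are examples quoted in this thread.
candidates = [
    Candidate("dURAsa", "dURASa",
              ("AP", "BUR", "CAE", "CCS", "GRA", "PW", "PWG", "SHS", "VCP", "WIL", "YAT")),
    Candidate("dyAvAkzAmA", "dyAvAkzamA", ("SHS", "VCP", "WIL", "YAT")),
]

ranked = sorted(candidates, key=sort_key)
for c in ranked:
    print(c.word, c.variant, differing_letters(c.word, c.variant), len(c.others))
```

The pair returned by `differing_letters` could feed an interest weight per letter pair (criterion 3), but the percentages above are rough guesses, so no weights are hard-coded here.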

@drdhaval2785 @juhnowski what do you think? Do you understand at least a few of the wanted requirements? Thanks for reading.

juhnowski commented 9 years ago

Hello, @drdhaval2785 @gasyoun

After reading it I had the following questions:

Thank you.

drdhaval2785 commented 9 years ago

For @gasyoun 's remarks

1. Instead of fake ID, I would prefer to use the L ID

The input file sanhw1.txt doesn't have the L IDs in it, so there is no way I can generate the L ID. @funderburkjim would have to provide it in sanhw1.txt, or he could show some Python that fetches the L ID from the input word for a particular dictionary.

I am also not sure whether the L IDs are fixed. When an update is made, do they change or remain static? E.g. when I remove an unwanted word X from a dictionary, do the words coming after that entry retain their old L IDs, or are they reserialised and moved up by one? @funderburkjim may like to comment on the structure of L IDs. If they are also reserialised, they are not infallible either.

2. option to click on the L ID

This seems to be the domain of Ajax or JavaScript. I can't do it myself, because neither is my forte. Someone proficient in these things may try.

drdhaval2785 commented 9 years ago
3. When I click on 2nd column SLP1 record, bring me to the text entry, not the PDF scan. In 3rd column - open PWG,PW,MW72 or GRA. If none, open any other.

Done. Uploaded via https://github.com/drdhaval2785/drdhaval2785.github.io/commit/6016e1d924dea6d1c5d72c7f7e6113d37153fb41 . Now the 2nd column has a link to the text entry of the dictionary under consideration. The 3rd column links to PWG, PW, MW72, GRA in that order, if the word exists in those dictionaries. If not, the first dictionary text is linked. N.B. The order is extensible, i.e. you can specify the order for all dictionaries in descending order: PWG,PW,MW72,GRA,MW,AP..... onwards for as long as you want (maybe all dictionaries). @gasyoun or @funderburkjim may like to order them by rank of importance of the dictionaries.
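The fallback rule described above can be sketched roughly as follows. This is an illustration in Python, not the PHP that was actually committed; `PRIORITY` and `pick_link_dictionary` are hypothetical names.

```python
# Preferred link targets, in the order stated above; extend the list as needed.
PRIORITY = ["PWG", "PW", "MW72", "GRA"]


def pick_link_dictionary(dicts_with_word, priority=PRIORITY):
    """Return the dictionary code to link to for the 3rd column.

    dicts_with_word: dictionary codes that contain the variant, in the order
    they appear in the row.
    """
    for code in priority:
        if code in dicts_with_word:
            return code
    # None of the preferred dictionaries has the word: link the first listed one.
    return dicts_with_word[0] if dicts_with_word else None


print(pick_link_dictionary(["SHS", "VCP", "WIL", "YAT"]))
print(pick_link_dictionary(["AP", "GRA", "PW"]))
```

Ranking all dictionaries would then only mean lengthening `PRIORITY`; the function itself would not change.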

See c

drdhaval2785 commented 9 years ago
4. Sort by (ordered by hypothetical productivity after working similar AP .XLS file)

Too much hassle compared to the productivity increase. The difference is now highlighted, so you may choose not to open the file if it is only an M vs. m difference. The same holds true for word length: it is easily visible now. It is NOT worthwhile to go into this as long as we are dealing only with the 'Highest probability' cases, as we are doing now. When and if we deal with 'Medium probability' cases, this feature would be a welcome addition.

drdhaval2785 commented 9 years ago
7357 dyAvAkzAmA (MW) => dyAvAkzamA (SHS,VCP,WIL,YAT)

Now that the highlighting and linking work is done, I would not advise keeping SLP1 there, because it is now marked up with some tags.

7357 द्यावाक्षामा (MW) => द्यावाक्षमा (SHS,VCP,WIL,YAT)

would be more advisable. col4 (col6) -> col5 (col7) format to be precise.
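As a rough sketch of that target format: the function below builds the copy-paste string from a row's fields. `format_row` and its parameter names are hypothetical; the click-to-copy part would still have to be done in JavaScript on the page itself.

```python
def format_row(l_id, word, word_dicts, variant, variant_dicts):
    """Build 'L word (dicts) => variant (dicts)' from one table row."""
    return "{} {} ({}) => {} ({})".format(
        l_id,
        word,
        ",".join(word_dicts),
        variant,
        ",".join(variant_dicts),
    )


line = format_row(7357, "द्यावाक्षामा", ["MW"], "द्यावाक्षमा", ["SHS", "VCP", "WIL", "YAT"])
print(line)
```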

drdhaval2785 commented 9 years ago

In a nutshell: Point 1 - questions posed. Point 2 - as @juhnowski or @funderburkjim think fit; I have no knowledge of JavaScript. But keep https://github.com/sanskrit-lexicon/CORRECTIONS/issues/136#issuecomment-152813928 in mind. Point 3 - done. Point 4 - decided against it for now.

gasyoun commented 9 years ago

@drdhaval2785 L IDs are fixed. Ajax will be done by @juhnowski. "7357 द्यावाक्षामा (MW) => द्यावाक्षमा (SHS,VCP,WIL,YAT)": so be it.

drdhaval2785 commented 9 years ago

"L IDs are fixed." Do I take it that you are dropping your insistence?

juhnowski commented 9 years ago

I don't object

gasyoun commented 9 years ago

@drdhaval2785 what is the question? L's were invented by Jim once. Never again.

funderburkjim commented 9 years ago

Although I don't really understand the display, a question was raised regarding the stability of 'L' numbers.

For all dictionaries, L numbers rarely change. They were invented to permit precision when discussing items.

For MW, I am tempted to say that they never change. In fact, they are embedded into monier.xml. The 'base form' of the MW dictionary is the (Mysql version) of monier.xml. This is the form which gets changed when a correction is made.

For any dictionary other than MW, the base form of the digitized dictionary is x.txt, where x is the code for that dictionary. For instance, pwg.txt is the base form of the PWG dictionary. This base form is the form which is changed when a correction is made.

The L-number is NOT embedded into the base form x.txt. The L-number is generated when the xml form of the dictionary (x.xml, e.g. pwg.xml) is generated from x.txt. This xml form is the one used in the displays.

Thus, the L-number for the other dictionaries is theoretically not as stable as the L-number for MW records.

However, the only way the L-number for a given record in one of these other dictionaries would change is from a correction involving either an insertion or a deletion of a headword. This is very rare (I can't think of an instance, except for CCS last year). Just changing the spelling of a given headword would not change the L-number. The generated L-number is just the headword sequence number according to the dictionary order of headword entries.

One detail involving the relation between L-numbers and headwords: for a given spelling of a headword, there can be multiple records in a dictionary, thus multiple L-numbers. I have to take this into account when I install a correction.
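To illustrate the point about sequence numbers, here is a small sketch, assuming the L-number is simply the 1-based position of a record in dictionary order. The headword lists are hypothetical and not taken from any real x.txt; `assign_l_numbers` and `first_l_number` are illustrative names, not the actual generation code.

```python
def assign_l_numbers(headwords):
    """Number records 1, 2, 3, ... in dictionary order (assumed generation rule)."""
    return [(number, hw) for number, hw in enumerate(headwords, start=1)]


def first_l_number(records, headword):
    """Return the first L-number carrying this spelling, or None if absent."""
    for number, hw in records:
        if hw == headword:
            return number
    return None


# Hypothetical headword lists for a non-MW dictionary.
base = ["aMSa", "aMSaka", "aMSala"]
respelled = ["aMSa", "aMsaka", "aMSala"]               # spelling correction only
inserted = ["aMSa", "aMSaka", "aMSakaraRa", "aMSala"]  # one headword inserted

before = first_l_number(assign_l_numbers(base), "aMSala")
after_respell = first_l_number(assign_l_numbers(respelled), "aMSala")
after_insert = first_l_number(assign_l_numbers(inserted), "aMSala")
print(before, after_respell, after_insert)
```

Under this assumption a spelling correction leaves every later number untouched, while an insertion shifts all following records by one.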

Is there a need for a modified sanhw1 which somehow lists L-numbers?

gasyoun commented 9 years ago

Thanks for a paper on the history of Ls in the late XX century :radio: "Is there a need for a modified sanhw1 which somehow lists L-numbers?" - I guess yes, otherwise we get an issue. If our ID changes in the o vs. O files after every regeneration, then these numbers are not of much value. Something like the original L would last longer and for that reason make more sense. It would help Dhaval help me clean up the headwords.

funderburkjim commented 9 years ago
  1. Have refreshed sanhw1.txt in this repository
  2. Have made a sanhw2.txt, which adds L-numbers to the dictionary codes. A comparison of sanhw1 and sanhw2 should explain:

sanhw1:

afRin:AP,AP90,GST,MW,MW72,PD,PW,PWG,SHS,STC,VCP,WIL

sanhw2:

afRin:AP90;2,AP;2,GST;2,MW72;7,MW;8,PD;20,PW;5,PWG;6,SHS;3,STC;3,VCP;3,WIL;4

The data for a headword in sanhw1 is a comma-separated list of dictionary codes.

The data for that same headword in sanhw2 is a comma-separated list of 'pairs', with a pair consisting of DICT;L a dictionary code then a semicolon then an L-number. The L-number is the first L-number for the given headword in the given dictionary.
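A minimal sketch of reading one sanhw2 line, based only on the format described above. `parse_sanhw2_line` is a hypothetical helper; L-numbers are kept as strings, since the thread does not say whether they are always plain integers.

```python
def parse_sanhw2_line(line):
    """Split 'headword:DICT;L,DICT;L,...' into (headword, {DICT: L})."""
    headword, _, data = line.strip().partition(":")
    first_l = {}
    for pair in data.split(","):
        code, _, l_number = pair.partition(";")
        first_l[code] = l_number  # first L-number of the headword in that dictionary
    return headword, first_l


headword, first_l = parse_sanhw2_line(
    "afRin:AP90;2,AP;2,GST;2,MW72;7,MW;8,PD;20,PW;5,PWG;6,SHS;3,STC;3,VCP;3,WIL;4"
)
print(headword, first_l["MW"], sorted(first_l))
```

Sorting the keys gives back the plain dictionary list that sanhw1 holds for the same headword, which is a quick consistency check between the two files.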

This sanhw2 should suffice for changing the label (column 1 of o_vs_O display) from a sequence number to an L-number.

gasyoun commented 9 years ago

@funderburkjim thanks. Now it's up to @drdhaval2785

drdhaval2785 commented 9 years ago

When I was regenerating the files from sanhw2.txt, the following observation came up.

aMSalaH:aMsalaH-SKD;10:SKD;24
aMSApay:aMsApay-PWG;17:PWG;43
agadaH:agADaH-SKD;186:SKD;198
agriyaM:agrIyaM-SKD;301:SKD;303

There are many such intra-dictionary pairs. They escaped our attention earlier because both words belong to the same dictionary. But now each dictionary code carries an L ID, so the two sides no longer look identical and the pairs are caught.

IMHO, this is useless. They are mostly false positives (and the numbers are quite large - which may make our o_vs_O files unusable).

Currently I am not keeping these entries.

When both sides of the ':' are same dictionary, I am ignoring it for now.
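The filter could be sketched like this, assuming each intermediate line has the shape shown above: `word:variant-REFS:REFS`, where each REFS side is a comma-separated list of DICT;L pairs. That multi-dictionary shape is an assumption, as is the helper naming; the actual PHP may differ.

```python
def dictionary_codes(side):
    """Extract the set of dictionary codes from 'DICT;L,DICT;L,...'."""
    return {pair.partition(";")[0] for pair in side.split(",")}


def is_intra_dictionary(line):
    """True when both sides of the ':' in the reference part name the same dictionaries."""
    _words, _, refs = line.strip().rpartition("-")
    left, _, right = refs.partition(":")
    return dictionary_codes(left) == dictionary_codes(right)


lines = [
    "aMSalaH:aMsalaH-SKD;10:SKD;24",   # from the thread
    "aMSApay:aMsApay-PWG;17:PWG;43",   # from the thread
    # Hypothetical cross-dictionary line; the SHS and VCP L-numbers are made up.
    "dyAvAkzAmA:dyAvAkzamA-MW;7357:SHS;1,VCP;2",
]
kept = [ln for ln in lines if not is_intra_dictionary(ln)]
print(kept)
```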

@gasyoun and @funderburkjim may like to comment on the usefulness or otherwise of these entries. To me, they are useless and should be discarded.

gasyoun commented 9 years ago

"When both sides of the ':' are same dictionary, I am ignoring it for now." - I fully agree. Pure false positives. We need high-productivity methods now, ones that can help clean 5000 words out of the 435000-word list.

drdhaval2785 commented 9 years ago

@gasyoun 's much coveted L IDs are now added. As this was a substantial change, and many of the earlier references were to non L ID files, I have decided to keep both.

Non-L ID http://drdhaval2785.github.io/o_vs_O/output1/PWG.html

With L ID http://drdhaval2785.github.io/o_vs_O/output2/PWG.html

Note the folder output2 (generated from sanhw2.txt). Replace PWG with MW,AP,PW or whatever dictionary you want to examine.

Code notes -

php o_vs_O_sanhw2.php
php dictsorting_sanhw2.php

Dictionary abbreviations -

$dictionaryname=array("ACC","CAE","AE","AP90","AP","BEN","BHS","BOP","BOR","BUR","CCS","GRA","GST","IEG","INM","KRM","MCI","MD","MW72","MW","MWE","PD","PE","PGN","PUI","PWG","PW","SCH","SHS","SKD","SNP","STC","VCP","VEI","WIL","YAT");
drdhaval2785 commented 9 years ago

Requirements 1 and 3 - done. 2 - @juhnowski would do. 4 - decided against it.

gasyoun commented 9 years ago

4 - never against. All that remains is @juhnowski

funderburkjim commented 9 years ago

From the example aMSalaH:aMsalaH-SKD;10:SKD;24 (note to self, this is not part of sanhw2)

This appears to be a case related to the 'known spelling variants', as per hwnorn1. I don't know whether this observation would help in pruning the cases to consider, but thought it should be mentioned.

drdhaval2785 commented 9 years ago

The cases have been pruned. No issue about it. It was just for documentation that it was mentioned. sanhw2.txt is fine for our purpose.