drdhaval2785 commented 8 years ago

k1k2clash

This subrepository examines the possibility of comparing key1 and key2 of different dictionaries

Output

https://github.com/sanskrit-lexicon/CORRECTIONS/blob/master/k1k2/k1k2clash.txt

Standard Convention

dict:key1text:k2xml:key1text:k2xml:n:

e.g.

pw:uttaratOpacAra:uttarata/_upacAra:uttaratOpacAra:uttarata/_upacAra:n:
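For anyone who wants to post-process k1k2clash.txt, here is a minimal sketch of reading one line in the convention above. It assumes that no field contains a colon and that every line ends with a trailing colon, as in the example. `ClashRecord` and `parse_clash_line` are hypothetical names and are not taken from the code in the repository.

```python
from typing import NamedTuple


class ClashRecord(NamedTuple):
    """One line of k1k2clash.txt, with fields named after the convention."""
    dictcode: str      # dictionary code, e.g. "pw"
    key1: str          # first key1text field
    key2: str          # first k2xml field
    key1_repeat: str   # second key1text field
    key2_repeat: str   # second k2xml field
    flag: str          # final field, "n" in the examples


def parse_clash_line(line: str) -> ClashRecord:
    # Assumption: ":" never occurs inside a field.
    parts = line.rstrip("\n").split(":")
    # The trailing ":" produces an empty last element; drop it.
    if parts and parts[-1] == "":
        parts.pop()
    if len(parts) != 6:
        raise ValueError("expected 6 fields, got %d: %r" % (len(parts), line))
    return ClashRecord(*parts)


rec = parse_clash_line(
    "pw:uttaratOpacAra:uttarata/_upacAra:uttaratOpacAra:uttarata/_upacAra:n:"
)
print(rec.dictcode, rec.key1, rec.key2)
```

The strict field count is deliberate: a line that does not split into six fields raises an error instead of being silently misread.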

Code

https://github.com/sanskrit-lexicon/CORRECTIONS/tree/master/k1k2

gasyoun commented 8 years ago

Great, thanks, good job! 5.3k seems like too many. New (old) issues arise, like the splitting of 2-in-1 headwords. Oh, it's trouble coming our way. I see some broken-code false positives as well:

yat:kruqa:<key2>kruqa. (Sa) kruqati</key2>:kruqa:<key2>kruqa. (Sa) kruqati</key2>:n:
skd:rAjajakzmA:<key2>rAjajakzmA [n] puM</key2>:rAjajakzmA:<key2>rAjajakzmA [n] puM</key2>:n:
yat:trump:<key2>trump trumpati</key2>:trump:<key2>trump trumpati</key2>:n:

Others will remain as they are, since they are 2-in-1 headwords. Ideally, for our purposes, they should be split into 2 different ghost-words. So maybe a key3 for such cases, @funderburkjim? I will never search for karbU(rvU)ra, but I could search for karbUra or karvUra.

vcp:karbUra:<key2>karbU(rvU)ra</key2>:karbUra:<key2>karbU(rvU)ra</key2>:n:
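A rough sketch of how such a 2-in-1 headword might be split into two searchable forms. It rests on a guess, not a documented rule: that the parenthesized letters replace the same number of SLP1 characters immediately before the parenthesis, which fits karbU(rvU)ra giving karbUra and karvUra. `expand_alternate` is a hypothetical helper, and it handles only a bare headword token with the `<key2>` tags already stripped.

```python
import re
from typing import List

# text before "(", the alternate letters inside "(...)", and the text after ")"
_ALT = re.compile(r"^([^()]*)\(([^()]+)\)(.*)$")


def expand_alternate(headword: str) -> List[str]:
    """Return the forms implied by one parenthesized alternate.

    Assumption (not confirmed by the dictionaries' documentation):
    the letters in parentheses replace an equal number of SLP1
    characters directly before the opening parenthesis.
    """
    m = _ALT.match(headword)
    if not m:
        return [headword]          # nothing to expand
    head, alt, tail = m.groups()
    first = head + tail            # form with the parenthesis dropped
    if len(alt) > len(head):
        return [first]             # the guess cannot apply; keep one form
    second = head[: len(head) - len(alt)] + alt + tail
    return [first, second]


print(expand_alternate("karbU(rvU)ra"))
```

Because SLP1 uses one character per sound, counting characters is enough here. The same guess would not transfer directly to IAST or other multi-character schemes.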

Some are already split

gra:jaB:<key2>jaB, jamB</key2>:jaB:<key2>jaB, jamB</key2>:n:

funderburkjim commented 8 years ago

The consistency check between key1 and key2 cannot currently be done globally (for all dictionaries) in the same way it is done for MW; at least that is my intuition.

One place where this comparison would, I think, be quite productive is for PD, as discussed in #118.

There, key1 and key2 are coded independently.
A good display to start with, mentioned in #118, is pd_deva_ne_iast.txt.

For instance,

1-0002b:aicadeva:208,209:aicadeva:Ecadeva
1-0032a:aMhi:3979,4011:am3h-ri:aMhri

tells us that, for the second example,

1-0032a is the page
aMhi is key1 (in SLP1) (appears as Devanagari in print)
3979,4011 are the range of lines in pd.txt
am3h-ri is the independently coded 'key2'. This appears as IAST in print; pd.txt codes it as AS.
aMhri is key1a. This is programmatically derived by converting key2 to a key1 form.
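The five fields just described can be read with a few lines of Python. This is only a sketch: it assumes that every line of pd_deva_ne_iast.txt has exactly five colon-separated fields and that the line range is two comma-separated integers, as in the two examples. `PdDiff` and `parse_pd_diff` are hypothetical names.

```python
from typing import NamedTuple, Tuple


class PdDiff(NamedTuple):
    page: str                    # e.g. "1-0032a"
    key1: str                    # SLP1, from the Devanagari in print
    line_range: Tuple[int, int]  # first and last line in pd.txt
    key2_as: str                 # independently coded key2, in AS coding
    key1a: str                   # key1 form derived from key2


def parse_pd_diff(line: str) -> PdDiff:
    # Assumption: exactly five fields, none containing ":".
    page, key1, lines, key2_as, key1a = line.strip().split(":")
    start, end = (int(n) for n in lines.split(","))
    return PdDiff(page, key1, (start, end), key2_as, key1a)


rec = parse_pd_diff("1-0032a:aMhi:3979,4011:am3h-ri:aMhri")
# The file lists discrepancies, so key1 and key1a are expected to differ.
print(rec.page, rec.key1, rec.key1a, rec.key1 != rec.key1a)
```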

Notes:

There are 1427 items noted in the file where the Devanagari key1 and the key1a computed from the IAST version differ.
Some of these are likely due to the ambiguity of IAST for words with hiatus (contiguous vowels). The 'aicadeva' example is like this, and could be considered a false positive.
If one looks at pd.xml, one will see that key2 there is always identical to key1. This is because pd.xml is constructed programmatically from pd.txt: in the beginning there is only pd.txt, and pd.xml is an artifact constructed from it. The current artifact constructor does not make any use of the IAST in its construction of key2. If, as I think is likely, almost all of the entries in pd.txt have this IAST form in a parsable location, then the program (make_xml.py) for pd could be altered to construct key2 from this IAST form.
Incidentally, the 'aMhi' example appears to me to be a typo in key1 (i.e., key1 should be aMhri).
I hope some adventurous soul among us will find it interesting to take on the task of resolving the differences present in this list from PD.

drdhaval2785 commented 8 years ago

@funderburkjim, right now the code is not doing a generic comparison. It makes some dictionary-specific adjustments to accommodate known patterns (to decrease the false positives). The initial results are encouraging.

See AP90 cases where there was a wrong generation of key1 from key2.

ap90:aMgulI:<key2>aMgulI(rI) yaM-kaM, --yakaM</key2>:aMgulI:<key2>aMgulI(rI) yaM-kaM, --yakaM</key2>:n:
ap90:ajju:<key2>ajju(jjU) kA</key2>:ajju:<key2>ajju(jjU) kA</key2>:n:
ap90:aDipu:<key2>aDipu(pU) ruzaH</key2>:aDipu:<key2>aDipu(pU) ruzaH</key2>:n:
ap90:a:<key2>a nulepaka, --lepin</key2>:a:<key2>a nulepaka, --lepin</key2>:n:
ap90:apAMpitta:<key2>apAMpitta ºnapAt</key2>:apAMpitta:<key2>apAMpitta ºnapAt</key2>:n:

Therefore, in my opinion, the exercise is not that fruitless either.

gasyoun commented 8 years ago

@funderburkjim I agree with @drdhaval2785 and hope I'll convince @Shalu411 to find help for PD from India.

drdhaval2785 commented 8 years ago

@funderburkjim and @gasyoun, I think I have finally improved the code to the best of my abilities. Examine https://github.com/sanskrit-lexicon/CORRECTIONS/blob/master/k1k2/k1k2clash.txt and let me know whether there are still some generic false positives which can be removed by coding. Otherwise I will go ahead and create an HTML page with a webpage link and a PDF link for a better user interface.

funderburkjim commented 8 years ago

Re 'not that fruitless': You've been finding so many good ways of identifying errors of various kinds. Keep at it!

Regarding the PD list, probably some patterns are identifiable that programs can leverage to make intelligent guesses and thereby make the resolution of the rather large (1400+) number of differences feasible. If the aMhi/aMhri example is typical, this will result in numerous corrections to PD digitization. I don't think we should wait for Shalu to tackle this. But, I think the k1k2clash approach should go first, since Dhaval finds it interesting.

sanskrit-lexicon / CORRECTIONS

k1k2clash documentation #213
