Open drdhaval2785 opened 8 years ago
Great, thanks, good job! 5.3k seems too many. New (old) issues arise, like the splitting of 2-in-1 headwords. Oh, it's trouble coming our way. I see some broken code false positives as well:
yat:kruqa:<key2>kruqa. (Sa) kruqati</key2>:kruqa:<key2>kruqa. (Sa) kruqati</key2>:n:
skd:rAjajakzmA:<key2>rAjajakzmA [n] puM</key2>:rAjajakzmA:<key2>rAjajakzmA [n] puM</key2>:n:
yat:trump:<key2>trump trumpati</key2>:trump:<key2>trump trumpati</key2>:n:
Others will remain as they are, since they are 2-in-1 headwords. Ideally, for our sake, they should be split into 2 different ghost-words. So maybe key3 for such cases, @funderburkjim? I will never search for karbU(rvU)ra, but I could search for karbUra or karvUra.
vcp:karbUra:<key2>karbU(rvU)ra</key2>:karbUra:<key2>karbU(rvU)ra</key2>:n:
Some are already split
gra:jaB:<key2>jaB, jamB</key2>:jaB:<key2>jaB, jamB</key2>:n:
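The splitting proposed above could be sketched roughly as follows. This is a minimal illustration, not part of the k1k2 code: `expand_headword` is a hypothetical helper, and it assumes that the letters in parentheses replace the same number of letters at the end of the preceding stem (which fits karbU(rvU)ra and ajju(jjU) in SLP1, but is only a guess at the general rule).

```python
import re

# Hypothetical helper, not taken from the k1k2 repository.
# Matches a single token of the form stem(alt)tail, e.g. "karbU(rvU)ra".
_ALT = re.compile(r"^([^()\s,]+)\(([^()\s,]+)\)([^()\s,]*)$")


def expand_headword(key2):
    """Return the candidate spellings encoded in a 2-in-1 headword."""
    match = _ALT.match(key2)
    if match is not None:
        stem, alt, tail = match.groups()
        if len(alt) <= len(stem):
            # Assumption: the parenthesised letters replace the same number
            # of letters at the end of the stem (one SLP1 letter per sound).
            first = stem + tail
            second = stem[: len(stem) - len(alt)] + alt + tail
            return [first, second]
    if "," in key2:
        # Comma-separated variants such as "jaB, jamB".
        return [part.strip() for part in key2.split(",") if part.strip()]
    # Anything else is returned unchanged.
    return [key2]


if __name__ == "__main__":
    for headword in ("karbU(rvU)ra", "jaB, jamB", "trump"):
        print(headword, "->", expand_headword(headword))
```

Entries such as the AP90 ones, where the parenthesised form is followed by further tokens, would need dictionary-specific handling before a helper like this could be applied.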
The consistency check between key1 and key2 cannot currently be done globally (for all dictionaries) in the same way it is done for MW; at least that is my intuition.
One place where this comparison would, I think, be quite productive is for PD, as discussed in #118.
There, key1 and key2 are coded independently.
A good display to start with, mentioned in #118, is pd_deva_ne_iast.txt.
For instance,
1-0002b:aicadeva:208,209:aicadeva:Ecadeva
1-0032a:aMhi:3979,4011:am3h-ri:aMhri
tells us that, for the second example,
Notes:
@funderburkjim, right now the code is not doing a generic comparison. It is making some dictionary-specific adjustments to accommodate known patterns (to decrease the false positives). The initial results are encouraging.
See AP90 cases where there was a wrong generation of key1 from key2.
ap90:aMgulI:<key2>aMgulI(rI) yaM-kaM, --yakaM</key2>:aMgulI:<key2>aMgulI(rI) yaM-kaM, --yakaM</key2>:n:
ap90:ajju:<key2>ajju(jjU) kA</key2>:ajju:<key2>ajju(jjU) kA</key2>:n:
ap90:aDipu:<key2>aDipu(pU) ruzaH</key2>:aDipu:<key2>aDipu(pU) ruzaH</key2>:n:
ap90:a:<key2>a nulepaka, --lepin</key2>:a:<key2>a nulepaka, --lepin</key2>:n:
ap90:apAMpitta:<key2>apAMpitta ºnapAt</key2>:apAMpitta:<key2>apAMpitta ºnapAt</key2>:n:
Therefore, in my opinion, the exercise is not that fruitless either.
@funderburkjim I agree with @drdhaval2785 and hope I'll convince @Shalu411 to find help for PD from India.
@funderburkjim and @gasyoun I think I have finally improved the code to the best of my abilities. Examine https://github.com/sanskrit-lexicon/CORRECTIONS/blob/master/k1k2/k1k2clash.txt and let me know whether there are still some generic false positives which can be removed by coding. Otherwise I will go ahead and create an HTML page with a webpage link and a PDF link for a better user interface.
Re 'not that fruitless': You've been finding so many good ways of identifying errors of various kinds. Keep at it!
Regarding the PD list, probably some patterns are identifiable that programs can leverage to make intelligent guesses and thereby make the resolution of the rather large (1400+) number of differences feasible. If the aMhi/aMhri example is typical, this will result in numerous corrections to PD digitization. I don't think we should wait for Shalu to tackle this. But, I think the k1k2clash approach should go first, since Dhaval finds it interesting.
k1k2clash
This subrepository examines the possibility of comparing key1 and key2 of different dictionaries.
Output
https://github.com/sanskrit-lexicon/CORRECTIONS/blob/master/k1k2/k1k2clash.txt
Standard Convention
dict:key1text:k2xml:key1text:k2xml:n:
e.g.
pw:uttaratOpacAra:uttarata/_upacAra :uttaratOpacAra:uttarata/_upacAra :n:
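A line in this convention can be read with a small parser along the following lines. This is only a sketch: `ClashRecord`, `parse_clash_line` and `strip_key2_tags` are hypothetical names, the field names for the repeated key1/key2 pair are placeholders (the meaning of the repetition is not stated here), and it assumes that no field itself contains a colon.

```python
import re
from typing import NamedTuple


class ClashRecord(NamedTuple):
    """One line of k1k2clash.txt; field names are illustrative only."""
    dictcode: str
    key1_first: str
    key2_first: str
    key1_second: str
    key2_second: str
    flag: str


def parse_clash_line(line):
    """Split 'dict:key1text:k2xml:key1text:k2xml:n:' into its six fields."""
    fields = line.rstrip("\n").split(":")
    # A well-formed line ends with ':', so the split leaves a trailing
    # empty string as the seventh element.
    if len(fields) != 7 or fields[6] != "":
        raise ValueError("unexpected number of fields: %r" % line)
    return ClashRecord(*fields[:6])


def strip_key2_tags(text):
    """Remove the <key2>...</key2> wrapper that some dictionaries carry."""
    return re.sub(r"</?key2>", "", text)


if __name__ == "__main__":
    sample = "yat:trump:<key2>trump trumpati</key2>:trump:<key2>trump trumpati</key2>:n:"
    record = parse_clash_line(sample)
    print(record.dictcode, record.key1_first, strip_key2_tags(record.key2_first))
```

Whitespace inside the key2 fields is kept as-is, since trailing spaces (as in the pw example) may themselves be worth inspecting.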
Code
https://github.com/sanskrit-lexicon/CORRECTIONS/tree/master/k1k2