sanskrit-lexicon / CORRECTIONS

Correction history for Cologne Sanskrit Lexicon

k1k2clash documentation #213

Open drdhaval2785 opened 8 years ago

drdhaval2785 commented 8 years ago

k1k2clash

This subrepository examines the possibility of comparing key1 and key2 across the different dictionaries.

Output

https://github.com/sanskrit-lexicon/CORRECTIONS/blob/master/k1k2/k1k2clash.txt

Standard Convention

dict:key1text:k2xml:key1text:k2xml:n:

e.g.

pw:uttaratOpacAra:uttarata/_upacAra:uttaratOpacAra:uttarata/_upacAra:n:
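A minimal sketch of how one line in this convention could be read back into named fields. The function and field names are hypothetical, and the sketch assumes that no field contains a colon and that every record ends with a single trailing colon, as in the example above. The meaning of the final `n` field is not stated here, so it is kept as an opaque flag.

```python
from typing import NamedTuple


class ClashRecord(NamedTuple):
    """One line of k1k2clash.txt (field names are illustrative assumptions)."""
    dictionary: str
    key1: str
    key2: str
    key1_repeat: str
    key2_repeat: str
    flag: str


def parse_clash_line(line: str) -> ClashRecord:
    # Drop the newline and the single trailing colon that closes the record.
    body = line.rstrip("\n")
    if body.endswith(":"):
        body = body[:-1]
    # Assumption: none of the six fields contains a colon itself.
    fields = body.split(":")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    return ClashRecord(*fields)


record = parse_clash_line(
    "pw:uttaratOpacAra:uttarata/_upacAra:uttaratOpacAra:uttarata/_upacAra:n:"
)
print(record.dictionary, record.key1, record.key2)
```

Raising on an unexpected field count makes lines whose key2 happens to contain a colon visible instead of silently misparsed.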

Code

https://github.com/sanskrit-lexicon/CORRECTIONS/tree/master/k1k2

gasyoun commented 8 years ago

Great, thanks, good job! 5.3k seems too much. New (old) issues arise, like the splitting of 2-in-1 headwords. Oh, it's trouble coming our way. I see some false positives caused by broken code as well:

yat:kruqa:<key2>kruqa. (Sa) kruqati</key2>:kruqa:<key2>kruqa. (Sa) kruqati</key2>:n:
skd:rAjajakzmA:<key2>rAjajakzmA [n] puM</key2>:rAjajakzmA:<key2>rAjajakzmA [n] puM</key2>:n:
yat:trump:<key2>trump trumpati</key2>:trump:<key2>trump trumpati</key2>:n:

Others will remain as they are, since they are 2-in-1 headwords. Ideally, for our sake, they should be split into 2 different ghost-words. So maybe a key3 for such cases, @funderburkjim? I will never search for karbU(rvU)ra, but I could search for karbUra or karvUra.

vcp:karbUra:<key2>karbU(rvU)ra</key2>:karbUra:<key2>karbU(rvU)ra</key2>:n:
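A rough sketch of how such a 2-in-1 headword could be expanded into its two searchable forms. This is a heuristic of my own, not something the k1k2 code is known to do: it assumes the key is in SLP1 (one character per sound), that there is exactly one parenthesised group, and that the group replaces the same number of characters immediately before the opening parenthesis. The function name is hypothetical.

```python
import re
from typing import List

# head(alt)tail with exactly one parenthesised group, e.g. karbU(rvU)ra
_ALT = re.compile(r"^(?P<head>[^()]+)\((?P<alt>[^()]+)\)(?P<tail>[^()]*)$")


def expand_alternates(key2: str) -> List[str]:
    """Return both readings of a 2-in-1 headword, or [key2] if none is found."""
    m = _ALT.match(key2)
    if m is None:
        return [key2]
    head, alt, tail = m.group("head", "alt", "tail")
    if len(alt) > len(head):
        # The alternate cannot replace more characters than precede it.
        return [key2]
    first = head + tail
    # Assumption: the alternate replaces the last len(alt) characters of head.
    second = head[: len(head) - len(alt)] + alt + tail
    return [first, second]


print(expand_alternates("karbU(rvU)ra"))
```

This is meant only for key2 values that hold a bare headword. Entries that carry extra material after the headword would need to be cleaned first.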

Some are already split

gra:jaB:<key2>jaB, jamB</key2>:jaB:<key2>jaB, jamB</key2>:n:

funderburkjim commented 8 years ago

The consistency check between key1 and key2 cannot currently be done globally (for all dictionaries) in the same way it is done for MW; at least that is my intuition.

One place where this comparison would, I think, be quite productive is PD, as discussed in #118.

There, key1 and key2 are coded independently.
A good display to start with, mentioned in #118, is pd_deva_ne_iast.txt.

For instance,

1-0002b:aicadeva:208,209:aicadeva:Ecadeva
1-0032a:aMhi:3979,4011:am3h-ri:aMhri

tells us that, for the second example,

Notes:

drdhaval2785 commented 8 years ago

@funderburkjim, right now the code is not doing a generic comparison. It makes some dictionary-specific adjustments to accommodate known patterns (to decrease the false positives). The initial results are encouraging.

See AP90 cases where there was a wrong generation of key1 from key2.

ap90:aMgulI:<key2>aMgulI(rI) yaM-kaM, --yakaM</key2>:aMgulI:<key2>aMgulI(rI) yaM-kaM, --yakaM</key2>:n:
ap90:ajju:<key2>ajju(jjU) kA</key2>:ajju:<key2>ajju(jjU) kA</key2>:n:
ap90:aDipu:<key2>aDipu(pU) ruzaH</key2>:aDipu:<key2>aDipu(pU) ruzaH</key2>:n:
ap90:a:<key2>a nulepaka, --lepin</key2>:a:<key2>a nulepaka, --lepin</key2>:n:
ap90:apAMpitta:<key2>apAMpitta ºnapAt</key2>:apAMpitta:<key2>apAMpitta ºnapAt</key2>:n:

Therefore, in my opinion, the exercise is not that fruitless either.
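As an illustration only, the AP90 cases above share one visible pattern: key1 equals the part of key2 that comes before the first space, parenthesis or comma. A hedged sketch of a check for that pattern follows. The function name and the choice of separator characters are my assumptions, not the logic actually used in the k1k2 code.

```python
import re

_KEY2_TAG = re.compile(r"</?key2>")
# Characters assumed to mark the end of the first headword inside key2.
_SEPARATORS = re.compile(r"[\s(),]")


def looks_truncated(key1: str, key2_xml: str) -> bool:
    """True if key1 is just key2 cut at its first separator character."""
    key2 = _KEY2_TAG.sub("", key2_xml).strip()
    m = _SEPARATORS.search(key2)
    if m is None:
        # key2 is a single token, so key1 cannot be a truncated prefix of it.
        return False
    return key1 == key2[: m.start()]


print(looks_truncated("a", "<key2>a nulepaka, --lepin</key2>"))
print(looks_truncated("uttaratOpacAra", "uttarata/_upacAra"))
```

A hit only marks the entry for manual review. It does not say what the correct key1 should be.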

gasyoun commented 8 years ago

@funderburkjim I agree with @drdhaval2785 and hope I'll convince @Shalu411 to find help for PD from India.

drdhaval2785 commented 8 years ago

@funderburkjim and @gasyoun, I think I have finally improved the code to the best of my abilities. Examine https://github.com/sanskrit-lexicon/CORRECTIONS/blob/master/k1k2/k1k2clash.txt and let me know whether there are still some generic false positives that can be removed by coding. Otherwise I will go ahead and create an HTML page, with a webpage link and a PDF link, for a better user interface.

funderburkjim commented 8 years ago

Re "not that fruitless": You've been finding so many good ways of identifying errors of various kinds. Keep at it!

Regarding the PD list, probably some patterns are identifiable that programs can leverage to make intelligent guesses and thereby make the resolution of the rather large (1400+) number of differences feasible. If the aMhi/aMhri example is typical, this will result in numerous corrections to PD digitization. I don't think we should wait for Shalu to tackle this. But, I think the k1k2clash approach should go first, since Dhaval finds it interesting.