Closed drdhaval2785 closed 9 years ago
There is no search across dictionaries. When checking for spelling errors we badly need one, as I know well. I'm thinking about a .bat shell command that would always download the latest .zip files from the website, extract the needed .XML files, extract the headwords from the .XML files with my .vbee script in EmEditor, and compile them together. But the compilation part is where I do not know how to keep it traceable which headword came from which dictionary, because just the L number is not enough.
https://github.com/sanskrit-lexicon/Cologne/issues/43 might be of interest.
@funderburkjim It is time to revive this thread. Very important.
@drdhaval2785 What do you need from me? Also, where is the 'list of hiatus provided by Marcis'?
@funderburkjim - Marcis's hiatus list (Two vowels consecutively) of MW is here. To understand the problem, please see http://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/2014/web/webtc/indexcaller.php?key=titO&input=slp1&output=SktDevaUnicode
Look at key = titO. The original must have been titau, but a transcoder may have silently converted it to titO. Such places are risky for any transcoder. Since the titau -> titO error occurs in one dictionary (PW), the same transliteration error may well occur in other dictionaries too.
My request is:
It is a generic correction. Therefore, I don't want to do manual labor, which you can do with code. Your job is to create the list in point 2.
Hope I am clear this time
Dhaval is strict. Jim, hope your pythons are hungry again. I can't hold Dhaval away from inventing new tasks, I can't :o:
@Dhaval - Regarding your point # 2: Is it a matter of extracting the lines of sanhw1.txt that match the words in Marcis's hiatus list?
As background, regarding why the 'titO' errors occur: in Thomas's digitizations, Devanagari words are represented using the HK transliteration, which represents the two diphthongs as follows:
slp1  HK
O     au
E     ai
Thus, there is no way for the usual HK to represent an a-u or a-i hiatus.
I have gotten around this for one or two dictionaries by 'adding a 0' in the HK:
slp1  HK-extended
au    au0
ai    ai0
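To illustrate the ambiguity and the 'adding a 0' workaround, here is a minimal Python sketch. It is not the converter actually used for the dictionaries; the function name `hk_to_slp1_vowels` is hypothetical, and it handles only the two sequences in the tables above (`au`, `ai`, and their `0`-marked hiatus forms), leaving every other character untouched.

```python
import re

def hk_to_slp1_vowels(text: str) -> str:
    """Convert only the 'au'/'ai' sequences of an HK string to SLP1.

    Plain 'au'/'ai' are read as the diphthongs O/E.
    'au0'/'ai0' (the extended convention) mark a hiatus: the two vowels
    are kept separate and the '0' marker is dropped.
    """
    def repl(match: re.Match) -> str:
        pair, marker = match.group(1), match.group(2)
        if marker:                      # hiatus marked with a trailing 0
            return pair
        return "O" if pair == "au" else "E"

    return re.sub(r"(au|ai)(0?)", repl, text)

print(hk_to_slp1_vowels("titau"))   # titO  (hiatus lost)
print(hk_to_slp1_vowels("titau0"))  # titau (hiatus preserved)
```

Without the marker, the converter has no way to tell that the a-u in titau is a hiatus, which is exactly how titO arises.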
@funderburkjim "Is it a matter of extracting the lines of sanhw1.txt that match the words in Marcis's hiatus ?" No, it is a matter of extracting the lines of sanhw1.txt that match the wrongly transcoded forms of the words in Marcis's hiatus list.
We don't want a list of words like 'titau'. We want a list of words like 'titO', which are potentially wrong.
My suggested steps would be:
A. Remove the words having no 'au' or 'ai' from Marcis's list.
B. For the rest, apply str_replace(array("au","ai"),array("O","E"),$text), where $text is Marcis's remaining hiatus list (e.g. this converts titau -> titO).
C. Search sanhw1.txt for the resulting list of 'titO'-type words.
This should create our step 2 list
Doable?
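Steps A to C above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the program that was actually run: the function names are hypothetical, and the headword data is passed in as (headword, dictionary code) pairs because the real layout of sanhw1.txt is not described in this thread.

```python
def suspect_forms(hiatus_words):
    """Steps A and B: map each suspect spelling (titO) to its hiatus spelling (titau)."""
    suspects = {}
    for word in hiatus_words:
        if "au" not in word and "ai" not in word:
            continue                                   # step A: drop irrelevant words
        wrong = word.replace("au", "O").replace("ai", "E")  # step B
        suspects[wrong] = word
    return suspects

def find_occurrences(suspects, headword_records):
    """Step C: report suspect spellings that occur as headwords.

    headword_records: iterable of (headword, dictionary_code) pairs
    (an assumed in-memory stand-in for sanhw1.txt).
    """
    found = {}
    for headword, code in headword_records:
        if headword in suspects:
            found.setdefault(headword, set()).add(code)
    return [
        f"{suspects[wrong]}:!=:{wrong}:{','.join(sorted(codes))}"
        for wrong, codes in sorted(found.items())
    ]

# Illustrative data only, not the real lists.
marcis = ["titau", "prauga", "agni"]
records = [("titO", "BEN"), ("titO", "CAE"), ("prOga", "MW"), ("agni", "MW")]
for line in find_occurrences(suspect_forms(marcis), records):
    print(line)
# prauga:!=:prOga:MW
# titau:!=:titO:BEN,CAE
```

Each output line pairs the hiatus spelling with the suspect spelling and the dictionaries whose headwords contain the suspect form.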
Let's kill HK. Feed the pythons again :snake:
Here's result of carrying out step2: https://dl.dropboxusercontent.com/u/29859999/hiatusDropbox.zip
Here's the subset of hiatusOccur that probably corresponds to needed corrections:
aDaupAsana:!=:aDOpAsana:GST,PD,PW
uBayataukTya:!=:uBayatOkTya:PW
uBayataHprauga:!=:uBayataHprOga:PW
gaizwi:!=:gEzwi:PW,PWG
titau:!=:titO:BEN,CAE,CCS,GRA,MW72,SHS,VCP,WIL,YAT
dakziRataupavItin:!=:dakziRatOpavItin:PW
durvAsaupAKyAna:!=:durvAsOpAKyAna:PW
namaukti:!=:namOkti:CAE,CCS,GRA,PW,PWG
parauru:!=:parOru:PW,PWG
parauzRih:!=:parOzRih:PW,PWG
purauzRih:!=:purOzRih:CAE,MD,PW,PWG
prauga:!=:prOga:CAE,CCS,MW,MW72,PW,PWG,VCP
praugya:!=:prOgya:PWG
rajaudvAsA:!=:rajOdvAsA:PW
vasyaizwi:!=:vasyEzwi:CAE,CCS,GRA,PWG
Siraupanizad:!=:SirOpanizad:PW
sAtasaikA:!=:sAtasEkA:PW,PWG
hiraRyaprauga:!=:hiraRyaprOga:GRA,PW,PWG
For instance, the first line
aDaupAsana:!=:aDOpAsana:GST,PD,PW
means that the likely incorrect spelling aDOpAsana occurs in the headwords of three dictionaries (GST, PD, PW).
If you concur with this list, I'll go ahead and develop the corrections.
@funderburkjim I concur with the list. It needs correction for sure. Great job.
Great as usual.
All these hiatus corrections have been made, along with a few others noticed along the way.
For all the dictionaries involved (except PD) in this hiatus-correction binge, an SLP1 version (X_orig_utf8_slp1.txt) has been made, and things adjusted so this slp1 version is now the base form to be used for further corrections. The programs and peculiarities are discussed in the convertwork/readme.txt file prepared as part of the xml download for each dictionary.
It is planned to do SLP1 conversion for PD and at least the other major dictionaries (notably Apte) which have not been converted. For the time being, the 'minor' dictionaries will remain with their base form as HK, and be converted when our attention is drawn to them.
For reference, here are notes on the hiatus corrections that were made.
--------------------------------------------------
Misspellings for GST DONE Nov 3, 2014 (converted to slp1)
aDaupAsana:!=:aDOpAsana
--------------------------------------------------
Misspellings for PD DONE Nov 4, 2014 (NOT YET converted to SLP1)
aDaupAsana:!=:aDOpAsana
ALSO:
aDOcCizwa aDOpariguRita aDaHprOgam
Ecadeva EculA Ebuka Evuli
--------------------------------------------------
Misspellings for PW DONE Nov 7, 2014 converted to slp1
aDaupAsana:!=:aDOpAsana
uBayataukTya:!=:uBayatOkTya
uBayataHprauga:!=:uBayataHprOga
gaizwi:!=:gEzwi
dakziRataupavItin:!=:dakziRatOpavItin
durvAsaupAKyAna:!=:durvAsOpAKyAna
namaukti:!=:namOkti
parauru:!=:parOru
parauzRih:!=:parOzRih
purauzRih:!=:purOzRih
prauga:!=:prOga : Not an error. prOga is given as a 'bad spelling' of prauga
rajaudvAsA:!=:rajOdvAsA
Siraupanizad:!=:SirOpanizad
sAtasaikA:!=:sAtasEkA
hiraRyaprauga:!=:hiraRyaprOga
Done previously: vasyEzwi -> vasyaizwi , appEdIkzita -> appaidIkzita
--------------------------------------------------
Misspellings for PWG DONE Nov 6, 2014 converted to slp1
gaizwi:!=:gEzwi
namaukti:!=:namOkti
parauru:!=:parOru
parauzRih:!=:parOzRih
purauzRih:!=:purOzRih
prauga:!=:prOga
praugya:!=:prOgya
vasyaizwi:!=:vasyEzwi
sAtasaikA:!=:sAtasEkA
hiraRyaprauga:!=:hiraRyaprOga
--------------------------------------------------
Misspellings for BEN DONE converted to SLP1
titau:!=:titO
--------------------------------------------------
Misspellings for CAE DONE convert to slp1 Nov 8
titau:!=:titO
namaukti:!=:namOkti
purauzRih:!=:purOzRih
prauga:!=:prOga
vasyaizwi:!=:vasyEzwi
--------------------------------------------------
Misspellings for CCS Done Nov 9. Converted to SLP1
titau:!=:titO
namaukti:!=:namOkti
prauga:!=:prOga
vasyaizwi:!=:vasyEzwi
--------------------------------------------------
Misspellings for GRA Done Nov 10, 2014
Headwords are in AS coding of IAST.
The SLP1 forms of headwords are computed by hw2.py.
So, this program is altered for these cases.
Note, the hw1 form 'pra4uga' gets computed properly by accident (to prauga)
since a4 converts to 'a' in as_slp1.xml.
There is no question here of converting HK to SLP1, since there is no HK.
titau:!=:titO
namaukti:!=:namOkti
vasyaizwi:!=:vasyEzwi
hiraRyaprauga:!=:hiraRyaprOga
--------------------------------------------------
Misspellings for MW72 Done Nov 10. Also, convert to SLP1
titau:!=:titO
prauga:!=:prOga
--------------------------------------------------
Misspellings for SHS DONE Nov 11. Converted to SLP1
titau:!=:titO
--------------------------------------------------
Misspellings for VCP DONE (already converted to SLP1)
titau:!=:titO
prauga:!=:prOga
--------------------------------------------------
Misspellings for WIL Nov 14, 2014 done. convert to SLP1
titau:!=:titO
--------------------------------------------------
Misspellings for YAT Nov 16 Done. Converted to SLP1
titau:!=:titO
--------------------------------------------------
Misspellings for MD Nov 16 Done. Converted to SLP1
purauzRih:!=:purOzRih
--------------------------------------------------
Misspellings for MW DONE (already converted to SLP1)
prauga:!=:prOga NOT changed. 'prauga' is a separate headword.
For this prOga, the defn is w.r. for prauga. HOWEVER, the
definition IS wrongly spelled, and is corrected.
@drdhaval2785 I'll let you close this issue if you think it appropriate.
@funderburkjim Kudos!
"no way for the usual HK to represent an a-u or a-i hiatus."
I wonder if it was ever documented by Peter Scharf.
Wiki says thus-
Sanskrit text encoded in the Harvard-Kyoto convention can be unambiguously converted to Devanāgarī, with two exceptions: Harvard-Kyoto does not distinguish अइ (a followed by i, in separate syllables, i.e. in hiatus) from ऐ (the diphthong ai) or अउ (a followed by u) from औ (the diphthong au). However such a vowel hiatus extremely rarely would occur inside words. Such a hiatus most often occurs in sandhi between two words (e.g. a sandhi of a word ending in 'aH' and one beginning with 'i' or 'u'). Since in such a situation a text transliterated in Harvard-Kyoto would introduce a space between the 'a' and 'i' (or 'a' and 'u') no ambiguity would result.
-------------------
Probably an underscore could also be used; or now that we have unicode, this is the simplest solution-
If two English characters are making one Devanagari vowel (ex: ai, ou), then, ZWJ or ZWNJ character can be used to separate them into different vowels.
Example:
iMDiyainfo = इंडियैन्फ़ो
iMDiya^info = इंडियइन्फ़ो
iMDiya^^info = इंडियइन्फ़ो
Anyway, there is no point in reviving these almost "dead" transcoding schemes; Unicode is fast becoming the de facto standard now.
"no way for the usual HK to represent an a-u or a-i hiatus."
"I wonder if it was ever documented by Peter Scharf."
I do not know whether Peter Scharf has mentioned this anywhere (with respect to HK notation), but Thomas Malten certainly did, back in 1997!
Look at this-
Incidentally, @funderburkjim also mentioned having used this convention (2014), at the top of this page:
https://github.com/sanskrit-lexicon/CORRECTIONS/issues/10#issuecomment-61421396
@funderburkjim what piece of Python would you use to collect the hiatus that are left legal in MW now?
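One possible approach, offered only as a sketch and not as the method used in this project: since SLP1 writes every vowel, including the diphthongs E and O, as a single character, any two adjacent vowel characters in an SLP1 headword indicate a hiatus. The function name and the sample headwords below are hypothetical; loading the actual MW headword list is left out because its file format is not given here.

```python
import re

# The SLP1 vowel letters; each vowel is one character, so two in a row is a hiatus.
SLP1_VOWELS = "aAiIuUfFxXeEoO"
HIATUS = re.compile(f"[{SLP1_VOWELS}]{{2}}")

def headwords_with_hiatus(headwords):
    """Return the SLP1 headwords that contain two adjacent vowel characters."""
    return [hw for hw in headwords if HIATUS.search(hw)]

print(headwords_with_hiatus(["titau", "prauga", "prOga", "agni"]))
# ['titau', 'prauga']
```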
While checking errors in PWK, I came across titO. It would be a great thing if the list of hiatus provided by Marcis can be checked in all dictionaries to see whether the hiatus is lost by transliteration script. If it is lost - restore it