Hiatus - Githubissues

drdhaval2785 commented 10 years ago

While checking errors in PWK, I came across titO. It would be great if the list of hiatuses provided by Marcis could be checked against all dictionaries to see whether the hiatus has been lost by the transliteration script. If it has been lost, restore it.

gasyoun commented 10 years ago

There is no search across dictionaries. When checking for spelling errors we badly need it, that I know. I'm thinking about a .bat shell command that would always download the latest .zip files from the website, extract the needed .XML files, extract the headwords from the .XML files with my .vbee script in EmEditor, and compile them together. But the compilation part is where I do not know how to keep it traceable, so that we know which headword came from which dictionary, because the L number alone is not enough. https://github.com/sanskrit-lexicon/Cologne/issues/43 might be of interest.

drdhaval2785 commented 10 years ago

@funderburkjim It is time to revive this thread. Very important.

funderburkjim commented 10 years ago

@drdhaval2785 What do you need from me? Also, where is the 'list of hiatus provided by Marcis'?

drdhaval2785 commented 10 years ago

@funderburkjim - Marcis's hiatus list (Two vowels consecutively) of MW is here. To understand the problem, please see http://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/2014/web/webtc/indexcaller.php?key=titO&input=slp1&output=SktDevaUnicode

Look at key = titO. The original must have been titau, but running the transcoders may have silently converted it to titO. These are risky places for any transcoder. Since the error titau -> titO occurs in one dictionary (PW), there is a chance that the same transliteration error occurs in other dictionaries too.

My request is:

1. Take the list of the known hiatuses (e.g. titau).
2. Check in all dictionaries whether these have been wrongly joined (e.g. titO).
3. If wrongly joined, flag them for scrutiny here.
4. Correct whatever is confirmed as wrong (e.g. restore titau).

It is a generic correction. Therefore, I don't want to do manual labor, which you can do with code. Your job is to create the list in point 2.

Hope I am clear this time

gasyoun commented 10 years ago

Dhaval is strict. Jim, hope your pythons are hungry again. I can't hold Dhaval away from inventing new tasks, I can't :o:

funderburkjim commented 10 years ago

@Dhaval - Regarding your point # 2: Is it a matter of extracting the lines of sanhw1.txt that match the words in Marcis's hiatus list?

As background, here is why the 'titO' errors occur. In Thomas's digitizations, Thomas represents Devanagari words using the HK transliteration. HK writes the two diphthongs as two-letter sequences, where SLP1 uses a single letter:

slp1   HK
O      au
E      ai

Thus, there is no way for the usual HK to represent an a-u or a-i hiatus.

I have gotten around this for one or two dictionaries by 'adding a 0' in the HK:

slp1   HK-extended
au     au0
ai     ai0
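
To make the ambiguity concrete, here is a toy Python sketch. It is not the actual Cologne transcoder: the function name `hk_to_slp1` and the `extended` flag are made up for illustration, only the two diphthong rules are handled, and the placement of the `0` marker simply follows the table above.

```python
# Toy HK -> SLP1 converter covering ONLY the two diphthongs discussed here.
# Hypothetical helper for illustration; a real transcoder maps many more letters.

def hk_to_slp1(text: str, extended: bool = False) -> str:
    if extended:
        # Protect the marked hiatus (au0, ai0) with placeholders so that
        # the diphthong rule below cannot merge it.
        text = text.replace("au0", "\x00U").replace("ai0", "\x00I")
    # Plain HK: 'au' and 'ai' are always read as the diphthongs O and E.
    text = text.replace("au", "O").replace("ai", "E")
    # Restore the protected hiatus as two separate SLP1 vowels.
    return text.replace("\x00U", "au").replace("\x00I", "ai")


print(hk_to_slp1("titau"))                   # hiatus is lost
print(hk_to_slp1("titau0", extended=True))   # hiatus survives
```

With plain HK input the converter cannot tell a hiatus from a diphthong, which is why the marker has to be added in the source text itself.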

drdhaval2785 commented 10 years ago

@funderburkjim

    Is it a matter of extracting the lines of sanhw1.txt that match the words in Marcis's hiatus ?

No, it is a matter of extracting the lines of sanhw1.txt that match the wrongly joined forms of the words in Marcis's hiatus list.

We don't want a list of words like 'titau'. We want a list of words like 'titO', which are potentially wrong.

My suggested steps would be:

A. Remove the words having no 'au' or 'ai' from Marcis's list.
B. For the rest, apply str_replace(array("au","ai"),array("O","E"),$text), where $text is the remaining hiatus list (e.g. this converts titau -> titO).
C. Search sanhw1.txt for the resulting 'titO'-type forms.

This should create our step 2 list
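
For illustration, steps A to C might look roughly like this in Python. This is only a sketch: the function names are hypothetical, and since the layout of sanhw1.txt is not described in this thread, the headwords are assumed to be already loaded into a dict that maps each SLP1 headword to the dictionary codes containing it.

```python
# Sketch of steps A-C. Names are hypothetical; loading sanhw1.txt is left out
# because its exact format is not given here.

def hiatus_candidates(hiatus_words):
    """Steps A and B: keep words with 'au'/'ai' and build the merged spelling."""
    candidates = {}
    for word in hiatus_words:
        if "au" in word or "ai" in word:
            # Same effect as the PHP str_replace call in step B.
            merged = word.replace("au", "O").replace("ai", "E")
            candidates[merged] = word
    return candidates


def find_suspects(candidates, headwords):
    """Step C: report merged spellings that occur as headwords."""
    lines = []
    for merged, original in sorted(candidates.items()):
        if merged in headwords:
            dicts = ",".join(sorted(headwords[merged]))
            lines.append(f"{original}:!=:{merged}:{dicts}")
    return lines


# Tiny made-up sample; real input would be Marcis's list and sanhw1.txt.
sample_headwords = {"titO": ["CAE", "BEN"], "deva": ["MW"]}
for line in find_suspects(hiatus_candidates(["titau", "deva"]), sample_headwords):
    print(line)
```

Every hit is only a suspect: a merged spelling can also be a legitimate headword in its own right, so the list still needs human scrutiny.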

Doable?

gasyoun commented 10 years ago

Let's kill HK. Feed the pythons again :snake:

funderburkjim commented 10 years ago

Here's the result of carrying out step 2: https://dl.dropboxusercontent.com/u/29859999/hiatusDropbox.zip

Here's the subset of hiatusOccur that probably corresponds to needed corrections:

aDaupAsana:!=:aDOpAsana:GST,PD,PW
uBayataukTya:!=:uBayatOkTya:PW
uBayataHprauga:!=:uBayataHprOga:PW
gaizwi:!=:gEzwi:PW,PWG
titau:!=:titO:BEN,CAE,CCS,GRA,MW72,SHS,VCP,WIL,YAT
dakziRataupavItin:!=:dakziRatOpavItin:PW
durvAsaupAKyAna:!=:durvAsOpAKyAna:PW
namaukti:!=:namOkti:CAE,CCS,GRA,PW,PWG
parauru:!=:parOru:PW,PWG
parauzRih:!=:parOzRih:PW,PWG
purauzRih:!=:purOzRih:CAE,MD,PW,PWG
prauga:!=:prOga:CAE,CCS,MW,MW72,PW,PWG,VCP
praugya:!=:prOgya:PWG
rajaudvAsA:!=:rajOdvAsA:PW
vasyaizwi:!=:vasyEzwi:CAE,CCS,GRA,PWG
Siraupanizad:!=:SirOpanizad:PW
sAtasaikA:!=:sAtasEkA:PW,PWG
hiraRyaprauga:!=:hiraRyaprOga:GRA,PW,PWG

For instance, the first line

aDaupAsana:!=:aDOpAsana:GST,PD,PW

means that the likely incorrect spelling aDOpAsana occurs in the headwords of three dictionaries (GST, PD and PW), where aDaupAsana is expected.
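
If someone wants to process these lines by program, a small parser could regroup them per dictionary, which is the order in which corrections have to be applied. The sketch below assumes only the `expected:!=:found:DICT,DICT` layout shown above; the function names are hypothetical.

```python
from collections import defaultdict


def parse_report_line(line):
    """Split 'expected:!=:found:DICT1,DICT2' into its three parts."""
    expected, marker, found, dicts = line.strip().split(":")
    if marker != "!=":
        raise ValueError(f"unexpected marker in line: {line!r}")
    return expected, found, dicts.split(",")


def group_by_dictionary(lines):
    """Map each dictionary code to its (found, expected) spelling pairs."""
    per_dict = defaultdict(list)
    for line in lines:
        expected, found, dicts = parse_report_line(line)
        for code in dicts:
            per_dict[code].append((found, expected))
    return dict(per_dict)


report = [
    "aDaupAsana:!=:aDOpAsana:GST,PD,PW",
    "gaizwi:!=:gEzwi:PW,PWG",
]
for code, pairs in sorted(group_by_dictionary(report).items()):
    print(code, pairs)
```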

If you concur with this list, I'll go ahead and develop the corrections.

drdhaval2785 commented 10 years ago

@funderburkjim I concur with the list. It needs correction for sure. Great job.

gasyoun commented 10 years ago

Great as usual.

funderburkjim commented 9 years ago

All these hiatus corrections have been made, along with a few others noticed along the way.

For all the dictionaries involved (except PD) in this hiatus-correction binge, an SLP1 version (X_orig_utf8_slp1.txt) has been made, and things adjusted so this slp1 version is now the base form to be used for further corrections. The programs and peculiarities are discussed in the convertwork/readme.txt file prepared as part of the xml download for each dictionary.

It is planned to do SLP1 conversion for PD and at least the other major dictionaries (notably Apte) which have not been converted. For the time being, the 'minor' dictionaries will remain with their base form as HK, and be converted when our attention is drawn to them.

For reference, here are notes on the hiatus corrections that were made.

--------------------------------------------------
Misspellings for GST DONE Nov 3, 2014 (converted to slp1)
aDaupAsana:!=:aDOpAsana

--------------------------------------------------
Misspellings for PD  DONE Nov 4, 2014 (NOT YET converted to SLP1)
aDaupAsana:!=:aDOpAsana
ALSO:
aDOcCizwa aDOpariguRita aDaHprOgam
Ecadeva EculA Ebuka Evuli

--------------------------------------------------
Misspellings for PW  DONE Nov 7, 2014  converted to slp1
aDaupAsana:!=:aDOpAsana
uBayataukTya:!=:uBayatOkTya
uBayataHprauga:!=:uBayataHprOga
gaizwi:!=:gEzwi
dakziRataupavItin:!=:dakziRatOpavItin
durvAsaupAKyAna:!=:durvAsOpAKyAna
namaukti:!=:namOkti
parauru:!=:parOru
parauzRih:!=:parOzRih
purauzRih:!=:purOzRih
prauga:!=:prOga  : Not an error. prOga is given as a 'bad spelling' of prauga
rajaudvAsA:!=:rajOdvAsA
Siraupanizad:!=:SirOpanizad
sAtasaikA:!=:sAtasEkA
hiraRyaprauga:!=:hiraRyaprOga

Done previously: vasyEzwi -> vasyaizwi , appEdIkzita -> appaidIkzita

--------------------------------------------------
Misspellings for PWG  DONE Nov 6, 2014 converted to slp1
gaizwi:!=:gEzwi 
namaukti:!=:namOkti
parauru:!=:parOru
parauzRih:!=:parOzRih
purauzRih:!=:purOzRih
prauga:!=:prOga
praugya:!=:prOgya
vasyaizwi:!=:vasyEzwi
sAtasaikA:!=:sAtasEkA
hiraRyaprauga:!=:hiraRyaprOga

--------------------------------------------------
Misspellings for BEN  DONE  converted to SLP1 
titau:!=:titO

--------------------------------------------------
Misspellings for CAE DONE convert to slp1 Nov 8
titau:!=:titO
namaukti:!=:namOkti
purauzRih:!=:purOzRih
prauga:!=:prOga
vasyaizwi:!=:vasyEzwi

--------------------------------------------------
Misspellings for CCS  Done Nov 9. Converted to SLP1
titau:!=:titO
namaukti:!=:namOkti
prauga:!=:prOga
vasyaizwi:!=:vasyEzwi

--------------------------------------------------
Misspellings for GRA Done Nov 10, 2014
   Headwords are in AS coding of IAST.
   The SLP1 forms of headwords are computed by hw2.py.
   So, this program is altered for these cases.
   Note, the hw1 form 'pra4uga' gets computed properly by accident (to prauga)
   since a4 converts to 'a' in as_slp1.xml.

   There is no question here of converting HK to SLP1, since there is no HK.

titau:!=:titO
namaukti:!=:namOkti
vasyaizwi:!=:vasyEzwi
hiraRyaprauga:!=:hiraRyaprOga

--------------------------------------------------
Misspellings for MW72  Done Nov 10. Also, convert to SLP1
titau:!=:titO
prauga:!=:prOga

--------------------------------------------------
Misspellings for SHS  DONE Nov 11. Converted to SLP1
titau:!=:titO

--------------------------------------------------
Misspellings for VCP  DONE (already converted to SLP1)
titau:!=:titO
prauga:!=:prOga

--------------------------------------------------
Misspellings for WIL  Nov 14, 2014 done. convert to SLP1
titau:!=:titO

--------------------------------------------------
Misspellings for YAT  Nov 16 Done. Converted to SLP1
titau:!=:titO

--------------------------------------------------
Misspellings for MD  Nov 16 Done. Converted to SLP1
purauzRih:!=:purOzRih

--------------------------------------------------
Misspellings for MW DONE  (already converted to SLP1)
prauga:!=:prOga   NOT changed.  'prauga' is a separate headword.
   For this prOga, the defn is w.r. for prauga.  HOWEVER, the 
   definition IS wrongly spelled, and is corrected.

@drdhaval2785 I'll let you close this issue if you think it appropriate.

drdhaval2785 commented 9 years ago

@funderburkjim Kudos!

gasyoun commented 3 years ago

    no way for the usual HK to represent an a-u or a-i hiatus.

I wonder if it was ever documented by Peter Scharf.

Andhrabharati commented 3 years ago

Wiki says thus-

Harvard-Kyoto > Conversion to Devanagari

Sanskrit text encoded in the Harvard-Kyoto convention can be unambiguously converted to Devanāgarī, with two exceptions: Harvard-Kyoto does not distinguish अइ (a followed by i, in separate syllables, i.e. in hiatus) from ऐ (the diphthong ai) or अउ (a followed by u) from औ (the diphthong au). However such a vowel hiatus extremely rarely would occur inside words. Such a hiatus most often occurs in sandhi between two words (e.g. a sandhi of a word ending in 'aH' and one beginning with 'i' or 'u'). Since in such a situation a text transliterated in Harvard-Kyoto would introduce a space between the 'a' and 'i' (or 'a' and 'u') no ambiguity would result.

Probably an underscore could also be used; or, now that we have Unicode, this is the simplest solution:

If two English characters are making one Devanagari vowel (ex: ai, ou), then, ZWJ or ZWNJ character can be used to separate them into different vowels.

Example: iMDiyainfo = इंडियैन्फ़ो iMDiya^info = इंडिय‍इन्फ़ो iMDiya^^info = इंडिय‌इन्फ़ो

Andhrabharati commented 3 years ago

Anyway, there is no point in reviving these almost "dead" transcoding schemes; Unicode is quickly becoming the de facto standard now.

Andhrabharati commented 3 years ago

    no way for the usual HK to represent an a-u or a-i hiatus.

    I wonder if it was ever documented by Peter Scharf.

I do not know whether Peter Scharf has mentioned this anywhere (with respect to HK notation), but Thomas Malten certainly did, way back in 1997!

Look at this-

Incidentally, @funderburkjim also mentioned having used this convention (2014) near the top of this very page:

https://github.com/sanskrit-lexicon/CORRECTIONS/issues/10#issuecomment-61421396

gasyoun commented 3 years ago

@funderburkjim what piece of Python would you use to collect the hiatuses that legitimately remain in MW now?
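
One possible approach, as a minimal sketch and not anything taken from the Cologne code: since SLP1 writes every vowel, including the diphthongs, as a single letter, any two adjacent vowel letters in an SLP1 headword form a hiatus. The function name is hypothetical, and the headwords are assumed to be plain SLP1 strings without accent marks, already extracted from MW.

```python
import re

# SLP1 vowel letters; E and O are the diphthongs ai and au.
SLP1_VOWELS = "aAiIuUfFxXeEoO"
HIATUS = re.compile(f"[{SLP1_VOWELS}]{{2}}")


def headwords_with_hiatus(headwords):
    """Return the headwords that contain two adjacent SLP1 vowels."""
    return [hw for hw in headwords if HIATUS.search(hw)]


# Made-up sample; real input would be the MW headword list.
print(headwords_with_hiatus(["titau", "titO", "prauga", "deva"]))
```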

sanskrit-lexicon / CORRECTIONS

Hiatus #10
