sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

Find RegEx Cases /([a-zA-Z])\1{2}/ #48

Closed gasyoun closed 10 years ago

gasyoun commented 10 years ago

Today I saw mukundabhaṭṭṭīya and wondered whether we can check for three identical letters in a row. Per http://stackoverflow.com/questions/21437568/regex-to-find-3-instances-of-letter-in-a-row-php, the pattern at http://regex101.com/r/kG2xL1/1 can do it, but not for letters with diacritics: those have to be listed explicitly, and only then does it work. So (\w|ṭ)\1{2} as a test, and in full, non-test form (\w|ā|ī|ū|ṛ|ṝ|ḷ|ṅ|ñ|ṭ|ḍ|ṇ|ś|ṣ|ḥ|ṁ|ṃ|Ā|Ī|Ū|Ṛ|Ṝ|Ḻ|Ṅ|Ñ|Ṭ|Ḍ|Ṇ|Ś|Ṣ|Ḥ|Ṁ)\1{2}, will help find the mukundabhaṭṭṭīya case.

MW: mukundabhaṭṭṭīya, prapannnāmṛta, mātṛdatttīya, sahacarabhinnnatā

I have the Regex 5.5 library turned on in my Excel, but could not reproduce all the magic described at http://stackoverflow.com/questions/22542834/how-to-use-regular-expressions-regex-in-microsoft-excel-both-in-cell-and-loops, so I'm asking someone to test. Good idea, bad idea? vāhitttha in Apte also seems fishy.

funderburkjim commented 10 years ago

Certainly a good idea to find letter triples. Here's a run on mw.xml, where key1 is in SLP1. It finds two entries.

Note: for mw.xml, the key is identified as the contents of the key1 element <key1>xxx</key1>.
Here's a grep which searches for any character (.) followed by two more copies of itself, occurring after <key1> but
before the '<' of </key1>:

 grep -E "<key1>[^<]*(.)\1\1" mw.xml
<H4><h><hc3>100</hc3><key1>mukundaBawwwIya</key1><hc1>3</hc1><key2>mukunda--Baw<sr1/>wwIya</key2></h><body> <lex>n.</lex>  <c>N._of_<ab>wk.</ab></c>  </body><tail><MW>104653</MW> <pc>819,2</pc> <L>164782</L></tail></H4>
<H4><h><hc3>100</hc3><key1>SiNgaBawwwIya</key1><hc1>3</hc1><key2>SiNga--Baw<sr1/>wwIya</key2></h><body> <lex>n.</lex>  <c>his_<ab>wk.</ab></c>  </body><tail><MW>136337</MW> <pc>1071,1</pc> <L>216821</L></tail></H4>
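
The same key1-restricted search can be sketched in Python. This is only an illustration under the assumption that each headword sits in a single `<key1>...</key1>` element on one line, as in the two records above; the names `KEY1_TRIPLE` and `key1_triples` are made up for this example.

```python
import re

# Assumption: the headword is the text of <key1>...</key1> on a single line.
# Group 1 is the whole headword, group 2 the character that is tripled.
KEY1_TRIPLE = re.compile(r"<key1>([^<]*?([^<])\2\2[^<]*)</key1>")

def key1_triples(lines):
    """Yield (headword, tripled_run) for each line whose key1 contains a triple."""
    for line in lines:
        m = KEY1_TRIPLE.search(line)
        if m:
            yield m.group(1), m.group(2) * 3

# Two shortened records: the first has 'www' in key1, the second does not.
sample = [
    "<H4><h><key1>mukundaBawwwIya</key1><key2>mukunda--Baw<sr1/>wwIya</key2></h></H4>",
    "<H4><h><key1>mukundaBawwIya</key1><key2>mukunda--BawwIya</key2></h></H4>",
]
hits = list(key1_triples(sample))
print(hits)
```

Using `[^<]` instead of `.` keeps the match from running past the closing tag, which is the same job the `[^<]*` does in the grep above.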

But I think you're asking how this would work with non-ASCII characters (such as the letters with diacritics in IAST).

It probably depends on the program which is doing the regex matching.

I created a testin.txt text file (saved as utf-8 encoding) from your four words, with two additional lines (a near-miss with only ṭṭt, and a blank line):

mukundabhaṭṭṭīya
mukundabhaṭṭtīya

prapannnāmṛt
mātṛdatttīya
sahacarabhinnnatā

Then the grep command:

grep -E "(.)\1\1" testin.txt > temp

And, here is temp - it correctly picked out the words with triple letters, whether the tripled letter had diacritics or not:

mukundabhaṭṭṭīya
prapannnāmṛt
mātṛdatttīya
sahacarabhinnnatā

A Python script finds the expressions properly also:

python tripletest.py testin.txt testout.txt
6 lines read from testin.txt
4 lines with triples written to testout.txt

and here is testout.txt

Found triple 'ṭṭṭ' in line mukundabhaṭṭṭīya
Found triple 'nnn' in line prapannnāmṛt
Found triple 'ttt' in line mātṛdatttīya
Found triple 'nnn' in line sahacarabhinnnatā

And here is the Python script (I knew you would want to see it):

""" tripletest.py
    ejf
    Sep 10, 2014
    Test to see if python correctly identifies regexs in
    text with 'extended ascii', assumed coded as utf-8
    Usage: python tripletest.py <inputfile> <outputfile>
"""
import re, sys
import codecs   # used to open files as utf-8

def triplefind(filein, fileout):
    f = codecs.open(filein, encoding='utf-8', mode='r')
    fout = codecs.open(fileout, 'w', 'utf-8')
    n = 0   # number of lines read
    n1 = 0  # number of lines with a triple
    for line in f:
        line = line.rstrip()
        n = n + 1
        # any character followed by two more copies of itself
        m = re.search(r'(.)\1\1', line)
        if m:
            match = m.group(0)  # the whole match, e.g. 'nnn'
            # construct output
            out = "Found triple '%s' in line %s\n" % (match, line)
            fout.write(out)
            n1 = n1 + 1
    f.close()
    fout.close()
    # single-argument print() works in both Python 2 and 3
    print("%s lines read from %s" % (n, filein))
    print("%s lines with triples written to %s" % (n1, fileout))

if __name__ == "__main__":
    filein = sys.argv[1]
    fileout = sys.argv[2]
    triplefind(filein, fileout)
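
One caveat, offered as an assumption about the input and not something observed in these files: `(.)\1\1` compares code points, so it only sees ṭṭṭ as a triple when each ṭ is stored as the single precomposed character U+1E6D. If a file stored it as `t` plus the combining dot below (U+0323), the pattern would miss it. A minimal sketch that normalizes to NFC first (the helper name `find_triple` is hypothetical):

```python
import re
import unicodedata

TRIPLE = re.compile(r"(.)\1\1")

def find_triple(word):
    """Return the tripled run in word, or None; normalizes to NFC first."""
    m = TRIPLE.search(unicodedata.normalize("NFC", word))
    return m.group(0) if m else None

# The word with each ṭ as the single code point U+1E6D (precomposed form).
composed = "mukundabha\u1e6d\u1e6d\u1e6d\u012bya"
# The same word decomposed: each ṭ becomes 't' followed by U+0323.
decomposed = unicodedata.normalize("NFD", composed)

print(TRIPLE.search(decomposed))   # the raw decomposed text has no code-point triple
print(find_triple(decomposed))     # after NFC the triple is found again
```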

Incidentally, I do not know how to handle UTF-8 properly in PHP. If anyone in this group is comfortable with UTF-8 in PHP, I've got an open question regarding one of the displays (STC) that I'd like to have you look at.

Also, as mentioned elsewhere, I don't know Excel.

gasyoun commented 10 years ago

Great, love the code samples. Shalu, what shall we do with the triples?

Shalu411 commented 10 years ago

Namaste. Here are the corrections:

mukundabhaṭṭṭīya >> mukundabhaṭṭīya
prapannnāmṛta >> prapannāmṛta
mātṛdatttīya >> mātṛdattīya
sahacarabhinnnatā >> sahacarabhinnatā

vāhitttha: not sure right now, needs checking. Three 't's in a row are possible here, because "t" represents त and "th" represents थ; so "ttth" looks like three तs, but in reality it is त्, त्, थ्. So it is a possible combination in संस्कृतम्.
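
The point about "ttth" suggests sorting hits into two piles before anyone reviews them. Here is a rough sketch, assuming IAST input; the function name `classify`, the labels and the list of consonants that can take an aspirating "h" are all choices made for this illustration. A "possible-cluster" result still needs a human check, it only means the triple may be a consonant doubled before its aspirate.

```python
import re
import unicodedata

TRIPLE = re.compile(r"(.)\1\1")
# IAST consonants that form an aspirate when followed by 'h':
# k g c j ṭ (U+1E6D) ḍ (U+1E0D) t d p b
ASPIRABLE = set("kgcj\u1e6d\u1e0dtdpb")

def classify(word):
    """Return None (no triple), 'possible-cluster', or 'suspect'."""
    word = unicodedata.normalize("NFC", word)  # compare precomposed letters
    m = TRIPLE.search(word)
    if not m:
        return None
    following = word[m.end():m.end() + 1]      # character right after the triple
    if m.group(1) in ASPIRABLE and following == "h":
        return "possible-cluster"              # e.g. t + t + th
    return "suspect"

for w in ["vāhitttha", "mukundabhaṭṭṭīya", "vāhittha"]:
    print(w, classify(w))
```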

gasyoun commented 10 years ago

I bow to you. Now it's Jim's turn.

funderburkjim commented 10 years ago

mukundaBawwwIya and SiNgaBawwwIya corrected in MW to have just 'ww'.

prapannnāmṛta, mātṛdatttīya and sahacarabhinnnatā had already been corrected in MW.

vāhitttha has already been corrected to vāhittha (and checked) in MW.

Sometime, we probably should check all the dictionaries for this type of error.
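
A minimal sketch of what such a sweep could look like, assuming each dictionary is a UTF-8 text or XML file. The function name `scan_files` and the second file name in the usage comment are hypothetical. Unlike the plain `(.)\1\1`, this pattern only counts letters, so runs of spaces, digits or dashes in the markup are not reported.

```python
import re
from pathlib import Path

# [^\W\d_] matches any Unicode letter (word characters minus digits and '_').
LETTER_TRIPLE = re.compile(r"([^\W\d_])\1\1")

def scan_files(paths):
    """Yield (file name, line number, tripled run) for every letter triple."""
    for path in map(Path, paths):
        with path.open(encoding="utf-8") as f:
            for lineno, line in enumerate(f, start=1):
                for m in LETTER_TRIPLE.finditer(line):
                    yield path.name, lineno, m.group(0)

# Usage sketch; "other_dictionary.xml" is a placeholder name:
# for hit in scan_files(["mw.xml", "other_dictionary.xml"]):
#     print(hit)
```

The hits would still need review by hand, since some triples are legitimate.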

gasyoun commented 10 years ago

As per https://github.com/sanskrit-lexicon/CORRECTIONS/edit/master/correctionform.txt, line 135:

Q.: status = Corrected Oct 2, 2014. 4 other 'sss' instances. Also errors?
A.: No, it is totally OK in German to have 3 letters of the same type, because they occur in compound words: Blassschrift, Genussstätte, Rossschweif, Pressstein. But even then there can be errors, so we want to check all instances of 3 letters in a row (like https://github.com/sanskrit-lexicon/Cologne/issues/48).
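
Since some triples are legitimate, one way to keep the check useful is to compare hits against a list of words already confirmed as correct. A small sketch under that assumption; `ALLOWED` holds only the four German compounds named above, and "Wassser" is an invented typo used for the demonstration.

```python
import re

TRIPLE = re.compile(r"(.)\1\1")
# Words already confirmed as correct; a real list would grow over time.
ALLOWED = {"Blassschrift", "Genussstätte", "Rossschweif", "Pressstein"}

def suspicious(words, allowed=ALLOWED):
    """Return the words that contain a triple and are not on the allowed list."""
    return [w for w in words if TRIPLE.search(w) and w not in allowed]

# "Wassser" is a made-up misspelling of "Wasser" for illustration.
flagged = suspicious(["Rossschweif", "Wassser", "Wasser", "Pressstein"])
print(flagged)
```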