Certainly a good idea to find letter triples. Here's a run on mw.xml, where key1 is in SLP1 transliteration. It finds two entries.
Note: for mw.xml, the key is the contents of the key1 element, <key1>xxx</key1>.
Here's a grep that searches for any character (.) followed by two more copies of itself, occurring after <key1> but
before the '<' of </key1>:
grep -E "<key1>[^<]*(.)\1\1" mw.xml
<H4><h><hc3>100</hc3><key1>mukundaBawwwIya</key1><hc1>3</hc1><key2>mukunda--Baw<sr1/>wwIya</key2></h><body> <lex>n.</lex> <c>N._of_<ab>wk.</ab></c> </body><tail><MW>104653</MW> <pc>819,2</pc> <L>164782</L></tail></H4>
<H4><h><hc3>100</hc3><key1>SiNgaBawwwIya</key1><hc1>3</hc1><key2>SiNga--Baw<sr1/>wwIya</key2></h><body> <lex>n.</lex> <c>his_<ab>wk.</ab></c> </body><tail><MW>136337</MW> <pc>1071,1</pc> <L>216821</L></tail></H4>
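The same key1-only search can be sketched in Python. This is only an illustration, not the script used for the run above: the names KEY1, TRIPLE and key1_triples are made up here, and the two sample lines are abbreviated versions of an mw.xml entry (the second has the triple removed).

```python
import re

# Contents of the key1 element, e.g. <key1>xxx</key1>
KEY1 = re.compile(r"<key1>([^<]*)</key1>")
# Any character followed by two more copies of itself
TRIPLE = re.compile(r"(.)\1\1")

def key1_triples(lines):
    """Yield (key1, tripled_text) for each line whose key1 contains a triple."""
    for line in lines:
        k = KEY1.search(line)
        if not k:
            continue  # no key1 element on this line
        t = TRIPLE.search(k.group(1))
        if t:
            yield k.group(1), t.group(0)

# Abbreviated sample lines (assumed shape, shortened from an mw.xml entry)
sample = [
    "<H4><h><key1>mukundaBawwwIya</key1><key2>mukunda--Baw<sr1/>wwIya</key2></h></H4>",
    "<H4><h><key1>mukundaBawwIya</key1><key2>mukunda--BawwIya</key2></h></H4>",
]

for key, triple in key1_triples(sample):
    print(key, triple)
```

Restricting the search to the captured key1 text keeps triples elsewhere in the entry (body, key2) from being reported.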
But I think you're asking how this would work with non-ASCII ("extended ascii") characters, such as the letters with diacritics in IAST.
It probably depends on the program which is doing the regex matching.
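To make "it depends on the program" concrete, here is a small Python 3 sketch. It assumes the text is UTF-8 and compares three ways of handing the same word to the regex: as decoded text, as raw bytes, and as decomposed (NFD) text, where ṭ is stored as t plus a combining dot below. The variable names are just for illustration.

```python
import re
import unicodedata

# mukundabhaṭṭṭīya, written with escapes so the precomposed letters are explicit
word = "mukundabha\u1e6d\u1e6d\u1e6d\u012bya"

# Text pattern: '.' matches one code point, so precomposed ṭ counts as one character.
as_text = re.search(r"(.)\1\1", word)

# Bytes pattern: '.' matches one byte. ṭ is three different bytes in UTF-8,
# so no single byte occurs three times in a row.
as_bytes = re.search(rb"(.)\1\1", word.encode("utf-8"))

# Decomposed text: each ṭ becomes 't' + U+0323, so the t's are no longer adjacent.
decomposed = unicodedata.normalize("NFD", word)
as_nfd = re.search(r"(.)\1\1", decomposed)

# Recomposing with NFC before matching restores the triple.
as_nfc = re.search(r"(.)\1\1", unicodedata.normalize("NFC", decomposed))

print(bool(as_text), bool(as_bytes), bool(as_nfd), bool(as_nfc))
```

So a tool that works on bytes, or a file saved in decomposed form, could miss a tripled letter with a diacritic even though the pattern itself is fine.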
I created a testin.txt text file (saved as utf-8 encoding) from your four words, with two additional lines:
mukundabhaṭṭṭīya
mukundabhaṭṭtīya
prapannnāmṛt
mātṛdatttīya
sahacarabhinnnatā
Then the grep command:
grep -E "(.)\1\1" testin.txt > temp
And, here is temp - it correctly picked out the words with triple letters, whether the tripled letter had diacritics or not:
mukundabhaṭṭṭīya
prapannnāmṛt
mātṛdatttīya
sahacarabhinnnatā
A Python script finds the expressions properly also:
python tripletest.py testin.txt testout.txt
6 lines read from testin.txt
4 lines with triples written to testout.txt
and here is testout.txt
Found triple 'ṭṭṭ' in line mukundabhaṭṭṭīya
Found triple 'nnn' in line prapannnāmṛt
Found triple 'ttt' in line mātṛdatttīya
Found triple 'nnn' in line sahacarabhinnnatā
And here is the Python script (I knew you would want to see it):
""" tripletest.py
 ejf
 Sep 10, 2014
 Test to see if Python correctly identifies regexes in
 text with 'extended ascii', assumed coded as utf-8
 Usage: python tripletest.py <inputfile> <outputfile>
"""
import re, sys
import codecs  # used to open files as utf-8

def triplefind(filein, fileout):
    f = codecs.open(filein, encoding='utf-8', mode='r')
    fout = codecs.open(fileout, 'w', 'utf-8')
    n = 0   # number of lines read
    n1 = 0  # number of lines with a triple
    for line in f:
        line = line.rstrip()
        n = n + 1
        # (.)\1\1 : any character followed by two more copies of itself
        m = re.search(r'(.)\1\1', line)
        if m:
            match = m.group(0)  # the whole matched text
            # construct output
            out = "Found triple '%s' in line %s\n" % (match, line)
            fout.write(out)
            n1 = n1 + 1
    f.close()
    fout.close()
    # print(...) with a single argument works in both Python 2 and Python 3
    print("%s lines read from %s" % (n, filein))
    print("%s lines with triples written to %s" % (n1, fileout))

if __name__ == "__main__":
    filein = sys.argv[1]
    fileout = sys.argv[2]
    triplefind(filein, fileout)
Incidentally, I do not know how to handle UTF-8 properly in PHP. If anyone in this group is comfortable with UTF-8 in PHP, I've got an open question regarding one of the displays (STC) that I'd like to have you look at.
Also, as mentioned elsewhere, I don't know Excel.
Great, love the code samples. Shalu, what shall we do with the triples?
Namaste
Here are the corrections--
mukundabhaṭṭṭīya >> mukundabhaṭṭīya
prapannnāmṛta >> prapannāmṛta
mātṛdatttīya >> mātṛdattīya
sahacarabhinnnatā >> sahacarabhinnatā
vāhitttha: not sure right now, needs checking. Three 't's in a row are possible here, because "t" represents त and "th" represents थ. So "ttth" looks like three तs, but in reality it is त्, त्, थ्, which is a possible combination in Sanskrit (संस्कृतम्).
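The point about "ttth" suggests separating definite errors from cases that need a human look. A rough Python sketch of that idea follows, assuming IAST text in precomposed (NFC) form. The ASPIRABLE list and the classify function are my own illustration, not an existing tool, and "check" only means "could be C + C + Ch", not that the spelling is right.

```python
import re

# IAST consonants that form an aspirate digraph with a following 'h'
# (kh, gh, ch, jh, ṭh, ḍh, th, dh, ph, bh).
ASPIRABLE = "kgcj\u1e6d\u1e0dtdpb"
TRIPLE = re.compile(r"(.)\1\1")

def classify(word):
    """Return 'ok', 'check' or 'error' for the first triple found in word."""
    m = TRIPLE.search(word)
    if m is None:
        return "ok"
    letter = m.group(1)
    # Character right after the triple ('' at end of word)
    next_char = word[m.end():m.end() + 1]
    if letter in ASPIRABLE and next_char == "h":
        return "check"  # third letter may belong to an aspirate, e.g. t + t + th
    return "error"

for w in ["vāhitttha", "mātṛdatttīya", "mukundabhaṭṭīya"]:
    print(w, classify(w))
```

In SLP1 this special case should not arise, since each aspirate is a single letter there (th is T).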
I bow to you. Now it's Jim's turn.
mukundaBawwwIya and SiNgaBawwwIya corrected in MW to have just 'ww'.
prapannnāmṛta , prapannnāmṛta , prapannnāmṛta had already been corrected in MW.
vāhitttha has already been corrected to vāhittha (and checked) in MW.
Sometime, we probably should check all the dictionaries for this type of error.
As per https://github.com/sanskrit-lexicon/CORRECTIONS/edit/master/correctionform.txt, line 135:
Q.: status = Corrected Oct 2, 2014. 4 other 'sss' instances. Also errors?
A.: No, it is perfectly fine in German to have three identical letters in a row, because they occur in compound words: Blassschrift, Genussstätte, Rossschweif, Pressstein. But even then there can be errors, so we want to check all instances of three identical letters in a row (like https://github.com/sanskrit-lexicon/Cologne/issues/48).
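One way to check every triple while not re-reporting the legitimate German compounds is an allow-list. A minimal sketch, assuming the words have already been extracted one per item; KNOWN_GOOD holds only the four compounds named above, and "Wassser" is a made-up misspelling used purely as a test value.

```python
import re

TRIPLE = re.compile(r"(.)\1\1")

# Compounds already confirmed as correct (from the answer above)
KNOWN_GOOD = {"Blassschrift", "Genussstätte", "Rossschweif", "Pressstein"}

def suspicious(words, allow=KNOWN_GOOD):
    """Return words containing a tripled letter that are not on the allow-list."""
    return [w for w in words if TRIPLE.search(w) and w not in allow]

# 'Wassser' is a hypothetical typo, not taken from any dictionary
print(suspicious(["Blassschrift", "Rossschweif", "Wassser", "Wasser"]))
```

Each newly confirmed word can be added to the allow-list, so later runs only show triples nobody has looked at yet.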
Today I saw mukundabhaṭṭṭīya and wondered whether we can check for three identical letters in a row. Sure: as http://stackoverflow.com/questions/21437568/regex-to-find-3-instances-of-letter-in-a-row-php states, http://regex101.com/r/kG2xL1/1 can do it. It can, but not for diacritics: one has to list all of them, and only then will it work. So
(\w|ṭ)\1{2}
and in full, non-test form
(\w|ā|ī|ū|ṛ|ṝ|ḷ|ṅ|ñ|ṭ|ḍ|ṇ|ś|ṣ|ḥ|ṁ|ṃ|Ā|Ī|Ū|Ṛ|Ṝ|Ḻ|Ṅ|Ñ|Ṭ|Ḍ|Ṇ|Ś|Ṣ|Ḥ|Ṁ)\1{2}
will help find the mukundabhaṭṭṭīya case.
MW: mukundabhaṭṭṭīya prapannnāmṛta mātṛdatttīya sahacarabhinnnatā
I've got the Regex 5.5 library turned on in my Excel, but could not reproduce all the magic from http://stackoverflow.com/questions/22542834/how-to-use-regular-expressions-regex-in-microsoft-excel-both-in-cell-and-loops there, so I'm asking you to test. Good idea, bad idea?
vāhitttha in Apte seems fishy.
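Whether the diacritics really have to be listed depends on the regex engine. As one data point, in Python 3 a text (str) pattern treats \w as any Unicode word character, so the short form (\w)\1{2} already finds the ṭṭṭ case. Other engines may need a Unicode flag or the explicit list; I have not checked Excel. The sketch below assumes precomposed letters and normalizes to NFC to be safe; the name find_triples is just for illustration.

```python
import re
import unicodedata

# In Python 3, \w in a str pattern matches Unicode letters such as ṭ by default.
TRIPLE = re.compile(r"(\w)\1{2}")

words = [
    "mukundabhaṭṭṭīya",
    "prapannnāmṛta",
    "mātṛdatttīya",
    "sahacarabhinnnatā",
    "mukundabhaṭṭīya",  # corrected form, should not be reported
]

def find_triples(words):
    """Return (word, tripled_text) pairs for words containing a tripled letter."""
    hits = []
    for w in words:
        m = TRIPLE.search(unicodedata.normalize("NFC", w))
        if m:
            hits.append((w, m.group(0)))
    return hits

for w, t in find_triples(words):
    print(w, t)
```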