Closed funderburkjim closed 2 years ago
Dear Jim, Thanks for the Step 1! I will focus on this and will try to find correct answers tomorrow.
- in the example, what are the two things?
1) The 1st thing is "hasa" - the slp1 spelling of the Devanagari spelling in the MD dictionary. The 2nd thing is "hás-a ({%or%} á)" - IAST spelling in the MD dictionary.
2) To identify those two things in the text we can use ":".
Perfect! 👍
Now, what is the Python way to use ':' to split the given text into those two pieces?
Answer: there are 2 straightforward ways to do this string manipulation in Python
text.split(':')
re.split(':',text)
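As a minimal sketch of these two approaches, here they are applied to the example line hasa:hás-a ({%or%} á) quoted in this thread. The variable names parts and parts_re are only for illustration.

```python
import re

# Example line from the thread: slp1 headword, a colon, then the IAST text.
text = "hasa:hás-a ({%or%} á)"

# Way 1: the built-in string method.
parts = text.split(":")
print(parts)

# Way 2: the regular expression module.
parts_re = re.split(":", text)
print(parts_re)

# Both ways give a list of two strings.
print(parts == parts_re)
```

In this simple case the two give the same list, so the choice between them is mostly a matter of taste.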
There are many good tutorial resources online to explain basic concepts in Python.
One resource that may be helpful here is https://www.w3schools.com/python/.
This website explains simple concepts and provides a 'playground' where you can experiment.
Some top-level topics relevant to splitting strings are:
@AnnaRybakovaT So experiment some with these topics. When you feel comfortable with using Python to split our text into a list with two parts, we'll then think how to fit this into a step1/readwriteA2.py program.
One resource that may be helpful here is https://www.w3schools.com/python/.
Dear Jim, Thank you very much! This resource is amazing!!!
Regarding x = [1,'apple','pie']: this is a list consisting of 3 items: x[0] = 1, x[1] = 'apple', x[2] = 'pie'. As I understand it, x[3] doesn't exist in this list.
len(x) is a function which returns the number of items in an object. For our list "x" the result of this function is 3.
Looks like you've got the idea of list. When you've got the idea of split(), Go ahead and:
Dear Jim, I need your help, please!
1) I made a new directory 'step1' (as I thought). After this command a folder 'Step 1' was created:
2) But when I wanted to copy the file readwriteA1.py, I got the message "step 1 is not directory"
Where is my mistake?
The 'cp' (copy) command normally takes 2 arguments, not 3 arguments:
cp path-to-old-file path-to-copy-file
But, as with many unix commands, there are various ways to use the command, so your particular usage was interpreted as
cp first-old-file second-old-file target-directory
So, reformulate your command to use the normal 2 arguments.
That's the first point. Now, the second point is related to details regarding how to specify the 'path-to-old' and 'path-to-new'. These paths are generally relative to your 'current directory'.
The location of your current directory is shown in the git bash prompt as ~/Documents/sanskrit-lexicon/MD/deva_iast_comp/step0, or, informally, as step0.
[This can also be found by the Unix command pwd; "pwd" = "print working directory"].
In the next comments, I'll assume that your current directory is step0.
Where is 'old-file' relative to current directory? Well, that's easy -- readwriteA1.py is in step0.
You can check this by the ls command ("ls" = "list directory contents"). Try it!
So the first part of the desired cp command is as you wrote it: cp readwriteA1.py path-to-new
Now, suppose you used the command cp readwriteA1.py readwriteA2.py -- what would this do?
Give it a try, and then do an ls.
You don't really want readwriteA2.py to be in the step0 directory.
Use the 'rm' command to delete the unwanted file ("rm" = "remove").
Recall the discussion in #3 about '../'. Try the 'cp' command again with 'path-to-new' as <something>readwriteA2.py.
Do an ls. Do you find readwriteA2.py ? If not, where is it? Do a 'cd' command to get your current directory to be 'step1'. Then do an 'ls' -- Have you found readwriteA2.py now?
Try various 'cp' and 'ls' commands and 'cd' commands. Remember you can use 'rm' commands to get rid of unwanted copies.
You can do Google searches such as 'unix cp command' to get further information (sometimes more than you want!). One source that turned up for me was https://www.tutorialspoint.com/unix_commands/cp.html
and similar for others. But you will really need only a few Unix commands, most of which are mentioned above.
Dear Jim, Thank you so much!
cp readwriteA1.py readwriteA2.py -- what would this do
This command will create the file readwriteA2.py in the current directory. I understood this from the beginning, and for this reason I was thinking about how to create a new file in another directory. Thanks to your explanations, the solution has now been found. This is the command: cp readwriteA1.py ../step1/readwriteA2.py
- Now modify readwriteA2.py so it outputs the split lines. There will be some choices regarding what should go into 'newlines'
Dear Jim, I was trying to modify this file in different ways, unfortunately still without result. I have to learn more about Python and I hope in a while I will find a solution. Otherwise I will explain my ideas to you and you will guide me to find a correct path.
Your revised cp command is just right!
I'll wait until you request another hint for readwriteA2.py.
Dear Jim, I have just finished analyzing the words_mw_noneng.txt and now I can focus more on Python commands. I hope tomorrow or the day after tomorrow I will be able to continue the Step 1.
- Now modify readwriteA2.py so it outputs the split lines. There will be some choices regarding what should go into 'newlines'.
Dear Jim, Finally - I need your help. As I guess I should start from updating the function adjustlines(lines). We can split every line by: newline = line.split(":") or newline = re.split(":", line) But here we receive not line but a list of 2 new lines. In this case probably should be: newlines = line.split(":") or newlines = re.split(":", line)
In this case, the updated part of the program looks like this: def adjustlines(lines): newlines = [] for line in lines: newlines = line.split(":") newlines.append(newlines) return newlines
Probably I am missing something important regarding list and string/line (see the Error message).
If I run this program, the result is a file readwrite A2.text but only with the 1st split line.
Dear Jim, Maybe I have found a solution. Finally the function looks like this:
def adjustlines(lines): newlines = [] for line in lines: x1 = line.split(":") newline1 = x1[0] x2 = line.split(":") newline2 = x1[1] newlines.append(newline1) newlines.append(newline2) return newlines
And the result is this list:
Your revised form is certainly one possibility, out of many possibilities.
(minor Note: you actually don't need 'x2').
In fact, we don't at this stage know just what will be the best output, because we are only at the beginning stage of analysis.
I'll dream up a couple of other possibilities, just to show you additional useful techniques.
In the meantime,
Here's what looks like a useful next step in our analysis: We are wanting to compare the spellings of 'newline1' (the slp1 spelling of headword) with the IAST spelling that (sometimes) appears at the beginning of newline2. But to do that, we must get rid of junk at the end of newline2.
Look at the data and describe in words what we need to get rid of in newline2 in order just to be left with the IAST. Go ahead and post your answer in a comment.
Python often touts itself as the programming language with batteries included, which means it includes many modules with specialized capabilities to help the programmer solve common problems. One of the modules used in text processing is the 'regular expression' module.
A program that uses the regex module must import it, by import re. You'll see that in our readwrite program, the module has been imported (import sys,re,codecs which imports three modules).
In fact, you've already seen one usage of regex module in 're.split'.
Now, I'm pretty sure that we can use re.sub (regular expression substitution) to get rid of the junk.
Essentially we will use something like newline3 = re.sub(JUNK,'',newline2) to replace JUNK in newline2 with an empty string. Here JUNK is a regular expression pattern that describes the portion of the text of newline2 that we want to remove.
So once we have an answer to question 1, our task reduces to translating the answer into a regular expression pattern (or perhaps our problem will require more than one pattern).
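To make the shape of such a substitution concrete, here is a minimal sketch. It assumes, purely for illustration, that the junk starts at the first space character; the pattern JUNK and the helper name remove_junk are hypothetical and are not necessarily what the final program will use.

```python
import re

# ASSUMPTION (illustration only): junk = the first space and everything after it.
JUNK = r" .*$"

def remove_junk(newline2):
    """Return newline2 with the part matching JUNK replaced by an empty string."""
    return re.sub(JUNK, "", newline2)

print(remove_junk("ághnya ({%also%} {@-yá@})"))
print(remove_junk("hás-a ({%or%} á)"))
# A string without any space has nothing matching JUNK, so it comes back unchanged.
print(remove_junk("nāgarī"))
```

If the real junk turns out to follow a different rule, only the JUNK pattern needs to change; the call to re.sub stays the same.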
You can get started learning about regular expressions with online tutorials such as https://www.w3schools.com/python/python_regex.asp.
In your example above you showed your revised adjustlines function, but note that the indentation is lost. If you 'edit' the comment, the Python indentation is present. If you precede and follow a chunk of text with triple back-quote, then the indentation is retained. Next I have copy-pasted your function code and put it in triple back-quotes:
def adjustlines(lines):
    newlines = []
    for line in lines:
        x1 = line.split(":")
        newline1 = x1[0]
        x2 = line.split(":")
        newline2 = x1[1]
        newlines.append(newline1)
        newlines.append(newline2)
    return newlines
- (in two readme files, one old and one new
Dear Jim, As I see - I should create one new file readme.txt in /step 1 and update the old file in /deva_iast_comp (not in /step0). Is it correct?
(minor Note: you actually don't need 'x2')
Just out of curiosity I ran the program readwriteA3_test.py where "x2" was deleted. The output (readwriteA3_test.txt) consists only of the 1st parts of the lines (before ":"). Could you check this program and identify my mistake, please?
Question 1: What is junk here?
Here are some examples from lines 2:
ághnya ({%also%} {@-yá@})
á-dṛp-ita {%or%} {@-ta
áhas = áhar
First of all we should delete the data after the main word. For this it is possible to use the function re.split with some patterns: "(", "{" and " ". For now I know only this function, but as I see, my next task is re.sub.
In the next step we should delete "-" inside our words.
And finally we don't need the accent marks.
Question 1: What is junk here?
There are also some words with junk ("~" and "*") in front: ~naṣ-ṭa *nāgarī
in two readme files, one old and one ...
Yes, that was the idea.
readwriteA3_test.txt
readwriteA2.py has:
x1 = line.split(":")
newline1 = x1[0]
x2 = line.split(":") # only this line is extraneous
newline2 = x1[1]
# We want to add the new line to our list of new lines.
# 'append' is the way to do that
newlines.append(newline1)
newlines.append(newline2)
while readwriteA3_test.py has
x1 = line.split(":")
newline1 = x1[0]
# We want to add the new line to our list of new lines.
# 'append' is the way to do that
newlines.append(newline1)
builtin?
Noticed you changed 'builtin' to 'builting' or maybe it was 'building' in a comment (e.g., line 36 of readwriteA2.py).
There is a building way to split strings into a list (separator is ":")
should be
There is a builtin way to split strings into a list (separator is ":")
'builtin' (or 'built-in') appears to be a somewhat technical word, which should in this context be used in place of 'building' or 'builting' (I don't think 'builting' is in the English lexicon).
In this case, the sense is that we don't have to write a function to split a string into a list of substrings. Instead, the python distribution already has solved this problem -- that is, there is a solution already 'built into Python'. Or equivalently, there is a Python 'builtin' solution. All we have to do is learn how to use one of the string splitting tools built into python.
First of all we should delete data after the main word.
That looks promising. It looks to me that the 'main word' (in newline2) never contains a space character. If this is true for our ../data.txt, then we can say that 'newline3' (which is to contain only the main word from newline2) is defined by removing the first space character plus all subsequent characters in newline2. This looks almost right, but there are some cases where there is NO space character, such as the 'abnormal' lines.
Informally, we can say
Actually, from this informal statement, I think we can get newline3 by using another split on newline2. @AnnaRybakovaT do you see how to use split to get newline3? Give it a try. Then we'll try to develop a similar solution using regular expressions. Write a readwriteA3.py (or whatever you want to call the program).
Sometimes, when you want to experiment with a small bit of code, it is helpful to use python interactively. To do this, just type 'python' (return) in the terminal. Here's what it looks like:
$ python
Python 3.9.1 (tags/v3.9.1:1e5d33e, Dec 7 2020, 17:08:21) [MSC v.1927 64 bit (AM
D64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>
There is a blinking vertical line after '>>> ' indicating the program is waiting for you to type something.
The first thing to do is learn how to exit the interactive python session.
One way to do this is by typing 'quit()'.
$ python
Python 3.9.1 (tags/v3.9.1:1e5d33e, Dec 7 2020, 17:08:21) [MSC v.1927 64 bit (AM
D64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> quit()
jimfu@DESKTOP-6PTUC6R MINGW64 /c/xampp/htdocs/sanskrit-lexicon/md/deva_iast_comp/step1 (master)
$
Now, you're back to the git bash terminal.
You can also do 'python -i' instead of 'python'.
Python 3.9.1 (tags/v3.9.1:1e5d33e, Dec 7 2020, 17:08:21) [MSC v.1927 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 'roses are red and violets are blue'
>>> x.split(' ')
['roses', 'are', 'red', 'and', 'violets', 'are', 'blue']
>>> x = 'antidisestablishmentarianism' # yes, that is a word!
>>> x.split(' ')
['antidisestablishmentarianism']
>>> quit()
Try to use some of the lines in ../data.txt as values of x, and try to use split to get newline3. When you've got it perfected in an interactive session, then you'll be ready to write readwriteAX.py
It's optional whether to use the interactive python.
Sometimes, instead, I'll write a short temp.py program as another way to test out an idea.
Or I'll write a test() function in the program under development.
Some people might also use the google colab for similar testing.
You'll develop your own preferences as time goes by.
only this line is extraneous
Dear Jim, Thanks a lot! Now I understood.
builtin' (or 'built-in'
Thanks! I will correct now the files.
This looks almost right, but there are some cases where there is NO space character, such as the 'abnormal' lines.
Dear Jim, I was thinking about how to split lines2, but I have no other ideas except using the space character. I ran this test program:
def adjustlines(lines):
    newlines = []
    for line in lines:
        x1 = line.split(":")
        newline1 = x1[0]
        x2 = line.split(":")
        newline2 = x1[1]
        x3 = newline2.split()
        newline3 = x3[0]
        newlines.append(newline1)
        newlines.append(newline2)
        newlines.append(newline3)
    return newlines
Output (readwriteA3_test.txt) is not so bad, since new lines3 include:
In your comment, you should
Why don't you edit the above comment with these changes. -- do you see the difference?
do you see the difference?
Many thanks! I realized my mistake - the keywords are on separate line!!!
Your solution re splitting on space looks fine. You used newline2.split() where you use the default argument for split. I think this usage means split on any character which is considered to be 'white space'.
(Ref: https://www.w3schools.com/python/ref_string_split.asp).
I would probably have used newline2.split(" ") which would have split only on the space character.
But in the context of this program, the two probably give the same result.
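A small sketch of where the two calls can differ. The string s is made up for illustration (it contains two consecutive spaces and a tab); the string t is one of the examples quoted earlier in this thread.

```python
# Made-up string with two consecutive spaces and a tab, to make the difference visible.
s = "áhas  =\táhar"

print(s.split())     # default: split on runs of any whitespace
print(s.split(" "))  # split at each single space character only

# With single spaces and no other whitespace, the two calls agree.
t = "áhas = áhar"
print(t.split() == t.split(" "))
```

So the two forms should agree on our data as long as the text after the colon contains only single spaces and no tabs; either way, the first item of the list is the part we keep as newline3.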
We now have 3 lines of output for each line of input. And the .txt file is getting a bit hard to read. Suggestions:
Do you think these changes make the output a bit easier to read?
%s?
There are many ways to construct strings. One very flexible way involves '%s'. This is currently considered 'old-fashioned' in Python, but I still use it a lot. (Try this reference for an introduction: https://www.learnpython.org/en/String_Formatting).
Here is an example of how the 3rd line might appear based on the above 'suggestion for output'
-------------------------------------
orig = aGnya:ághnya ({%also%} {@-yá@})
slp1 = aGnya
rest = ághnya ({%also%} {@-yá@})
iast = ághnya
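One possible way to produce output in this shape with '%s' is sketched below. The function name format_record is hypothetical, and taking the iast as the text before the first space is the assumption discussed earlier in the thread, not a final rule.

```python
def format_record(line):
    """Return the list of output lines for one input line of the form slp1:rest."""
    slp1, rest = line.split(":", 1)   # split only at the first colon
    iast = rest.split(" ")[0]         # ASSUMPTION: iast is the text before the first space
    return [
        "-------------------------------------",
        "orig = %s" % line,
        "slp1 = %s" % slp1,
        "rest = %s" % rest,
        "iast = %s" % iast,
    ]

for out in format_record("aGnya:ághnya ({%also%} {@-yá@})"):
    print(out)
```

In an adjustlines function, each of these strings could be appended to newlines in the same order.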
And the .txt file is getting a bit hard to read.
I absolutely agree. I will try to update the output on Monday (probably tomorrow I will not be in front of computer).
- add a separate spacing line (such as a line of '-' )BEFORE adding newline1
Dear Jim,
You can check the updated output (file readwriteA3.txt).
For adding a separate spacing line I used this command: newlines.append('%s' %"-----------------------")
Regarding this I have a question. I am curious whether there is a simpler way to repeat "-" over the whole length of a string, or just to give a number saying how many times we want "-" to appear?
readwriteA3.txt looks fine.
newlines.append('%s' %"-----------------------")
This is ok but awkward. In fact, if x is any string, and y is the string "%s" % x, then x and y are equal strings. So newlines.append("-----------------------") gives the same result.
$ python -i
Python 3.9.1 (tags/v3.9.1:1e5d33e, Dec 7 2020, 17:08:21) [MSC v.1927 64 bit (AM
D64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = "abcd"
>>> y = "%s" % x
>>> x == y
True
Yes, there is a Python way to do this: x*n, where x is a string and n is a positive integer; the result is a string comprised of n copies of x.
$ python -i
Python 3.9.1 (tags/v3.9.1:1e5d33e, Dec 7 2020, 17:08:21) [MSC v.1927 64 bit (AM
D64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> '-'*10
'----------'
>>> 'ab'*5
'ababababab'
>>> quit()
e.g., set new variable newline3a to be newline with the '-' characters replaced by empty string ''.
You can use the 'replace' method for strings. Research this by searching python string replace.
We don't really need to have both newline3 and newline3a, so just have your output use newline3a.
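A minimal sketch of the 'replace' method, using the IAST form hás-a from the example line discussed in this thread:

```python
newline3 = "hás-a"

# replace(old, new) returns a NEW string with every occurrence of old replaced.
newline3a = newline3.replace("-", "")
print(newline3a)

# The original string is not changed.
print(newline3)

# If the character does not occur, the result equals the original string.
print("nāgarī".replace("-", ""))
```

Because replace returns a new string rather than changing the old one, the result has to be assigned to a variable (here newline3a) to be used later.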
Ultimately, each of the 'abnormal' items will need to be examined individually in md.txt to see if the slp1 and iast are consistent. So it is of interest to know how many of these there are.
One programming way to count the number of abnormal items is to use a counter variable, nabnormal. We need to initialize this to zero (0) BEFORE the loop in the adjustlines function. Within the loop, we need an if clause to test if the line is abnormal, and if it is, then we need to increment our counter: nabnormal = nabnormal + 1
After the loop, the count can be reported, for example:
There are 25 lines marked abnormal
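The counting pattern can be sketched as follows. The test is_abnormal is a hypothetical placeholder (it simply looks for the word 'abnormal' in the line), and the sample lines are made up; the real criterion depends on how abnormal lines appear in ../data.txt.

```python
def is_abnormal(line):
    # ASSUMPTION: placeholder test; replace with the real criterion for data.txt.
    return "abnormal" in line

def count_abnormal(lines):
    nabnormal = 0                      # initialize the counter BEFORE the loop
    for line in lines:
        if is_abnormal(line):
            nabnormal = nabnormal + 1  # increment the counter
    return nabnormal

# Made-up sample lines, for illustration only.
sample = ["hasa:hás-a ({%or%} á)", "xyz:abnormal", "aGnya:ághnya"]
n = count_abnormal(sample)
print("There are %s lines marked abnormal" % n)
```

In the readwrite program the same three pieces (initialize, test, increment) would go directly into adjustlines, with the print statement after the loop.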
We ultimately want to compare the iast from our file with the slp1. One way to do this is to convert the slp1 to iast (save the result in some variable, such as slpiast). Such a conversion (from the slp1 transcoding of Sanskrit to the iast transcoding of Sanskrit) might be called a transliteration.
Sanskrit transliteration is a specialized functionality that is NOT built into Python. Thus we either need to write the necessary functionality ourselves or use an implementation by someone else.
Luckily, there are already ways to convert slp1 to iast.
Let's use the transliteration library that @drdhaval2785 prefers. We will use the 'pip' tool to install a package (See https://www.w3schools.com/python/python_pip.asp for brief general intro to using pip).
pip install indic_transliteration
## This will print a bunch of information to terminal, which you generally can ignore.
When the installation is done, we can test it out:
$ python -i
Python 3.9.1 (tags/v3.9.1:1e5d33e, Dec 7 2020, 17:08:21) [MSC v.1927 64 bit (AM
D64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from indic_transliteration import sanscript
>>> sanscript.transliterate('fziH','slp1','iast')
'ṛṣiḥ'
>>> sanscript.transliterate('rAma','slp1','iast')
'rāma'
>>> quit()
For the next task, generate one more line in the next version of readwrite to compute and show slpiast.
A search will take you to pypi; then click on the homepage, which leads to https://github.com/indic-transliteration/indic_transliteration_py.
@koleslena and @OlgaSoloveva and @DomiCheck and @VladimirWl - does it make sense to you?
@koleslena and @OlgaSoloveva and @DomiCheck and @VladimirWl - does it make sense to you?
Yes, it does.
does it make sense to you?
I know all of these except indic_transliteration; I installed it and tried to use it, but had some problems.
I can help with indic_transliteration. You may post the question and the expected outcome. I will be able to point out where it goes wrong.
Yes, there is a Python way to do this.
Dear Jim, Thanks a lot! Now I can update this command.
newline3a to be newline with the '-' characters replaced by empty string ''.
I updated the iast strings. In addition, 3 more items ('~', "*", "[a]") are replaced by the empty string. As you see, I used the replace function several times, step by step:
newline3a = newline3.replace("-", "")
newline3b = newline3a.replace("~", "")
newline3c = newline3b.replace("*", "")
newline3d = newline3c.replace("[a]", "")
Could I have done it more easily?
One programming way to count the number of abnormal items is to
Dear Jim, Could you explain, please, where I should place this new adjustlines function? Should it be in our program or it is one separated program?
When the installation is done, we can test it out:
Dear Jim, The installation is done but something is wrong.
Probably:
In the question above, the aim was to replace several characters with the empty string. Using a sequence of string replacements is one valid way, as shown above.
There is another way using regular expressions. Here is a silly example, which replaces any character in the string 'x' with 'r', provided that character matches 'n' or 'c'.
python -i
>>> import re
>>> x = 'funny cat'
>>> re.sub(r'[nc]','r',x)
'furry rat'
>>> quit()
Make a variation of the program using re.sub. Then, to convince yourself that your new program gives exactly the same output as before, compare the old and new output files using the 'diff' unix command, e.g. diff <old output file> <new output file>
This command should give NO OUTPUT.
https://www.w3schools.com/python/python_regex.asp Take a few minutes to review the 'metacharacters' section. In the example above the '[' and ']' are metacharacters.
Don't worry about why there is an 'r' in r'[nc]' (this is called a 'raw string'). Try the example with '[nc]' instead -- any difference? I suspect no difference in this case. 'raw string' usage is somewhat complicated.
Regular expressions are powerful (both for searching and for replacing). There is a steep learning curve, but you don't need to know everything about regexes to use them.
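As a sketch of how the chain of replace calls shown earlier might be written with a single re.sub: the function names strip_chained and strip_regex are hypothetical, and the pattern is one possible choice, not the only one. Note that '[' and ']' must be escaped to match the literal text "[a]".

```python
import re

def strip_chained(s):
    # The chain of replace calls, applied one after another.
    s = s.replace("-", "")
    s = s.replace("~", "")
    s = s.replace("*", "")
    s = s.replace("[a]", "")
    return s

def strip_regex(s):
    # \[a\] matches the literal text "[a]"; [-~*] matches any one of - ~ *
    return re.sub(r"\[a\]|[-~*]", "", s)

for s in ["hás-a", "~naṣ-ṭa", "á-dṛp-ita"]:
    print(strip_chained(s), strip_regex(s), strip_chained(s) == strip_regex(s))
```

The two versions agree on these examples, but they could differ in unusual cases (for instance, if removing a '-' creates a new "[a]"), which is one more reason to compare the old and new output files with diff.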
Just modify the given adjustlines function. Add a statement before
for line in lines:
to initialize nabnormal to zero.
The pip install message gave a 'WARNING ...' which suggests that you update pip. You can do this (with the command provided in the WARNING) if you want to. Usually it is not necessary to update pip.
As a general rule, WARNING messages in pip do NOT indicate that anything went wrong with the installation. If something did go wrong, you will see an 'ERROR' message. For instance
$ pip install abracadabraxxx
ERROR: Could not find a version that satisfies the requirement abracadabraxxx
ERROR: No matching distribution found for abracadabraxxx
WARNING: You are using pip version 21.0.1; however, version 22.0.2 is available.
You should consider upgrading via the 'c:\users\jimfu\appdata\local\programs\python\python39\python.exe -m pip install --upgrade pip' command.
Your installation is ok. The funny looking \u1e5b ... is due to an oddity of the Python print function in conjunction with the Git Bash terminal.
This problem occurs with 'print(x)' when 'x' is a string containing non-ascii characters (the ascii characters are the 'usual' Latin alphabet including digits and punctuation).
If you do the same test with the 'cmd' terminal of Windows, you likely won't see those \u representations of unicode characters.
In my Windows installation of Git Bash, I have two things in my '.bashrc' configuration file.
You can do this also. In my computer, the .bashrc file is at path c:/Users/jimfu/.bashrc
Probably yours is similarly located, but at your Windows user name (Rybakova instead of jimfu). It is possible that this file does not exist; in that case just create one. It is a text file.
Put this line into the .bashrc file.
alias python='winpty python.exe'
Save .bashrc, open a new GitBash terminal window and try the example again.
Does this solve the problem with the \u... ?
Note: When you write to a file opened as in the readwrite program, then you will NOT see this problem. You could make a simple test program:
# coding=utf-8
"""temp_translit.py
 USAGE: python temp_translit.py temp_translit.txt
 Tests indic_transliteration module
"""
from __future__ import print_function
import sys,re,codecs
from indic_transliteration import sanscript
if __name__=="__main__":
    fileout = sys.argv[1] # name of the output file
    lines = []
    lines.append(sanscript.transliterate('fziH','slp1','iast'))
    lines.append(sanscript.transliterate('rAma','slp1','iast'))
    # write the transliterated strings as utf-8, one per line
    with codecs.open(fileout,"w","utf-8") as f:
        for line in lines:
            f.write(line+'\n')
Then run the program, and check the output file. Does the output look right?
This continues the programmatic analysis of differences between two 'things':
See the discussion at https://github.com/sanskrit-lexicon/csl-orig/issues/628.
Review the comment regarding analysis of the case hasa:hás-a ({%or%} á).
Before applying Python tools, we need to informally answer two questions:
hasa:hás-a ({%or%} á). What detail of this text can we use to identify those two things?
@AnnaRybakovaT What is your answer to these two questions?