Closed funderburkjim closed 1 year ago
Get latest version of this pwk repository from github:
git clone X
where X is here:
Similarly, get up-to-date local installation of sanskrit-lexicon/csl-orig repository.
@AnnaRybakovaT I have prepared a 'sample' for the change. It just changes 'Pron.' to 'Pronoun'. Files are in the pwk/pwkissues/issue91 folder. Once you are set up as above, try
python change_sample.py temp_pw_0.txt tempanna_change_sample.txt
# then
diff change_sample.txt tempanna_change_sample.txt
# there should be no output (meaning the files are the same)
At this point, make a copy:
cp change_sample.py change_1.py
Then modify the change_1.py to do what we really want to do here.
python change_1.py temp_pw_0.txt change_1.txt
When done, modify readme.txt to say what was done. Then push the revised pwk repository.
Good luck!
@AnnaRybakovaT I got a notification of an email from you from this issue, part of which said
that you got an error when trying
python change_sample.py temp_pw_0.txt tempanna_change_sample.txt
because the file temp_pw_01.txt was not found.
Looks like I forgot to say something.
Here's what to do:
Oops. I goofed again.
Above, I should have said
copy that file to your copy of pwk/pwkissues/issue91/temp_pw_0.txt
I think my comments in readme.txt of issue91 directory are correct regarding temp_pw_0.txt.
I did make one change in readme.txt in pwk/pwkissues/issue91, so please pull pwk again first.
Jim
I got a notification of an email from you from this issue, part of which said that you got an error when trying
Dear Jim, Yes, in the beginning I had this problem, but after reading readme.txt (better to always do that first) it was solved.
So, I need help. After a long pause I tried to refresh my knowledge, but it was not enough. Here is my chain of reasoning: 1) there is a correction which has to be made
old: <div n="1">— 1) {%erfreuen. {#anumodita#} erfreut.%}
new: <div n="1">— 1) {%erfreuen.%} {#anumodita#} {%erfreut.%}
2) To identify lines with italic Devanagari we can use regular expressions and metacharacters. For searching you used this expression: "{%[^%]+{#" Could you explain each symbol to me, please? As I see it, there are two conditions, {%[^% and {#. I can't understand the symbol "[" at all. As far as I know, the symbol "^" means 'starts with'.
3) Is it possible to make the changes just with line.replace or re.sub (if we use metacharacters to specify exactly where and how we want to make the changes)?
OK. This is a question of how to do string replacements with the 're' (regular expression) module of Python.
Regular expressions can be quite complicated, and our current task is moderately advanced. Let's use the current task as motivation for your becoming self-sufficient with regular expressions in general, and more specifically in Python.
I assume you know the basics of regular expressions and how to use
re.search, re.sub, re.findall in Python?
There are many online beginner tutorials on Python regular expressions,
for example https://www.w3schools.com/python/python_regex.asp .
In a regular expression pattern, [X] usually matches any one character from a set of characters.
For example, [dh] is a set of two specific characters, 'd' and 'h'.
X can also be written to match a range of characters: [b-f] matches any one of the 5 characters b, c, d, e, f.
And there are a few other details regarding how X is interpreted.
The ^ character does usually mean 'starts with' in a regular expression pattern.
However, ^ has a different meaning when it is the first character inside the brackets of a character set, i.e.,
[^X] matches any single character except those in X.
So [^dh] matches any character except 'd' and 'h',
and [^%] matches any character except '%'.
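A small sketch of the set notation described above, using only Python's standard re module. The sample strings are chosen for illustration; the last search reuses the "{%[^%]+{#" expression from the question, with the braces escaped.

```python
import re

# [dh] matches one character that is either 'd' or 'h'.
set_dh = re.findall(r'[dh]', 'dharma')
print(set_dh)       # ['d', 'h']

# [b-f] matches one character in the range b, c, d, e, f.
range_bf = re.findall(r'[b-f]', 'abcdefg')
print(range_bf)     # ['b', 'c', 'd', 'e', 'f']

# [^dh] matches one character that is NOT 'd' or 'h'.
not_dh = re.findall(r'[^dh]', 'dharma')
print(not_dh)       # ['a', 'r', 'm', 'a']

# [^%]+ matches a run of one or more characters, none of which is '%'.
m = re.search(r'\{%[^%]+\{#', '1) {%erfreuen. {#anumodita#} erfreut.%}')
prefix = m.group(0)
print(prefix)       # {%erfreuen. {#
```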
Symbolically, we can represent our task approximately as
{%X{#Y#}Z%} -> {%X%}{#Y#}{%Z%}
When you're ready, we'll think together further about our task. Questions always welcome at any point.
{%X{#Y#}Z%} -> {%X%}{#Y#}{%Z%}
Dear Jim, Thank you very much for the explanations. I realised my misunderstanding regarding sets of characters [ ].
If I want to use re.search, I know exactly what we would like to replace: {%X{#Y#}Z%} -> {%[^%]+{#.+#}
BUT I have problems with the second component {%X%}{#Y#}{%Z%} -> ?
newline = re.sub(r'{%[^%]+{#.+#}', '?', line)
In the beginning I wanted to use one more regular expression (something like this): ? -> {%.%}+{#.+#}{% but it doesn't work
Could you kindly give me some advice?
A name for the 2nd argument of re.sub (the one you call '?') seems to be the repl argument. A name for the 1st argument might be the 'match pattern' or just the pattern argument. A near solution to our problem can be found by using two features:
() Capture and group
in the pattern to identify X, Y, and Z.
Suppose we wanted to change 'pXt' to 'bXt' in any string, where X is a vowel. So pit -> bit, pat -> bat, etc.
import re
pattern = 'p([aeiou])t'
repl = r'b\1t' # r means 'raw' string See note below
old = 'Please pet the dog'
new = re.sub(pattern,repl,old)
print(new) # Please bet the dog
old1 = 'Where is the pit?'
new1 = re.sub(pattern,repl,old1)
print(new1) # Where is the bit?
The '\1' in repl refers to the first matching group in pattern.
In our example there is only one matching group.
So what is the reason for the r
in repl?
Try the example without the 'r'. You will see that the vowel is not properly
represented in 'new'. The reason is that 'repl' is a Python string, and in a Python string the backslash normally has a special meaning. For instance '\t'
represents the tab character. But in our 'repl' string, we want to turn off this special meaning.
For instance r'\t'
would represent a 2-character string.
This r'X' syntax is referred to as a "raw string" in Python.
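To make the raw-string point concrete, here is a short sketch that runs the 'pet' example both with and without the r prefix. The variable names with_raw and without_raw are introduced here only for illustration.

```python
import re

# In an ordinary string, '\t' is one character (a tab) and '\1' is the
# single character with code 1. In a raw string the backslash is kept.
print(len('\t'), len(r'\t'))    # 1 2
print(len('\1'), len(r'\1'))    # 1 2

pattern = 'p([aeiou])t'
old = 'Please pet the dog'

with_raw = re.sub(pattern, r'b\1t', old)     # \1 reaches re.sub as a backreference
without_raw = re.sub(pattern, 'b\1t', old)   # Python already turned \1 into chr(1)

print(with_raw)                 # Please bet the dog
print(repr(without_raw))        # 'Please b\x01t the dog'
```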
Try to apply this information to our situation. I think the solution should be ALMOST right.
Try to get pattern and repl to change
'blah {%heavy {#guru#} teacher%}'
'blah {%heavy {#guru#} teacher%}'
Dear Jim, Please, check this solution:
import re
old = 'blah {%heavy {#guru#} teacher%}'
x = re.findall(r'{%[^%]+{#.+#}', old)
if (x):
    old1 = old
    pattern = '({#.+#})'
    repl = r'%} \1 {%'
    new = re.sub(pattern,repl,old1)
    print(new) # blah {%heavy %} {#guru#} {% teacher%}
In general it works. BUT I need your help with our python file. Could you explain where I am wrong:
def make_changes(entries):
n = 0
for entry in entries:
changes = []
for iline,line in enumerate(entry.datalines):
x = re.findall(r'{%[^%]+{#.+#}', line)
if (x):
line1 = line
pattern = '({#.+#})'
repl = r'%} \1 {%'
newline = re.sub(pattern,repl,line1)
if newline == line:
continue
change = Change(iline,newline)
entry.changes = changes
$ python change_1.py temp_pw_0.txt change_1.txt
File "change_1.py", line 28
newline = re.sub(pattern,repl,line1)
^
IndentationError: unindent does not match any outer indentation level
Please add/push -- I'll need to see the file.
I should have said 'add/commit/push' . You might also do a 'git status' after 'add', to check that you are committing only what you intend to commit.
I'll need to see the file
Dear Jim, Now you can see the file change1.py
@AnnaRybakovaT You have tab characters in lines 25,26,27. Maybe your text editor did this, maybe you did it on purpose. Either way, bad idea. That is what caused the IndentationError message which means things did not 'line up' properly.
Recommend you follow the 'one-space' indentation method that I use in this file and other code files. Go ahead and make this change, so at least your change_1.py program runs.
When you get past the indentation problem, there will likely still be some problems with your code, although I'm not sure how the problems will manifest themselves. Struggle a bit with a solution. Do a few exercises on re.findall, re.sub, re.search such as at that w3schools link, and perhaps some other online resource that you find.
When you're more comfortable with what these re functions do, come back to our problem.
You might try to learn how to do small code tests, where the problem area is isolated from its current rather complex environment within change_1.py.
When you're ready, upload further test code and/or new change_1.py and describe where stuck in a comment here.
In your test programs, use a function 'change_line(old)' to do the work. It will return 'new'. Both old and new are strings. If the line does not have the pattern, the returned 'new' will be the same as 'old'.
You can try out different inputs for 'old' in your test program, and the 'main' part of the test program will print out old and new.
When the change_line function works in your tests, then you can copy it into change_1.py, and then
make_changes function will say newline = change_line(line); if newline == line: continue, else .....
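A minimal sketch of such a stand-alone test program, assuming a hypothetical three-group pattern built from [^%] and [^#]. The pattern and the sample inputs are illustrative only and are not taken from change_1.py.

```python
import re

# Hypothetical pattern: an italic span {%...%} holding one {#...#} span.
# Groups: 1 = text before, 2 = Devanagari text, 3 = text after.
PATTERN = re.compile(r'\{%([^%]*)\{#([^#]*)#\}([^%]*)%\}')

def change_line(old):
    """Return the changed line; if the pattern is absent, return old as is."""
    return PATTERN.sub(r'{%\1%}{#\2#}{%\3%}', old)

if __name__ == '__main__':
    tests = [
        'blah {%heavy {#guru#} teacher%}',   # should change
        '{%XX%} {#YY#} {%ZZ%}',              # already correct, should not change
        'no markup here',                    # should not change
    ]
    for old in tests:
        new = change_line(old)
        print('old:', old)
        print('new:', new)
```

Because change_line returns its input untouched when nothing matches, the caller can detect "no change" with a plain equality test between the old and new strings.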
at least your change_1.py program runs
Dear Jim, I have corrected lines 25-27, but still my program doesn't run:
$ python change_1.py temp_pw_0.txt change_1.txt
682619 lines read from temp_pw_0.txt
135788 entries found
Traceback (most recent call last):
File "change_1.py", line 109, in <module>
write_changes(fileout,entries)
File "change_1.py", line 45, in write_changes
nchange = nchange + len(entry.changes)
AttributeError: 'Entry' object has no attribute 'changes'
You might try to learn how to do small code tests,
I will try to do it on Monday. Right now I would like to wish you a Happy New Year! My best wishes!!!!!!!!
The error is due to an indentation problem at line 38 - that is where 'entry.changes' is set. There are also a couple of other indentation problems.
I have corrected these in file change_1_ejf_01.py.
Happy New Year to you as well! -- Maybe the new year has already started in your time zone?
Maybe the new year has already started in your time zone?
Dear Jim, Thanks for your good wishes! In Greece the new year started a few hours after I sent this message. I had enough time to make dinner since we celebrated at home.
You might try to learn how to do small code tests
To check whether my regex functions work, I have written the Python program test1.py (by adapting our first Python program, readwrite.py). test1.py reads the lines of test1.txt and writes all lines (including updated lines) to the file results1.txt
When the change_line function works in your tests, then you can copy it into change_1.py, and then make_changes function will say
newline = change_line(line); if newline == line: continue, else .....
But I have problems with this step. After an hour of unsuccessful tests I need your help. Please check the file change_test.py
problem at line 38
Thanks! I tried to be attentive, but I missed this error.
Really nice comments in test1.py.
They are a big help in knowing your thoughts.
Also, the organization of 'main' is excellent.
I have made a new version, which simplifies 'change_lines' so all the work of changing one line is done in a separate change_one_line function
This is a refactoring of test1.py.
Reason: To make it easier to debug how we handle each line of input.
Note the comments added to readme.txt.
in change_one_line, temporarily print out 'x', which is returned by re.findall. Is x ever None? Is this what you expected? Is 'x' ever used otherwise? Is re.findall needed at all?
Ponder the logic of change_one_line. Make a new version, test3.py. And experiment some with change_one_line. Continue the good documentation comments in revised change_one_line of test3.py, and be sure to add usage comments to readme.txt. When you're done (or stuck) add/commit/push. NOTE: In your commit message, add '#91' -- Do you see what this does in the comments of this issue?
In this case, we still have problems to solve
Note: That temporarily print out 'x'
comment should be done in test3.py. Leave test1 and test2 as they are.
Comment on change_test.py.
Your 'change_line' function in change_test is 'grammatically correct' (properly indented, etc.) .
There are several problems in make_changes function. The program fails with error:
File "C:\xampp\htdocs\sanskrit-lexicon\PWK\pwkissues\issue91\change_test.py",
line 31, in make_changes
newline = change_line(line)
NameError: name 'line' is not defined
This is because change_test is missing the loop over entry.datalines:
for iline,line in enumerate...
of change_1.py
When you get change_test to work, Let's continue work with test3, etc.
When test99.py 😄 is working properly, then we will be ready to go back and replace 'change_line' with perfected 'change_one_line' function.
DEBUGGING gets complicated, and somewhat hard to discuss. Persevere!
Note: That
temporarily print out 'x'
comment should be done in test3.py. Leave test1 and test2 as they are.
Dear Jim, please see the file test3.py. As you mentioned before, re.findall was not a necessary function. To be honest, from the beginning I wanted to use a pattern with two groups, but since I am not so confident with the syntax, I probably made mistakes in some symbols. Since I couldn't find that solution, I started looking for other options, and re.findall and re.search were among them.
This is because change_test is missing the loop over entry.datalines:
for iline,line in enumerate...
of change_1.py
When you get change_test to work,
Thank you, now I can run the program, but still without results: this program doesn't make any changes. I did some tests, but unfortunately with the same result.
Do you see what this does in the comments of this issue?
Yes, I see))
AnnaRybakovaT added a commit that referenced this issue Jan 3, 2023 @AnnaRybakovaT https://github.com/sanskrit-lexicon/PWK/issues/91 - test3.py Batch 4
Really nice comments in test1.py.
Of course all the comments are nice, since they belong to you (I just reused almost all your comments from the first Python program). Sorry, I do not want to upset you, but my skills are not so high yet.
Note the comments added to readme.txt.
Dear Jim, I wanted to compare result1.txt and result2.txt with the 'diff' utility (as mentioned in readme.txt). There are NO differences between them and nothing was printed to the terminal. I also tested two more files which surely differ, but again NOTHING was printed to the terminal. I would like to know what I am doing wrong:
Rybakova@ST-Rybakova MINGW64 ~/Documents/sanskrit-lexicon/PWK/pwkissues/issue91 (master)
$ git diff result2.txt readme.txt
Of course all comments are nice since they belong to you
This made me laugh, as I didn't recognize these comments as coming from me. Usually my comments are not so good.
git diff result2.txt readme.txt
not git diff file1 file2
use diff file1 file2
diff result2.txt readme.txt
will give lots of output, since these two files are completely different.
diff result1.txt result2.txt
will give no output, since these two files are identical.
The 'git' program DOES have a 'diff' option (sort of like 'git add' runs the 'add' option of git). As with all git options, the diff option of git has many variations (which can be seen by a google search on 'git diff'). The only variation I understand and use (but only occasionally) is
git diff FILENAME
(one argument FILENAME).
Here is a description of when 'git diff FILENAME' might be useful.
Suppose you make a small change to change_1.py (or result1.txt, any file tracked by git).
When you do a 'git status', this changed file will be listed as modified.
But you forgot what change you made to the file. So the file is (a) modified but (b) not yet committed. How can you find out exactly what was changed in that file?
Answer: git diff FILENAME.
Next comment will walk through an example of git diff FILENAME
git status
I really didn't want to make those changes to result1.txt. 'git restore' to the rescue to undo the changes. Notice that in the last 'git status', there was this helpful comment
(use "git restore <file>..." to discard changes in working directory)
change_one_line is close to the desired.
I found it awkward to compare individual lines of test1.txt with those of result3.txt. test4.py revises write_lines, so as to make the comparison easier. @AnnaRybakovaT Study write_lines in test4.py (and the minor change to 'main'). I tried to do 'good commenting'. Also wrote usage note in readme.
Next steps for you, @AnnaRybakovaT .
newstring = re.sub(pattern, FUNCTION, string)
This is the form.
This technique is considered advanced, and I haven't found a good thorough explanation. Here are some preliminary web resources to get you started.
Using a string with backreferences (e.g. '\1' for repl) is not quite flexible enough in our
situation. Using a function for repl provides enormous flexibility.
In our case, we might take pattern as all italic texts: {%.*?%}
[Do you understand the '?'].
Then re.sub(pattern, replfunc, line) would start with each italic substring in line, do some changes to that italic substring as required, return the replacement for the italic substring,
and then re.sub would replace the italic substring with that replacement.
So, anyway, get started with this using the examples sources given above.
When you're ready, I'll contribute further suggestions.
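As a hedged illustration of the re.sub(pattern, FUNCTION, string) form: the function receives a match object for each match and must return the replacement string. In this sketch, replfunc only wraps italic spans that contain Devanagari in [[...]] as a placeholder change; the sample line and that placeholder are assumptions made for the example, not the actual change needed for this issue.

```python
import re

def replfunc(m):
    """Called by re.sub once per match; m is a match object."""
    italic = m.group(0)                  # the whole {%...%} substring
    if '{#' in italic:
        return '[[' + italic + ']]'      # placeholder change, easy to spot
    return italic                        # leave other italic spans alone

line = 'a {%plain italic%} b {%italic {#deva#} more%} c'
marked = re.sub(r'\{%.*?%\}', replfunc, line)
print(marked)
# a {%plain italic%} b [[{%italic {#deva#} more%}]] c
```

The advantage over a fixed repl string is that the function can inspect each italic span on its own and decide case by case what to return.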
Bonus 'git restore'
Thanks! It works))))!
not
git diff file1 file2
use diff file1 file2
diff result2.txt readme.txt
will give lots of output, since these two files are completely different.
Thank you very much!
Study write_lines in test4.py (and the minor change to 'main'). I tried to do 'good commenting'. Also wrote usage note in readme.
Dear Jim, Everything is more than clear (thanks to your nice comments)! Tomorrow I will focus on the replacement function.
{%.*?%}
Dear Jim, The given above sources are really useful and informative. Many thanks!
Unfortunately I still can't find solution for current task. There are just some ideas:
line = '1) {%erfreuen. {#anumodita#} erfreut.%} ddddd'
pattern = r'{%.*?%}' # we use ? for non-greedy quantifier
def FUNCTION_new_pattern (pattern):
    # I am not sure about re.findall, but the idea is to check if a pattern includes {#.+#}
    x = re.findall(r'{#.+#}', pattern)
    if (x):
        new pattern = r'({%.+)%}({#.+#}){%(.+%})' # I don't know if it will work (my idea - to show 3 groups and new symbols %} {%
    else:
        new pattern = pattern
newstring = re.sub(pattern, FUNCTION_new_pattern, line)
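On the '?' in {%.*?%}: a quick sketch of the difference between the greedy and the non-greedy quantifier. The sample line is invented for illustration.

```python
import re

line = '{%one%} plain {%two%}'

# Greedy: .* grabs as much as possible, so both spans merge into one match.
greedy = re.findall(r'\{%.*%\}', line)
# Non-greedy: .*? stops at the first %} it can, so each span matches separately.
non_greedy = re.findall(r'\{%.*?%\}', line)

print(greedy)       # ['{%one%} plain {%two%}']
print(non_greedy)   # ['{%one%}', '{%two%}']
```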
@AnnaRybakovaT Try out the test5.py program.
Readme provides basic usage.
Exercise 1: Try to match patterns beginning with {% and ending with %}.
What pattern to use so that there is one match group containing the text between {% and %}?
Exercise 2: Find a PATTERN that matches 3 parts of LINE blah {%XX {#YY#} ZZ%} more blah
and similar lines, and has as its 3 match groups XX and YY and ZZ (note the space ending the X group, and the space beginning the Z group).
Exercise 3: With the PATTERN of exercise 2, what do you get if LINE is blah {%XX {#YY#}, ZZ%} more blah ?
Note the comma after {#YY#}. Try different things for XX, etc.
Play around with many PATTERN LINE combos to help you get comfortable with regex pattern matching and match groups.
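test5.py itself is not reproduced in this thread, so the following is only a guess at an equivalent helper for experimenting with PATTERN and LINE combinations. The function name show_groups and the exact report format are assumptions, modeled on the output pasted in later comments.

```python
import re

def show_groups(pattern, line):
    """Return report lines showing what each match group of pattern captures."""
    out = ['LINE |%s|' % line, 'PATTERN |%s|' % pattern]
    m = re.search(pattern, line)
    if m is None:
        out.append('no match')
        return out
    out.append('m.group(0) |%s|' % m.group(0))
    ngroups = m.re.groups                 # number of ( ) groups in the pattern
    out.append('PATTERN has %d match groups' % ngroups)
    for i in range(1, ngroups + 1):
        out.append('m.group(%d) |%s|' % (i, m.group(i)))
    return out

# One possible answer to Exercise 1: a single group between {% and %}.
for text in show_groups(r'\{%(.*?)%\}', 'blah {%XX {#YY#} ZZ%} more blah'):
    print(text)
```

The vertical bars around each value make leading and trailing spaces in a group visible, which matters for Exercises 2 and 3.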
@AnnaRybakovaT
If you want to fiddle with regular expressions to understand how they capture the strings, you can try https://regexr.com/
Play around with many PATTERN LINE combos to help you get comfortable with regex pattern matching and match groups.
Dear Jim, Thanks for the above exercises and test5.py I have found a pattern which helped to construct the "different lines" by the most optimal way (please see files test6.py and result6.txt).
Try to match patterns beginning with {% and ending with %}. What pattern to use so that there is one match group containing the text between {% and %}.
({%.?%}) -> m.group(1) |{%XX {#YY#} ZZ%}|
{%(.?)%} -> m.group(1) |XX {#YY#} ZZ|
({%.+)({#.+#})(.+%}) -> m.group(1) |{%XX | m.group(2) |{#YY#}| m.group(3) | ZZ%}|
{%(.+\s)({#.+#})(\s.+)%} -> m.group(1) |XX | m.group(2) |{#YY#}| m.group(3) | ZZ|
Exercise 3: With the PATTERN of exercise 2, what do you get if LINE is
blah {%XX {#YY#}, ZZ%} more blah
? Note the comma after {#YY#}. Try different things for XX, etc.
{%(.+\s)({#.+#}.*)(\s.+)%} -> m.group(1) |XX | m.group(2) |{#YY#},| m.group(3) | ZZ|
newstring = re.sub(pattern, FUNCTION, string)
This is the form.
Dear Jim, I am stuck with the file change_test.py. I am thinking about how to use this form and the pattern {%.*?%}. For now I have only put in the function from test6.py. Could you give me some ideas regarding next steps, please?
If you want to fiddle with regular expressions to understand how they capture the strings, you can try https://regexr.com/
Dear Dhaval, Thank you very much!
@AnnaRybakovaT Based on your test6/result6 -- looks like you found a solution with re.sub.
I like the use of \s
and \S
-- I had not thought of that idea.
When I tried python change_test.py temp_pw_0.txt temp_change_test.yxy
, the program was
taking too long.
So I made change_test1.py , which
[[...]]
) so we can better see
what the replacement is doing. Look at change_test1.txt.
You'll see that the replacement is changing many lines that it should not.
e.g. {%XX%} {#YY#} {%ZZ%}
is being changed.
Tested this further with test5.py
python test5.py "{%(.+\S)(.*\s?{#.+#}.*\s)(.+)%}" " ... {%italic text%} {#sanskrit text#} blah {%more italic text%}"
LINE | ... {%italic text%} {#sanskrit text#} blah {%more italic text%}|
PATTERN |{%(.+\S)(.*\s?{#.+#}.*\s)(.+)%}|
m.group(0) |{%italic text%} {#sanskrit text#} blah {%more italic text%}|
PATTERN has 3 match groups
m.group(1) |italic text%}|
m.group(2) | {#sanskrit text#} blah {%more italic |
m.group(3) |text|
Suggestion: Maybe [^%]
can solve this problem ?
Experiment some (using test5) to revise your pattern.
Then, make a change_test2.py that uses your revised pattern and/or repl.
Does output look right now ? (We'll remove the [[...]]
later)
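A sketch of why [^%] might help here: '.' can run across '%}' and '{%', while [^%] cannot step over a '%', so each group stays inside a single italic span. Both patterns below (loose and strict) are illustrative assumptions, not the pattern used in change_test1.py, and the sketch assumes '%' never occurs inside the italic or Devanagari text itself.

```python
import re

line = ' ... {%italic text%} {#sanskrit text#} blah {%more italic text%}'
mixed = 'blah {%XX {#YY#}, ZZ%} more blah'

# '.+' may cross span boundaries and glue separate italic spans together.
loose = r'\{%(.+)(\{#.+#\})(.+)%\}'
# '[^%]+' stops at the first '%', so a match cannot leave its italic span.
strict = r'\{%([^%]+)(\{#[^%]+#\})([^%]+)%\}'

print(re.search(loose, line) is not None)   # True: an unwanted match
print(re.search(strict, line))              # None: already-correct text is left alone
print(re.search(strict, mixed).groups())    # ('XX ', '{#YY#}', ', ZZ')
```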
Suggestion: Maybe
[^%]
can solve this problem ?
Dear Jim, Please, help...
I would like to use [^%] in the pattern,
either in group1 or after group1 (I am just making tests), because I don't want the part {%italic text%} to match group1 of this pattern. But it doesn't work:
$ python test5.py "{%(.+[^%][^}]\S)(.*\s?{#.+#}.*\s)(.+)%}" " ... {%italic text%} {#sanskrit text#} blah {%more italic text%}"
LINE | ... {%italic text%} {#sanskrit text#} blah {%more italic text%}|
PATTERN |{%(.+[^%][^}]\S)(.*\s?{#.+#}.*\s)(.+)%}|
m.group(0) |{%italic text%} {#sanskrit text#} blah {%more italic text%}|
PATTERN has 3 match groups
m.group(1) |italic text%}|
m.group(2) | {#sanskrit text#} blah {%more italic |
m.group(3) |text|
Maybe we can avoid this problem another way: if we exclude from the beginning all italic cases without Sanskrit text, {%italic text%}, then we construct one pattern only for strings {%italic text {#sanskrit text#} more italic text%}. Probably for this reason you suggested this method above: newstring = re.sub(pattern, FUNCTION, line)
I don't know if it makes sense (since the program test7.py with these functions doesn't run):
def FUNCTION_new_pattern(line):
    pattern = r'{%(.+\S)(.*\s?{#.+#}.*\s)(.+)%}'
    repl = r'{%\1%}\2{%\3%}'
    newline = re.sub(pattern,repl,line)
def change_one_line(line):
    pattern = r'{%.*?%}'
    newline = re.sub(pattern, FUNCTION_new_pattern, line)
    return newline
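One likely reason the two functions above do not run: when repl is a function, re.sub calls it with a match object rather than a string, and uses whatever the function returns as the replacement. A hedged sketch of the same idea with those two points adjusted, keeping the pattern and repl from the post; whether this covers every case in temp_pw_0.txt has not been checked here.

```python
import re

def FUNCTION_new_pattern(m):
    """repl function for re.sub: m is a match object, not a string."""
    italic = m.group(0)                      # one whole {%...%} span
    pattern = r'{%(.+\S)(.*\s?{#.+#}.*\s)(.+)%}'
    repl = r'{%\1%}\2{%\3%}'
    return re.sub(pattern, repl, italic)     # the replacement must be returned

def change_one_line(line):
    pattern = r'{%.*?%}'                     # each italic span, non-greedy
    return re.sub(pattern, FUNCTION_new_pattern, line)

print(change_one_line('1) {%erfreuen. {#anumodita#} erfreut.%} ddddd'))
# 1) {%erfreuen.%} {#anumodita#} {%erfreut.%} ddddd
```

Because the inner pattern only ever sees one italic span at a time, text such as {%italic text%} {#sanskrit text#} blah {%more italic text%} is left unchanged.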
Solve the problem mentioned in https://github.com/sanskrit-lexicon/csl-orig/issues/1033.