sanskrit-lexicon / PWK

Sanskrit-Wörterbuch in kürzerer Fassung, 7 Bände Petersburg 1879-1889

remove sanskrit italics #91

Closed funderburkjim closed 1 year ago

funderburkjim commented 1 year ago

Solve the problem mentioned in https://github.com/sanskrit-lexicon/csl-orig/issues/1033.

funderburkjim commented 1 year ago

Get latest version of this pwk repository from github:

funderburkjim commented 1 year ago

Similarly, get up-to-date local installation of sanskrit-lexicon/csl-orig repository.

funderburkjim commented 1 year ago

@AnnaRybakovaT I have prepared a 'sample' for the change. It just changes 'Pron.' to 'Pronoun'. Files are in the pwk/pwkissues/issue91 folder. Once you are set up as above, try

python change_sample.py temp_pw_0.txt tempanna_change_sample.txt
# then
diff change_sample.txt tempanna_change_sample.txt
# there should be no output  (meaning the files are the same)

At this point, make a copy:

cp change_sample.py change_1.py

Then modify the change_1.py to do what we really want to do here.

python change_1.py temp_pw_0.txt change_1.txt

When done, modify readme.txt to say what was done. Then push the revised pwk repository.

Good luck!

funderburkjim commented 1 year ago

@AnnaRybakovaT I got a notification of an email from you from this issue, part of which said that you got an error when trying python change_sample.py temp_pw_0.txt tempanna_change_sample.txt because the file temp_pw_01.txt was not found.

Looks like I forgot to say something.

Here's what to do:

funderburkjim commented 1 year ago

Oops. I goofed again. Above, I should have said copy that file to your copy of pwk/pwkissues/issue91/temp_pw_0.txt

I think my comments in readme.txt of issue91 directory are correct regarding temp_pw_0.txt.

I did make one change in readme.txt in pwk/pwkissues/issue91, so please pull pwk again first.

Jim

AnnaRybakovaT commented 1 year ago

I got a notification of an email from you from this issue, part of which said that you got an error when trying

Dear Jim, Yes, in the beginning I had this problem but after reading the readme.txt (better to do it always first of all) it was solved.

AnnaRybakovaT commented 1 year ago

So, I need help. After a long pause I was trying to refresh my knowledge, but it was not enough. Here is my logical chain: 1) there is a correction which has to be made

old: <div n="1">— 1) {%erfreuen. {#anumodita#} erfreut.%}
new: <div n="1">— 1) {%erfreuen.%} {#anumodita#} {%erfreut.%}

2) To identify lines with italic Devanagari we can use regular expressions and metacharacters. For searching you used this expression: "{%[^%]+{#". Could you explain each symbol to me, please? As I see it, there are two conditions, {%[^% and {#. I can't understand the symbol "[" at all. As far as I know, the symbol "^" means 'starts with'.

3) Is it possible to make the changes just with line.replace or re.sub (if we use metacharacters to specify where and how exactly we want to make changes)?

funderburkjim commented 1 year ago

OK. This is a question of how to do string replacements with the 're' (regular expression) module of Python.

Regular expressions can be quite complicated, and our current task is moderately advanced. Let's use the current task as motivation for your becoming self-sufficient with regular expressions in general, and more specifically in Python.

I assume you know the basics of regular expressions and how to use re.search, re.sub, re.findall in Python?
There are many online resources with beginner tutorials on Python regular expressions. For example, https://www.w3schools.com/python/python_regex.asp .

In a regular expression pattern, [X] usually matches one character from a set of characters. For example, [dh] is a set of two specific characters.

X can also be written to match a range of characters: [b-f] matches 5 characters. And there are a few other details regarding how X is interpreted.

The ^ character does usually mean 'starts with' in a regular expression pattern. However, ^ has a different meaning when it is the first character inside the brackets of a character set: [^X] matches any single character except those in X. So [^dh] matches any character except 'd' and 'h', and [^%] matches any character except '%'.
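The points about [X], ranges, and the two meanings of ^ can be tried out with a short sketch like the following. The sample strings are chosen only for illustration; the last pattern is the search expression quoted earlier in this thread.

```python
import re

# [dh] matches exactly one character that is 'd' or 'h'.
print(re.findall(r'[dh]', 'dharma'))               # ['d', 'h']

# [b-f] matches one character in the range b, c, d, e, f.
print(re.findall(r'[b-f]', 'abcdefg'))             # ['b', 'c', 'd', 'e', 'f']

# Outside brackets, ^ anchors the match at the start of the string.
print(bool(re.search(r'^guru', 'guru teacher')))   # True
print(bool(re.search(r'^guru', 'the guru')))       # False

# As the first character inside brackets, ^ negates the set.
print(re.findall(r'[^dh]', 'dharma'))              # ['a', 'r', 'm', 'a']

# {%[^%]+{# reads: '{%', then one or more non-% characters, then '{#'.
line = '{%erfreuen. {#anumodita#} erfreut.%}'
m = re.search(r'{%[^%]+{#', line)
print(m.group(0))                                  # {%erfreuen. {#
```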

Symbolically, we can represent our task approximately as {%X{#Y#}Z%} -> {%X%}{#Y#}{%Z%}

When you're ready, we'll think together further about our task. Questions always welcome at any point.

AnnaRybakovaT commented 1 year ago

{%X{#Y#}Z%} -> {%X%}{#Y#}{%Z%}

Dear Jim, Thank you very much for the explanations. I realised my misunderstanding regarding the set of characters [ ].

If I want to use re.search, I know what exactly we would like to replace: {%X{#Y#}Z%} -> {%[^%]+{#.+#}

BUT I have problems with the second component {%X%}{#Y#}{%Z%} -> ?

newline = re.sub(r'{%[^%]+{#.+#}', '?', line)

In the beginning I wanted to use one more regular expression (something like this): ? -> {%.%}+{#.+#}{% but it doesn't work

Could you kindly give me some advice?

funderburkjim commented 1 year ago

A name for the 2nd argument of re.sub (the one you call '?') seems to be the repl argument. A name for the 1st argument might be the 'match pattern' or just the pattern argument. A near solution to our problem can be found by using two features:

Suppose we wanted to change 'pXt' to 'bXt' in any string, where X is a vowel. So pit -> bit, pat -> bat, etc.

import re
pattern = 'p([aeiou])t'
repl = r'b\1t'   # r means 'raw' string  See note below
old = 'Please pet the dog'
new = re.sub(pattern,repl,old)
print(new)  # Please bet the dog
old1 = 'Where is the pit?'
new1 = re.sub(pattern,repl,old1)
print(new1) # Where is the bit?

The '\1' in repl refers to the first matching group in pattern. In our example there is only one matching group. So what is the reason for the r in repl? Try the example without the 'r'. You will see that the vowel is not properly represented in 'new'. The reason is that 'repl' is a Python string, and in a Python string the backslash normally has a special meaning. For instance, '\t' represents the tab character. But in our 'repl' string, we want to turn off this special meaning. For instance, r'\t' represents a 2-character string (a backslash followed by 't'). This r'X' syntax is referred to as a "raw string" in Python.
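The difference the 'r' makes can be seen directly in a small sketch. The variable names raw_repl and plain_repl are used here only for illustration.

```python
import re

pattern = 'p([aeiou])t'
old = 'Please pet the dog'

raw_repl = r'b\1t'    # 4 characters: b, backslash, 1, t
plain_repl = 'b\1t'   # 3 characters: b, chr(1), t  (\1 is an octal escape)

print(len(raw_repl), len(plain_repl))            # 4 3
print(re.sub(pattern, raw_repl, old))            # Please bet the dog
# Without the r, the vowel is replaced by the control character chr(1).
print(repr(re.sub(pattern, plain_repl, old)))    # 'Please b\x01t the dog'
print(len('\t'), len(r'\t'))                     # 1 2
```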

Try to apply this information to our situation. I think the solution should be ALMOST right.

funderburkjim commented 1 year ago

Try to get pattern and repl to change

'blah {%heavy {#guru#} teacher%}'

AnnaRybakovaT commented 1 year ago

'blah {%heavy {#guru#} teacher%}'

Dear Jim, Please, check this solution:

import re
old = 'blah {%heavy {#guru#} teacher%}'
x = re.findall(r'{%[^%]+{#.+#}', old)
if (x):
  old1 = old
  pattern = '({#.+#})'
  repl = r'%} \1 {%' 
  new = re.sub(pattern,repl,old1)
  print(new) #  blah {%heavy %} {#guru#} {% teacher%}

In general it works. BUT I need your help with our python file. Could you explain where I am wrong:

def make_changes(entries):
 n = 0
 for entry in entries:
  changes = []
  for iline,line in enumerate(entry.datalines):
   x = re.findall(r'{%[^%]+{#.+#}', line)
   if (x):
    line1 = line
    pattern = '({#.+#})'
    repl = r'%} \1 {%'
    newline = re.sub(pattern,repl,line1)
    if newline == line:
     continue
     change = Change(iline,newline)
     entry.changes = changes

$ python change_1.py temp_pw_0.txt change_1.txt
  File "change_1.py", line 28
    newline = re.sub(pattern,repl,line1)
                                      ^
IndentationError: unindent does not match any outer indentation level

funderburkjim commented 1 year ago

Please add/push -- I'll need to see the file.

funderburkjim commented 1 year ago

I should have said 'add/commit/push' . You might also do a 'git status' after 'add', to check that you are committing only what you intend to commit.

AnnaRybakovaT commented 1 year ago

I'll need to see the file

Dear Jim, Now you can see the file change1.py

funderburkjim commented 1 year ago

@AnnaRybakovaT You have tab characters in lines 25,26,27. Maybe your text editor did this, maybe you did it on purpose. Either way, bad idea. That is what caused the IndentationError message which means things did not 'line up' properly.

Recommend you follow the 'one-space' indentation method that I use in this file and other code files. Go ahead and make this change, so at least your change_1.py program runs.
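A quick way to locate tab-indented lines is a small helper like the one below. This is only a sketch; the name lines_with_tab_indent is hypothetical and not part of the repository.

```python
def lines_with_tab_indent(source):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        # The indentation is whatever lstrip() would remove from the left.
        indent = line[:len(line) - len(line.lstrip())]
        if '\t' in indent:
            bad.append(lineno)
    return bad

# Line 3 of this sample is indented with a tab, the others with one space.
sample = "def f(x):\n if x:\n\treturn 1\n return 0\n"
print(lines_with_tab_indent(sample))   # [3]
```

To check a real file, pass its text, for example lines_with_tab_indent(open('change_1.py').read()).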

When you get past the indentation problem, there will likely still be some problems with your code, although I'm not sure how the problems will manifest themselves. Struggle a bit with a solution. Do a few exercises on re.findall, re.sub, re.search such as at that w3schools link, and perhaps some other online resource that you find.

When you're more comfortable with what these re functions do, come back to our problem.

You might try to learn how to do small code tests, where the problem area is isolated from its current rather complex environment within change_1.py.

When you're ready, upload further test code and/or new change_1.py and describe where stuck in a comment here.

funderburkjim commented 1 year ago

In your test programs, use a function 'change_line(old)' to do the work. It will return 'new'. Both old and new are strings. If the line does not have the pattern, the returned 'new' will be the same as 'old'.

You can try out different inputs for 'old' in your test program, and the 'main' part of the test program will print out old and new.

When the change_line function works in your tests, then you can copy it into change_1.py, and then the make_changes function will say newline = change_line(line); if newline == line: continue, else .....
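The test-program layout described above might look like the following sketch. The body of change_line is only a placeholder that reuses the pattern and repl tried earlier in this thread; it is not the final solution. The name make_changes_sketch and the test strings are made up for the demonstration.

```python
import re

def change_line(old):
    """Return the changed line, or 'old' itself if the pattern is absent."""
    # Placeholder rule from earlier in this thread; to be improved.
    if not re.search(r'{%[^%]+{#', old):
        return old
    return re.sub(r'({#.+#})', r'%} \1 {%', old)

def make_changes_sketch(lines):
    """Mimic the loop in make_changes: collect (index, newline) for changed lines."""
    changed = []
    for iline, line in enumerate(lines):
        newline = change_line(line)
        if newline == line:
            continue
        changed.append((iline, newline))
    return changed

# The 'main' part: try several inputs and print old and new.
tests = ['blah {%heavy {#guru#} teacher%}', 'no markup here']
for old in tests:
    new = change_line(old)
    print('old:', old)
    print('new:', new)
```

Keeping change_line separate means new inputs can be added to the tests list without touching the rest of change_1.py.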

AnnaRybakovaT commented 1 year ago

at least your change_1.py program runs

Dear Jim, I have corrected lines 25-27, but my program still doesn't run:

$ python change_1.py temp_pw_0.txt change_1.txt
682619 lines read from temp_pw_0.txt
135788 entries found
Traceback (most recent call last):
  File "change_1.py", line 109, in <module>
    write_changes(fileout,entries)
  File "change_1.py", line 45, in write_changes
    nchange = nchange + len(entry.changes)
AttributeError: 'Entry' object has no attribute 'changes'

AnnaRybakovaT commented 1 year ago

You might try to learn how to do small code tests,

I will try to do it on Monday. Right now I would like to wish you a happy New Year! My best wishes!!!!!!!!

funderburkjim commented 1 year ago

The error is due to an indentation problem at line 38 - that is where 'entry.changes' is set. There are also a couple of other indentation problems.

I have corrected these in file change_1_ejf_01.py.

  1. modify your change_1.py similarly
    • then change_1.py should run.
  2. modify readme.txt to show how you run change_1.py
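The effect of the indentation can be illustrated with a minimal sketch. The Entry and Change classes here are simplified stand-ins (assumptions, not the real classes in change_1.py), and change_line uses the 'Pron.' to 'Pronoun' sample rule. In the make_changes snippet posted earlier in this thread, the lines after 'continue' sit one level too deep, so they can never run and 'entry.changes' is never set.

```python
class Entry:
    """Stand-in for the Entry class in change_1.py (assumed shape)."""
    def __init__(self, datalines):
        self.datalines = datalines

class Change:
    """Stand-in for the Change class in change_1.py (assumed shape)."""
    def __init__(self, iline, newline):
        self.iline = iline
        self.newline = newline

def change_line(line):
    # Placeholder rule for the demo only.
    return line.replace('Pron.', 'Pronoun')

def make_changes(entries):
    for entry in entries:
        changes = []
        for iline, line in enumerate(entry.datalines):
            newline = change_line(line)
            if newline == line:
                continue
            # At loop-body level, NOT indented under the 'if' above.
            changes.append(Change(iline, newline))
        # Set once per entry, after the inner loop, so every entry has it.
        entry.changes = changes

entries = [Entry(['a Pron. b', 'unchanged']), Entry(['nothing here'])]
make_changes(entries)
print([len(e.changes) for e in entries])   # [1, 0]
```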

Happy New Year to you as well! -- Maybe the new year has already started in your time zone?

AnnaRybakovaT commented 1 year ago

Maybe the new year has already started in your time zone?

Dear Jim, Thanks for your congratulations! In Greece the new year started a few hours after I sent this message. I had enough time to make dinner since we celebrated at home.

AnnaRybakovaT commented 1 year ago

You might try to learn how to do small code tests

To check whether my regex functions work, I have written the Python program test1.py (by updating our first Python program, readwrite.py). test1.py reads the lines of test1.txt and writes all lines (including updated lines) to the file results1.txt

AnnaRybakovaT commented 1 year ago

When the change_line function works in your tests, then you can copy it into change_1.py, and then the make_changes function will say newline = change_line(line); if newline == line: continue, else .....

But I have problems with this step. After an hour of unsuccessful tests I need your help. Please check the file change_test.py

AnnaRybakovaT commented 1 year ago

problem at line 38

Thanks! I tried to be attentive, but I missed this error.

funderburkjim commented 1 year ago

love the comments

Really nice comments in test1.py.
They are a big help in knowing your thoughts. Also, the organization of 'main' is excellent.

I have made a new version, which simplifies 'change_lines' so all the work of changing one line is done in a separate change_one_line function

test2.py

This is a refactoring of test1.py.
Reason: To make it easier to debug how we handle each line of input. Note the comments added to readme.txt.

Not sure about findall

in change_one_line, temporarily print out 'x', which is returned by re.findall. Is x ever None? Is this what you expected? Is 'x' ever used otherwise? Is re.findall needed at all?
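As a hint for these questions, the return values of re.findall and re.search can be compared in a short sketch (the test strings are illustrative):

```python
import re

pat = r'{%[^%]+{#.+#}'
hit = re.findall(pat, 'blah {%heavy {#guru#} teacher%}')
miss = re.findall(pat, 'no markup here')

print(hit)                                # ['{%heavy {#guru#}']
print(miss)                               # []
print(miss is None)                       # False: findall always returns a list
print(re.search(pat, 'no markup here'))   # None: search returns None on no match

# re.sub already returns the string unchanged when nothing matches,
# so a separate findall check before re.sub adds nothing.
print(re.sub(r'({#.+#})', r'%} \1 {%', 'no markup here'))   # no markup here
```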

test3.py

Ponder the logic of change_one_line. Make a new version, test3.py. And experiment some with change_one_line. Continue the good documentation comments in revised change_one_line of test3.py, and be sure to add usage comments to readme.txt. When you're done (or stuck) add/commit/push. NOTE: In your commit message, add '#91' -- Do you see what this does in the comments of this issue?

In this case, we still have problems to solve

funderburkjim commented 1 year ago

Note: That temporarily print out 'x' comment should be done in test3.py. Leave test1 and test2 as they are.

funderburkjim commented 1 year ago

Comment on change_test.py.

Your 'change_line' function in change_test is 'grammatically correct' (properly indented, etc.) .

There are several problems in make_changes function. The program fails with error:

  File "C:\xampp\htdocs\sanskrit-lexicon\PWK\pwkissues\issue91\change_test.py",
line 31, in make_changes
    newline = change_line(line)
NameError: name 'line' is not defined

This is because change_test is missing the loop over entry.datalines: for iline,line in enumerate... of change_1.py

When you get change_test to work, let's continue work with test3, etc.

When test99.py 😄 is working properly, then we will be ready to go back and replace 'change_line' with perfected 'change_one_line' function.

DEBUGGING gets complicated, and somewhat hard to discuss. Persevere!

AnnaRybakovaT commented 1 year ago

Note: That temporarily print out 'x' comment should be done in test3.py. Leave test1 and test2 as they are.

Dear Jim, please see the file test3.py. As you mentioned before, re.findall was not a necessary function. To be honest, from the beginning I wanted to use a pattern with two groups, but since I am not so confident in the syntax, I probably made mistakes in some symbols. Since I couldn't find this solution, I started to look for other options, and re.findall and re.search were among them.

AnnaRybakovaT commented 1 year ago

This is because change_test is missing the loop over entry.datalines: for iline,line in enumerate... of change_1.py

When you get change_test to work,

Thank you, now I can run the program, but still without results: this program can't make changes. I did some tests, but unfortunately with the same result.

AnnaRybakovaT commented 1 year ago

Do you see what this does in the comments of this issue?

Yes, I see))

AnnaRybakovaT added a commit that referenced this issue Jan 3, 2023 @AnnaRybakovaT https://github.com/sanskrit-lexicon/PWK/issues/91 - test3.py Batch 4

AnnaRybakovaT commented 1 year ago

Really nice comments in test1.py.

Of course all comments are nice since they belong to you (I have just used almost all your comments from the 1st Python program). Sorry, I do not want to upset you, but my skills are not so high yet.

AnnaRybakovaT commented 1 year ago

Note the comments added to readme.txt.

Dear Jim, I wanted to compare result1.txt and result2.txt with the 'diff' utility (as mentioned in readme.txt). There are NO differences between them and nothing was printed to the terminal. I also tested two more files which surely differ, but NOTHING was printed to the terminal. I just want to know what I am doing wrong:

Rybakova@ST-Rybakova MINGW64 ~/Documents/sanskrit-lexicon/PWK/pwkissues/issue91 (master)
$ git diff result2.txt readme.txt

funderburkjim commented 1 year ago

Of course all comments are nice since they belong to you

This made me laugh, as I didn't recognize these comments as coming from me. Usually my comments are not so good.

funderburkjim commented 1 year ago

git diff result2.txt readme.txt

not git diff file1 file2 use diff file1 file2

diff result2.txt readme.txt will give lots of output, since these two files are completely different.

diff result1.txt result2.txt will give no output, since these two files are identical.

what about git diff?

The 'git' program DOES have a 'diff' option (sort of like 'git add' runs the 'add' option of git). As with all git options, the diff option has many variations (which can be seen by a Google search on 'git diff'). The only variation I understand and use (but only occasionally) is git diff FILENAME (one argument FILENAME).

Here is a description of when 'git diff FILENAME' might be useful. Suppose you make a small change to change_1.py (or result1.txt, any file tracked by git). When you do a 'git status', this changed file will be listed as modified. But you forgot what change you made to the file. So the file is (a) modified but (b) not yet committed. How can you find out exactly what was changed in that file? Answer: git diff FILENAME.

Next comment will walk through an example of git diff FILENAME

funderburkjim commented 1 year ago

make a change to result1.txt and see how this is noticed by git status

I forgot what was changed. Use 'git diff' to find out

Bonus 'git restore'

I really didn't want to make those changes to result1.txt. 'git restore' to the rescue to undo the changes. Notice that in the last 'git status', there was this helpful comment (use "git restore <file>..." to discard changes in working directory)

funderburkjim commented 1 year ago

test3

change_one_line is close to the desired.

test4

I found it awkward to compare individual lines of test1.txt with those of result3.txt. test4.py revises write_lines, so as to make the comparison easier. @AnnaRybakovaT Study write_lines in test4.py (and the minor change to 'main'). I tried to do 'good commenting'. Also wrote usage note in readme.

result4

Next steps for you, @AnnaRybakovaT .

learning how to use a function for 'repl'

newstring = re.sub(pattern, FUNCTION, string) This is the form.

This technique is considered advanced, and I haven't found a good thorough explanation. Here are some preliminary web resources to get you started.

Using a string with backreferences (e.g. '\1') for repl is not quite flexible enough in our situation. Using a function for repl provides enormous flexibility. In our case, we might take pattern as all italic texts: {%.*?%} [Do you understand the '?']. Then re.sub(pattern, replfunc, line) would call replfunc for each italic substring in line; replfunc would make the required changes to that italic substring and return its replacement, and re.sub would then replace the italic substring with that replacement.
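A minimal sketch of the function form follows. The function shout_italics and the sample line are made up for illustration; the function only upper-cases each italic span, to show the mechanics rather than to solve the task. It also shows what the '?' in {%.*?%} does.

```python
import re

def shout_italics(m):
    """repl function: receives a match object, returns the replacement string."""
    inner = m.group(1)                 # text between {% and %}
    return '{%' + inner.upper() + '%}'

line = 'a {%one%} b {%two%} c'

# re.sub calls shout_italics once per match and inserts what it returns.
print(re.sub(r'{%(.*?)%}', shout_italics, line))   # a {%ONE%} b {%TWO%} c

# Greedy .* runs from the first {% to the LAST %}: one big match.
print(re.findall(r'{%.*%}', line))     # ['{%one%} b {%two%}']
# Non-greedy .*? stops at the FIRST following %}: one match per span.
print(re.findall(r'{%.*?%}', line))    # ['{%one%}', '{%two%}']
```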

So, anyway, get started with this using the examples sources given above.
When you're ready, I'll contribute further suggestions.

AnnaRybakovaT commented 1 year ago

Bonus 'git restore'

Thanks! It works))))!

AnnaRybakovaT commented 1 year ago

not git diff file1 file2 use diff file1 file2

diff result2.txt readme.txt will give lots of output, since these two files are completely different.

Thank you very much!

AnnaRybakovaT commented 1 year ago

Study write_lines in test4.py (and the minor change to 'main'). I tried to do 'good commenting'. Also wrote usage note in readme.

Dear Jim, Everything is more than clear (thanks to your nice comments)! Tomorrow I will focus on the replacement function.

AnnaRybakovaT commented 1 year ago

{%.*?%}

Dear Jim, The given above sources are really useful and informative. Many thanks!

Unfortunately I still can't find solution for current task. There are just some ideas:

line = '1) {%erfreuen. {#anumodita#} erfreut.%} ddddd'
pattern = r'{%.*?%}'  # we use ? for non-greedy quantifier

def FUNCTION_new_pattern (pattern):
# I am not sure about re.findall, but the idea to chech if a pattern includes {#.+#}
 x = re.findall(r'{#.+#}', pattern)
 if (x):
  new pattern = r'({%.+)%}({#.+#}){%(.+%})' # I don't know if it will work (my idea - to show 3 groups and new symbols %} {% 
 else:
  new pattern = pattern

newstring =  re.sub(pattern, FUNCTION_new_pattern, line)

funderburkjim commented 1 year ago

@AnnaRybakovaT Try out the test5.py program.

Readme provides basic usage. Exercise 1: Try to match patterns beginning with {% and ending with %}.
What pattern to use so that there is one match group containing the text between {% and %}.

Exercise 2: Find a PATTERN that matches 3 parts of the LINE blah {%XX {#YY#} ZZ%} more blah and similar lines, and has as its 3 match groups XX and YY and ZZ (note the space ending the X group, and the space beginning the Z group).

Exercise 3: With the PATTERN of exercise 2, what do you get if LINE is blah {%XX {#YY#}, ZZ%} more blah ? Note the comma after {#YY#} ? Try different things for XX, etc.

Play around with many PATTERN LINE combos to help you get comfortable with regex pattern matching and match groups.
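For readers without the repository at hand, a helper in the same spirit can be sketched as below. This is a hypothetical stand-in, not the actual test5.py, and the name show_groups is an assumption.

```python
import re

def show_groups(pattern, line):
    """Print the whole match and each match group.
    Returns the list of groups, or None if the pattern does not match."""
    print('LINE |%s|' % line)
    print('PATTERN |%s|' % pattern)
    m = re.search(pattern, line)
    if m is None:
        print('no match')
        return None
    print('m.group(0) |%s|' % m.group(0))
    for i, g in enumerate(m.groups(), start=1):
        print('m.group(%d) |%s|' % (i, g))
    return list(m.groups())

# One possible answer to Exercise 1: a single group between {% and %}.
groups = show_groups(r'{%(.*?)%}', 'blah {%XX {#YY#} ZZ%} more blah')
```

The vertical bars around each value make leading and trailing spaces in a group visible.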

drdhaval2785 commented 1 year ago

@AnnaRybakovaT

If you want to fiddle with regular expressions to understand how they capture the strings, you can try https://regexr.com/

AnnaRybakovaT commented 1 year ago

Play around with many PATTERN LINE combos to help you get comfortable with regex pattern matching and match groups.

Dear Jim, Thanks for the above exercises and test5.py. I have found a pattern which helped to construct the "different lines" in the most optimal way (please see files test6.py and result6.txt).

AnnaRybakovaT commented 1 year ago

Try to match patterns beginning with {% and ending with %}. What pattern to use so that there is one match group containing the text between {% and %}.

({%.*?%}) -> m.group(1) |{%XX {#YY#} ZZ%}|
{%(.*?)%} -> m.group(1) |XX {#YY#} ZZ|
({%.+)({#.+#})(.+%}) -> m.group(1) |{%XX | m.group(2) |{#YY#}| m.group(3) | ZZ%}|
{%(.+\s)({#.+#})(\s.+)%} -> m.group(1) |XX | m.group(2) |{#YY#}| m.group(3) | ZZ|

AnnaRybakovaT commented 1 year ago

Exercise 3: With the PATTERN of exercise 2, what do you get if LINE is blah {%XX {#YY#}, ZZ%} more blah ? Note the comma after {#YY#} ? Try different things for XX, etc.

{%(.+\s)({#.+#}.*)(\s.+)%} -> m.group(1) |XX | m.group(2) |{#YY#},| m.group(3) | ZZ|

AnnaRybakovaT commented 1 year ago

newstring = re.sub(pattern, FUNCTION, string) This is the form.

Dear Jim, I am stuck with the file change_test.py. I am thinking about how to use this form and the pattern {%.*?%}. For now I have only put in the function from test6.py. Could you give me some ideas regarding next steps, please?

AnnaRybakovaT commented 1 year ago

If you want to fiddle with regular expressions to understand how they capture the strings, you can try https://regexr.com/

Dear Dhaval, Thank you very much!

funderburkjim commented 1 year ago

@AnnaRybakovaT Based on your test6/result6 -- looks like you found a solution with re.sub. I like the use of \s and \S -- I had not thought of that idea.

When I tried python change_test.py temp_pw_0.txt temp_change_test.yxy, the program was taking too long.

So I made change_test1.py , which

Look at change_test1.txt. You'll see that the replacement is changing many that it should not. e.g. {%XX%} {#YY#} {%ZZ%} is being changed.

Tested this further with test5.py

 python test5.py "{%(.+\S)(.*\s?{#.+#}.*\s)(.+)%}" " ... {%italic text%} {#sanskrit text#} blah {%more italic text%}"
LINE | ... {%italic text%} {#sanskrit text#} blah {%more italic text%}|
PATTERN |{%(.+\S)(.*\s?{#.+#}.*\s)(.+)%}|
m.group(0) |{%italic text%} {#sanskrit text#} blah {%more italic text%}|
PATTERN has 3 match groups
m.group(1) |italic text%}|
m.group(2) | {#sanskrit text#} blah {%more italic |
m.group(3) |text|

Suggestion: Maybe [^%] can solve this problem ?

Experiment some (using test5) to revise your pattern. Then, make a change_test2.py that uses your revised pattern and/or repl. Does output look right now ? (We'll remove the [[...]] later)
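To see why [^%] might help, compare the pattern tested above with a variant whose groups cannot cross a '%'. The names GREEDY and BOUNDED and the second pattern are assumptions for illustration, one possible direction rather than the agreed solution.

```python
import re

line = ' ... {%italic text%} {#sanskrit text#} blah {%more italic text%}'

# Pattern from the test above: '.' can also match '%' and '}', so group 1
# runs past the end of the first italic span.
GREEDY = r'{%(.+\S)(.*\s?{#.+#}.*\s)(.+)%}'
print(re.search(GREEDY, line).group(1))    # italic text%}

# Variant: [^%{] and [^%] keep groups 1 and 3 inside one {%...%} span.
BOUNDED = r'{%([^%{]*?)(\s*{#[^#]*#}\s*)([^%]*)%}'
print(re.search(BOUNDED, line))            # None: nothing to change here

repl = r'{%\1%}\2{%\3%}'
print(re.sub(BOUNDED, repl, line) == line)                     # True
print(re.sub(BOUNDED, repl, 'blah {%heavy {#guru#} teacher%}'))
# blah {%heavy%} {#guru#} {%teacher%}
```

The bounded pattern still has limits: it splits only at the first {#...#} inside an italic span, and it assumes the Sanskrit text itself contains no '#'.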

AnnaRybakovaT commented 1 year ago

Suggestion: Maybe [^%] can solve this problem ?

Dear Jim, Please, help...

I would like to use [^%] in the pattern, in group1 or after group1 (just making tests), because I don't want the part {%italic text%} to match group1 of this pattern. But it doesn't work:

$ python test5.py "{%(.+[^%][^}]\S)(.*\s?{#.+#}.*\s)(.+)%}" " ... {%italic text%} {#sanskrit text#} blah {%more italic text%}"
LINE | ... {%italic text%} {#sanskrit text#} blah {%more italic text%}|
PATTERN |{%(.+[^%][^}]\S)(.*\s?{#.+#}.*\s)(.+)%}|
m.group(0) |{%italic text%} {#sanskrit text#} blah {%more italic text%}|
PATTERN has 3 match groups
m.group(1) |italic text%}|
m.group(2) | {#sanskrit text#} blah {%more italic |
m.group(3) |text|

Maybe we can avoid this problem another way: exclude from the beginning all italic cases without Sanskrit text, {%italic text%}, and construct one pattern only for strings like {%italic text {#sanskrit text#} more italic text%}. Probably this is why you suggested this method above: newstring = re.sub(pattern, FUNCTION, line)

I don't know if it makes sense (since the program test7.py with these functions doesn't run):

def FUNCTION_new_pattern(line):
 pattern = r'{%(.+\S)(.*\s?{#.+#}.*\s)(.+)%}'
 repl = r'{%\1%}\2{%\3%}'
 newline = re.sub(pattern,repl,line)

def change_one_line(line):
 pattern = r'{%.*?%}'
 newline =  re.sub(pattern, FUNCTION_new_pattern, line)
 return newline
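Two facts about re.sub bear on the functions above: when repl is a function, re.sub passes it a match object (not a string), and the function must return the replacement string. FUNCTION_new_pattern as posted treats its argument as a string and has no return statement. A hedged sketch of one way to restructure it follows; the name split_italic and the use of re.split are assumptions, not the solution adopted in the repository.

```python
import re

def split_italic(m):
    """repl function for re.sub: m is the match object for one {%...%} span."""
    inner = m.group(1)
    if '{#' not in inner:
        return m.group(0)          # no Sanskrit inside: leave the span as is
    # re.split with a capturing group keeps the {#...#} pieces in the result.
    parts = re.split(r'(\s*{#.*?#}\s*)', inner)
    out = []
    for part in parts:
        if part == '':
            continue               # nothing before or after a Sanskrit piece
        if '{#' in part:
            out.append(part)       # Sanskrit piece with its spaces, not italic
        else:
            out.append('{%' + part + '%}')
    return ''.join(out)

def change_one_line(line):
    return re.sub(r'{%(.*?)%}', split_italic, line)

print(change_one_line('1) {%erfreuen. {#anumodita#} erfreut.%} ddddd'))
# 1) {%erfreuen.%} {#anumodita#} {%erfreut.%} ddddd
print(change_one_line('{%italic text%} {#sanskrit text#} blah'))
# {%italic text%} {#sanskrit text#} blah
```

Because the outer pattern is non-greedy, each italic span is handled on its own, so text such as {%XX%} {#YY#} {%ZZ%} that is already correct is left unchanged.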