@AnnaRybakovaT We'll do the work in a new step2a directory. Do a git pull to get the initial form of step2a. You'll be adding files to the step2a directory as the task progresses.
Step2a uses some of the programs of step2, along with the make_change_pc.py program, which will be a variation of the make_change_circumflex.py program of step2.
I initially made a copy of the step2 directory with this variation of the Unix 'cp' (copy) command:
cp -r step2 step2a
Then I removed unneeded files in step2a, and changed the name of the main program (using the Unix 'mv' (move) command):
mv make_change_circumflex.py make_change_pc.py
You'll need a temporary copy of the latest 'md.txt'. One way to get this from csl-orig repository is with the 'curl' command (go ahead and do this). Remember we are in the step2a directory now.
curl https://raw.githubusercontent.com/sanskrit-lexicon/csl-orig/master/v02/md/md.txt -o temp_md_0.txt
@AnnaRybakovaT go ahead and do this.
We'll need a file with the lines of comment6 mentioned above.
@AnnaRybakovaT go ahead and create a new file step2a/pcerrors.txt, cutting and pasting from comment 6.
Be sure all the lines in the file have the same format.
When you've done that, push the md repository. While you're waiting for another suggested step from me, look at the make_change_pc.py program and get comfortable with the flow of activity. There will be several adjustments needed to adapt this to our current task.
Dear Jim, Please see my ideas regarding adjustments of the Python program. I am sorry if I make silly mistakes; just let me try.
First of all, I want to understand the expected result of our program. For the current task it could be a text file with data like this:
; -------------------------------------
; <L>8241<pc>104-a3<k1>WAtkAra<k2>WAtkAra :<pc>104-b2
old <L>8241<pc>104-a3<k1>WAtkAra<k2>WAtkAra
;
new <L>8241<pc>104-b2<k1>WAtkAra<k2>WAtkAra
; -------------------------------------
; <L>8242<pc>104-a3<k1>WAra<k2>WAra :<pc>104-b2
old <L>8242<pc>104-a3<k1>WAra<k2>WAra
;
new <L>8242<pc>104-b2<k1>WAra<k2>WAra
I suppose we can do it using only the file pcerrors.txt as input. In this case:
Could you kindly explain some terms of the program to me: iline, cvrec, outarr, sirec, wordcvs?
@AnnaRybakovaT pcerrors.txt looks fine.
I will do some explaining tomorrow on how we can adapt make_change_pc.py to the current problem. Your description of the desired end result is on target.
One question - have you independently read up on Python classes (and objects) such as at https://www.w3schools.com/python/python_classes.asp ? The programs I write generally use only a very small part of the capabilities of classes; but I find the data encapsulation aspects of classes to be essential for coordinating data from various sources, as we are doing here.
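Briefly, we will use 3 classes: Entry (an entry of md.txt, the lines from <L>... to <LEND>), Pcerror (a line of pcerrors.txt), and Change (a change transaction).
The usage of the program will likely be: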
python make_change_pc.py temp_md_0.txt pcerrors.txt change_1.txt
If you have not already done so, you can generate a copy of the latest md.txt in step2a directory; Follow the method used in step2. Call this temp_md_0.txt.
Look for more comments tomorrow.
@AnnaRybakovaT You can prepare make_change_pc.py as follows:
- In main, delete line sirecs = ... and sirecscv = ...
- Just after entries = ..., insert line exit(1)
Now, the program should run (see usage above), but it won't do anything interesting yet.
Dear Jim, Yes, I have read about classes and objects, but it was only theoretical knowledge. I will refresh my memory of this topic using the above link.
@AnnaRybakovaT Sorry I didn't get further discussion/instruction done today. Will aim for tomorrow.
@AnnaRybakovaT We will modify make_change_pc.py in steps.
Modify the program as indicated above. Check that the classes and functions are:
class Change(object):
def generate_changes(entries,cvrecs):
def get_title(sirecs):
def write_changes(fileout,changes,title):
__main__
Add a line to initialize the Pcerror records, which are computed by parsing pcerrors.txt, and revise some of the comments.
if __name__=="__main__":
    filein = sys.argv[1] # xxx.txt
    filein1 = sys.argv[2] # pcerrors.txt
    fileout = sys.argv[3] # changes
    entries = digentry.init(filein)
    pcrecs = init_pcrecs(filein1)
    exit(1) # temporarily stop program here
    changes = generate_changes(entries,sirecscv)
    title = get_title(sirecs)
    write_changes(fileout,changes,title)
You will revise this soon.
The property names (oldmetaline, etc.) are made up to be easy to remember in our
particular project.
class Pcerror(object):
    def __init__(self,line):
        """
        line example = '<L>8241<pc>104-a3<k1>WAtkAra<k2>WAtkAra :<pc>104-b2'
        parse line into fields and set object properties
          self.oldmetaline: '<L>8241<pc>104-a3<k1>WAtkAra<k2>WAtkAra'
          self.newpc : '104-b2'
        And then use re.sub to construct
          self.newmetaline: '<L>8241<pc>104-b2<k1>WAtkAra<k2>WAtkAra'
        Also, keep a copy of line in the object
        """
        # dummy property values
        self.line = line
        self.oldmetaline = 'TO DO'
        self.newpc= 'TO DO'
        self.newmetaline = 'TO DO'
This simple program constructs a list (recs) of Pcerror objects, one object (record) for each line of pcerrors.txt and returns the list of records. As written, there is a debug section, which helps us decide whether the line is properly parsed.
def init_pcrecs(filein):
    recs=[] # list of Pcerror objects, to be returned
    dbg = True
    with codecs.open(filein,encoding='utf-8',mode='r') as f:
        for line in f:
            line = line.rstrip('\r\n') # remove line-ending character(s)
            rec = Pcerror(line) # parse line and get object
            recs.append(rec) # add this record
    print(len(recs),"records read from",filein)
    if dbg: # print out first 3 records
        for i in range(0,3):
            rec = recs[i]
            print('record',i+1) # why +1 ?
            print(' oldmetaline = %s' % rec.oldmetaline)
            print(' newpc = %s' % rec.newpc)
            print(' newmetaline = %s' % rec.newmetaline)
    return recs
Even though not finished, the program should work now
python make_change_pc.py temp_md_0.txt pcerrors.txt change_1.txt
So far, we have a dummy version of the Pcerror object initialization method (__init__).
Your mission now is to make it work properly, so the object properties are constructed as intended.
We might already have covered enough Python for you to do this -- I'm not sure. I'll leave it as a puzzle for you; Ask for help if you get stuck.
When you're ready, push this MD repo so I can see what you've done.
Dear Jim, for the last 2 days I was away from my laptop. Tomorrow I will do this task according to your instructions.
Dear Jim, I have followed all the above steps and you can see my temporary results. Our modified program is in the file test_make_changes_pc. The only output of running the program is one new directory, __pycache__ (to be honest I have no idea what it is); probably I made a mistake somewhere.
In step B (revise main) I didn't put in the line "pcrecs = init_pcrecs(filein1)" because I got this error: NameError: name 'init_pcrecs' is not defined. Also, above you mentioned: "In main, delete line sirecs = ... and sirecscv = .... Just after entries = ..., insert line exit(1)"
self.line = line
self.oldmetaline = 'TO DO'
self.newpc= 'TO DO'
self.newmetaline = 'TO DO'
Logically I have this solution, but I am not sure whether it works in the case of classes:
class Pcerror(object):
    def __init__(self,line):
        self.line = line
        self.oldmetaline = line.split(" ", 1)
        self.newpc = line.split("<pc>", 3)
        self.newmetaline = re.sub(r':<pc>.*$', newpc, line)
@AnnaRybakovaT In your test program, you need to
There is also a problem with re.sub(r':<pc>.*$', self.newpc, line)
Actually, the problem is with 'self.newpc = line.split(...)' and the previous line also.
Remember that 'split' returns a LIST, not a string. But you want oldmetaline and newpc to be strings. To help you debug, in __init__, put a print statement after each computed variable: print('oldmetaline=',self.oldmetaline), etc.
Later, when you figure out what is going on, you can comment out or delete these print statements.
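The point about 'split' can be seen in a few lines. This is only a small sketch using the sample line from this thread; the variable names parts, oldmetaline and newpc are chosen for illustration and are not taken from make_change_pc.py.

```python
line = '<L>8241<pc>104-a3<k1>WAtkAra<k2>WAtkAra :<pc>104-b2'

# str.split always returns a list of strings, never a single string.
parts = line.split(' :<pc>')
print(type(parts))

# To get strings, index or unpack the list.
oldmetaline, newpc = parts
print('oldmetaline=', oldmetaline)
print('newpc=', newpc)
```

Unpacking into two names works here only because the separator ' :<pc>' occurs exactly once in the line; a line in a different format would raise a ValueError.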
Dear Jim, Thanks a lot! I had error messages about this but I couldn't guess what exactly the problem was.
@AnnaRybakovaT be sure to put back into main:
pcrecs = init_pcrecs(filein1)
exit(1) # temporarily stop program here
So the init_pcrecs function will be run.
Dear Jim, Finally I found one solution, I don't know if it is optimal or not but it works))))
class Pcerror(object):
    def __init__(self,line):
        self.line = line
        self.oldmetaline = re.sub(r"(:<pc>.*$)|("")", "", line)
        self.newpc = re.sub(r"<L>.+:<pc>", "", line)
        self.newmetaline = re.sub(r"<pc>.+<k1>", "<pc>" + self.newpc + "<k1>", self.oldmetaline)
I also made all the replacements, and the program now works.
Re 'why +1 ?'
At first I couldn't guess the answer, but now I understand what we wanted to do and why we use this syntax (so that the records are numbered 1, 2, 3 instead of 0, 1, 2).
Looks like you're almost there!
I was worried about extra spaces in rec.oldmetaline and rec.newmetaline.
So I made a small change in the dbg section of init_pcrecs to put double-quotes around %s
print(' oldmetaline = "%s"' % rec.oldmetaline)
print(' newpc = "%s"' % rec.newpc)
print(' newmetaline = "%s"' % rec.newmetaline)
Now the dbg output shows as
record 1
oldmetaline = "<L>8241<pc>104-a3<k1>WAtkAra<k2>WAtkAra "
newpc = "104-b2"
newmetaline = "<L>8241<pc>104-b2<k1>WAtkAra<k2>WAtkAra "
That extra space at the end needs to be removed. (Reason: it will get in the way when we correlate pcrec.oldmetaline with entry.metaline.)
So revise Pcerror to remove that space.
Then, we'll be ready to work on 'generate_changes'
Dear Jim, To be honest I expected not to see that space (for this reason I used: |("") and as I remember in my test file it worked (but maybe not...). In any case now it should be correct:
self.line = line
self.oldmetaline = re.sub(r"(:<pc>.*$)|(\s)", "", line)
self.newpc = re.sub(r"<L>.+:<pc>", "", line)
self.newmetaline = re.sub(r"<pc>.+<k1>", "<pc>" + self.newpc + "<k1>", self.oldmetaline)
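For comparison, here is a sketch of another way to do the same parsing with a single regular expression, so that the trailing space never appears in the first place. The helper name parse_pcerror_line is hypothetical (it is not part of make_change_pc.py), and the sketch assumes every line of pcerrors.txt has the form '<metaline> :<pc>NEWPC' shown in this thread.

```python
import re

def parse_pcerror_line(line):
    """Return (oldmetaline, newpc, newmetaline) for one pcerrors.txt line.

    Hypothetical helper; assumes the format '<L>..<pc>OLD<k1>.. :<pc>NEW'.
    """
    # group 1: up to and including the first <pc>; group 2: the old pc;
    # group 3: '<k1>...' up to (not including) the space before ':<pc>';
    # group 4: the new pc.
    m = re.match(r'^(<L>.*?<pc>)(.*?)(<k1>.*?)\s*:<pc>(.*?)\s*$', line)
    if m is None:
        raise ValueError('unexpected format: %r' % line)
    prefix, oldpc, rest, newpc = m.groups()
    oldmetaline = prefix + oldpc + rest
    newmetaline = prefix + newpc + rest
    return oldmetaline, newpc, newmetaline

sample = '<L>8241<pc>104-a3<k1>WAtkAra<k2>WAtkAra :<pc>104-b2'
old, pc, new = parse_pcerror_line(sample)
print('oldmetaline = "%s"' % old)
print('newpc = "%s"' % pc)
print('newmetaline = "%s"' % new)
```

Raising an error on a line that does not match is a design choice of this sketch: it makes a badly formatted line in pcerrors.txt visible immediately instead of producing a wrong metaline.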
Ready for next step -- generate_changes
First, modify main so generate_changes will be called, and then the program exits.
pcrecs = init_pcrecs(filein1)
changes = generate_changes(entries,pcrecs)
exit(1) # temporarily stop program here
Here is one way to write generate_changes for our present problem:
def generate_changes(entries,pcrecs):
    changes = [] # computed by this function
    for entry in entries:
        pcrec = get_pcrec_for_entry(entry,pcrecs)
        if pcrec is not None:
            # generate a change object
            change = Change(entry,pcrec)
            changes.append(change)
    print(len(changes),'lines that may need changes')
    return changes
In words: for each entry in our dictionary, use the get_pcrec_for_entry function to find the matching Pcerror record, if any. If there is no such record (i.e., pcrec is None), then no change record is made. If the entry matches one of the pcrecs , then we generate a change.
Note that there are two pieces of code not yet written: get_pcrec_for_entry function and a compatible Change constructor.
Here is a solution for the Change constructor, it is quite simple:
class Change(object):
    def __init__(self,entry,pcrec):
        self.entry = entry
        self.pcrec = pcrec
Here is an incomplete solution for the get_pcrec_for_entry function:
def get_pcrec_for_entry(entry,pcrecs):
    # find which pcrec matches entry, and return that pcrec.
    # If no match is found, return None
    return None
With this incomplete solution, the program runs, but is not useful. It reports:
0 lines that may need changes
@AnnaRybakovaT Your task is to complete the get_pcrec_for_entry function.
When you've answered that, then devise a (simple) python procedure to do the matching. That will be your solution.
One partial check is to run the program (with the improved get_pcrec_for_entry) and see how many changes are reported ('? lines that may need changes').
Of course, ask for hints as needed.
Dear Jim, Unfortunately I can't find a solution without your hints. The longer I try, the more doubts I have. In the beginning, when I compared the attributes of the Entry and Pcerror classes, I thought we should use the record "rec.oldmetaline" or "self.oldmetaline". Probably this idea was wrong, since I was receiving error messages that such names are not defined. Afterwards I paid attention to "entry,pcrecs" (i.e. we are dealing with one entry and many pcrecs). Now you can see my current suggestion. I know it doesn't work, but I hope you will help me find the right direction for solving this task.
@AnnaRybakovaT
Looks like you were close.
We need to compare these two (entry.metaline and pcrec.oldmetaline) for each pcrec in the list pcrecs. If the two match (are the same value) for some pcrec, we return that pcrec. If there is no match for any pcrec, we return None.
Here's a solution:
def get_pcrec_for_entry(entry,pcrecs):
    # find which pcrec matches entry, and return that pcrec.
    for pcrec in pcrecs:
        # compare the metaline in entry to the metaline in pcrec
        if pcrec.oldmetaline == entry.metaline:
            return pcrec
    # At this point, the for loop has ended, without returning a matching pcrec.
    # the next return is at the same indentation as the 'for pcrec...' statement
    return None
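The loop above checks every pcrec for every entry, which is fine for a short pcerrors.txt. As an optional variation, the same matching can be sketched with a dictionary keyed on oldmetaline. The classes FakeEntry and FakePcerror and the two function names are stand-ins invented for this sketch so that it runs on its own; they are not the classes of digentry.py or make_change_pc.py.

```python
class FakeEntry(object):
    """Stand-in for the Entry class of digentry.py (only metaline is used)."""
    def __init__(self, metaline):
        self.metaline = metaline

class FakePcerror(object):
    """Stand-in for the Pcerror class (only two properties are used)."""
    def __init__(self, oldmetaline, newpc):
        self.oldmetaline = oldmetaline
        self.newpc = newpc

def make_pcrec_index(pcrecs):
    # Build the dictionary once: key = oldmetaline, value = the pcrec itself.
    return {pcrec.oldmetaline: pcrec for pcrec in pcrecs}

def get_pcrec_for_entry_indexed(entry, index):
    # dict.get returns None when the key is absent, like the loop version.
    return index.get(entry.metaline)

pcrecs = [FakePcerror('<L>8241<pc>104-a3<k1>WAtkAra<k2>WAtkAra', '104-b2')]
index = make_pcrec_index(pcrecs)
hit = get_pcrec_for_entry_indexed(
    FakeEntry('<L>8241<pc>104-a3<k1>WAtkAra<k2>WAtkAra'), index)
miss = get_pcrec_for_entry_indexed(
    FakeEntry('<L>1<pc>001-a1<k1>a<k2>a'), index)
print(hit.newpc, miss)
```

The trade-off is one extra step (building the index) in exchange for a single lookup per entry; the loop version is easier to read and is perfectly adequate here.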
Dear Jim, Many thanks. For sure, I would not have found this solution without your help. I had used pcrec.oldmetaline as one attribute but I had no idea about entry.metaline. The current result: 72 lines that may need changes
@AnnaRybakovaT Now we have the array 'changes' of Change objects, and are ready to adapt the code to write those changes to an output file.
Your comment above describes what the output should be.
Here are hints for you to get started.
- title=get_title(sirecs)
- def get_title(sirecs): outarr.append('slp... thru the line just before return outarr. You can add some descriptive title lines if you want.
- comment out (with triple-quotes) the lines from iline = change.iline through outarr.append('%s new ...')
Then try running the program. You should get output, but it won't be quite what is needed.
At this point, your task will be to develop a replacement for the commented out lines of write_changes. You may need some hints here. Will await questions.
Re '- comment out (with triple-quotes) the lines from iline = change.iline through ...'
Dear Jim, Of course, I have some questions. First of all, as I understand it, triple-quotes means this: ''' ? I put it before the line "iline = change.iline" and after the line "outarr.append('%s new ..." but I couldn't run the program because I got this error message:
File "test_make_change_pc.py", line 87
outrecs.append(outarr)
^
IndentationError: unexpected indent
@AnnaRybakovaT The message means the indentation (number of spaces) is wrong somehow. Please push so I can examine the file to see exactly what is wrong.
Re 'to develop a replacement for the commented out lines of write_changes'
Dear Jim, I have made some corrections using the ideas below.
So, what we need: 1) a record from the file pcerrors.txt. I wanted to use this syntax, but I receive error messages: outarr.append('; ' %pcrec) or outarr.append('; ' + pcrec)
2) we need pcrec.oldmetaline 3) and pcrec.newmetaline. Unfortunately my syntax is not correct here either.
Could you kindly help me?
@AnnaRybakovaT You are almost there!
Problem with outarr.append('; ' %pcrec)
Two problems:
- a %s is needed in the string '; ', resulting in outarr.append('; %s' %pcrec)
- outarr.append('; %s' %metaline)
Problem with outarr.append('%s old %s' %oldmetaline)
The Python interpreter would also complain about this because there are two %s formatting instructions in '%s old %s', but only one thing to format, oldmetaline. In this case you need two things to format, and the syntax would be
outarr.append('%s old %s' %(SOMETHING,oldmetaline))
This is syntactically correct, but clearly SOMETHING is not right yet. What is that SOMETHING?
It is the line number, in the input file for the digitization (e.g. temp_md_0.txt), of the line which needs to be changed. So let's call it 'lnum' instead of SOMETHING ('lnum' for 'line number').
So now we have outarr.append('%s old %s' %(lnum,oldmetaline))
But we have not yet established a value for lnum. Where do we get that line number for our change record? Well, it sounds like an attribute of the 'entry' object associated with this change. (Refer now to the Entry class within digentry.py.)
The value of our lnum is in fact the value of the linenum1 attribute of our entry, so insert the line lnum = X.Y
(You can find what X and Y are!)
You'll need a similar adjustment to outarr.append('%s new %s' %newmetaline).
With these few changes your program should now run to completion.
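To see the %-formatting rules in isolation, here is a minimal sketch that builds the lines of one change transaction. The function name change_lines and its parameters are made up for this sketch; in the real program these lines are built inside write_changes, and the values come from the change object.

```python
def change_lines(lnum, metaline, oldmetaline, newmetaline):
    """Return the output lines for one change (illustrative helper only)."""
    outarr = []
    outarr.append('; -------------------------------------')
    # one %s in the string, one value after the % operator
    outarr.append('; %s' % metaline)
    # two %s in the string, so the values must be given as a tuple
    outarr.append('%s old %s' % (lnum, oldmetaline))
    outarr.append(';')
    outarr.append('%s new %s' % (lnum, newmetaline))
    return outarr

lines = change_lines(
    46346,
    '<L>8241<pc>104-a3<k1>WAtkAra<k2>WAtkAra',
    '<L>8241<pc>104-a3<k1>WAtkAra<k2>WAtkAra',
    '<L>8241<pc>104-b2<k1>WAtkAra<k2>WAtkAra')
for out in lines:
    print(out)
```

The number 46346 is the line number that appears in the sample output later in this thread; here it is passed in directly rather than read from an entry.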
Re 'Please push so I can examine the file to see exactly what is wrong.'
Please check the file test_make_change_pc_1.py
Re 'With these few changes your program should now run to completion.'
Dear Jim, Now the program runs.
Many thanks for your comments above; I understood some of my gaps. I have also realized one important misunderstanding regarding the 1st line of our output file (change_1.txt). Now it is a line (the metaline) of the entry file (temp_md_0.txt)
; <L>8241<pc>104-a3<k1>WAtkAra
but I thought it had to be a line from pcerrors.txt
<L>8241<pc>104-a3<k1>WAtkAra<k2>WAtkAra :<pc>104-b2
One more comment. Maybe it is not important, but I would like to let you know. When I started pushing the results I got this warning message:
$ git add .
warning: LF will be replaced by CRLF in deva_iast_comp/step2a/change_1.txt.
The file will have its original line endings in your working directory
Re 'LF will be replaced...'
We have avoided talking about how line breaks are represented in different operating systems.
You'll note that in write_changes, we have f.write(out+'\n'). The \n is Python's way of representing LF (linefeed, also called 'newline'); it is ASCII code 10 (hex '0a'). Newline is the default Linux way of designating line breaks.
If you create a new text file with your text editor on the Windows operating system, then the line breaks will normally be designated by 'CRLF', in Python '\r\n' ('\r' represents 'carriage return'; it is ASCII code 13 (hex '0d')).
Classic Mac OS used just CR, while modern macOS uses LF like Linux.
Now in Git, lines are important -- When Git keeps track of changes to text files, it does it line by line. So Git has to make a choice of how to recognize line breaks. There is a git configuration parameter that is involved. Your git configuration parameter might differ from mine.
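Our Python program created change_1.txt with 'LF' as the line break. Your git configuration is most likely set up to convert line endings (the core.autocrlf setting, which is commonly set to true on Windows). With that setting, git stores the lines with LF in the repository (in the .git directory) and writes CRLF when it checks the file out into your working directory. Since git will replace 'LF' with 'CRLF' the next time it writes the file, it gave you a WARNING.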
One final point -- In all the programs that we write, when we read the lines of a text file, you will always see a statement that strips both CR and LF (e.g., line = line.rstrip('\r\n')).
This protects the program from differences in the line-break characters in the file being read.
Final takeaway for you -- The warning is not something you need to worry about (at least not thus far).
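A small sketch of why line.rstrip('\r\n') is safe for all three line-break styles mentioned above. The three sample strings are invented for the illustration.

```python
lf_line = 'abc\n'      # LF line break
crlf_line = 'abc\r\n'  # CRLF line break
cr_line = 'abc\r'      # CR line break

# rstrip('\r\n') removes any trailing run of CR and/or LF characters,
# so all three variants end up as the same string.
stripped = [s.rstrip('\r\n') for s in (lf_line, crlf_line, cr_line)]
print(stripped)

# Other trailing characters, such as a space, are left alone.
print(repr('abc \r\n'.rstrip('\r\n')))
```

Note that the argument of rstrip is a set of characters to remove, not a fixed suffix, which is why one call covers every combination.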
Re 'I thought it had to be a line from pcerrors.txt'
Note that the line in question starts with a semicolon -- thus it is a comment. But what is the significance of a line being a comment?
Well the main purpose in life of our change_1.txt file is to serve as an input to updateByLine.py.
Give it a try:
python updateByLine.py temp_md_0.txt change_1.txt temp_md_1.txt
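This will construct temp_md_1.txt from temp_md_0.txt by applying the changes in change_1.txt.
Each pair of 'N old X' and 'N new Y' lines is a 'change transaction'.
After you run the program, open the 3 files in a text editor, and answer to yourself how temp_md_1.txt is indeed constructed from the two inputs.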
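To make the idea of a change transaction concrete, here is a rough sketch of how such a file could be applied. It is NOT the code of updateByLine.py. It rests on assumptions made only for this illustration: lines starting with ';' are comments and are skipped, N is a 1-based line number, line N must equal X before it is replaced, and Y becomes the new line N. The function name apply_changes is hypothetical.

```python
import re

def apply_changes(lines, change_lines):
    """Return a copy of lines with the change transactions applied.

    Assumed semantics (not taken from updateByLine.py): ';' lines are
    comments; 'N old X' must match line N; 'N new Y' replaces line N.
    """
    newlines = list(lines)   # leave the input list untouched
    pending_old = None       # line number of an 'old' awaiting its 'new'
    for cline in change_lines:
        if cline.startswith(';') or cline.strip() == '':
            continue
        m = re.match(r'^(\d+) (old|new) (.*)$', cline)
        if m is None:
            raise ValueError('cannot parse change line: %r' % cline)
        lnum, kind, text = int(m.group(1)), m.group(2), m.group(3)
        if kind == 'old':
            if newlines[lnum - 1] != text:
                raise ValueError('line %d does not match old text' % lnum)
            pending_old = lnum
        else:
            if pending_old != lnum:
                raise ValueError("'new' line %d has no 'old' line" % lnum)
            newlines[lnum - 1] = text
            pending_old = None
    return newlines

original = ['first line', '<L>1<pc>1-a<k1>x<k2>x', 'last line']
changes = ['; ---------', '; <L>1<pc>1-a<k1>x',
           '2 old <L>1<pc>1-a<k1>x<k2>x', ';',
           '2 new <L>1<pc>1-b<k1>x<k2>x']
updated = apply_changes(original, changes)
print(updated)
```

Checking the 'old' text before replacing is what makes a change file safe to apply: if the input file has shifted by even one line, the sketch stops with an error instead of changing the wrong line.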
@AnnaRybakovaT Your task in this project is now complete.
I will now install the changes to md into csl-orig. Here's how:
cp temp_md_1.txt /c/xampp/htdocs/cologne/csl-orig/v02/md/md.txt
cd /c/xampp/htdocs/cologne/csl-pywork/v02
sh generate_dict.sh md ../../md
sh xmlchk_xampp.sh md
cd /c/xampp/htdocs/cologne/csl-orig/v02
git pull # in case Dhaval or someone has modified csl-orig, bring mine up to date
git add md/md.txt
git commit -m "md: page column corrections. Ref: https://github.com/sanskrit-lexicon/MD/issues/7"
git push
The installation is complete. You see a reference above to csl-orig commit. You can see the changes there.
Also, If you look at a Cologne display for md, you can see the changes have been made, e.g. the first one
; -------------------------------------
; <L>8241<pc>104-a3<k1>WAtkAra
46346 old <L>8241<pc>104-a3<k1>WAtkAra<k2>WAtkAra
;
46346 new <L>8241<pc>104-b2<k1>WAtkAra<k2>WAtkAra
The display shows the new pc: '104-b2'
@AnnaRybakovaT I've noticed a small error in md that could be corrected in a way similar to the approach we've used in this 'pc error' project.
If you're ready for it, I'll write some instructions to get you started.
Dear Jim, First of all, many thanks for your very detailed comments above. Step by step I am learning more points about which I had no idea before (like comments, which are marked by ";").
I am ready for the next task. It is difficult to make any plans now, but if everything is fine the tourist season will start in Greece in April. In that case I will take a pause and come back in November.
We aim to adapt programs from step2 to correct some 'pc' (page-column) errors in the metalines of entries in md.txt
The approach will be to use the data in this comment of #6 to generate a file of change transactions.
Some additional Python along the way.
Details to follow in further comments.