Closed funderburkjim closed 2 years ago
The framework used in the 'pcerror' corrections (#7) can be adapted to this page error situation.
@AnnaRybakovaT The first step for you is to create a new 'step2b' directory, similar to the step2a directory. Everything will be done in this step2b directory. And needed step2b files can be adapted from the corresponding step2a files.
First, modify readme.txt and in particular decide how a program to generate the changes should be invoked. In this case, we don't need an analog of the 'pcerrors.txt' file.
Then, get a fresh copy of the latest md digitization, as you did with step2a.
Then make your python step2b program, which will be an adaptation of the test_make_change_pc.py program you developed in step2a.
0 records written to ...
When you've got it running, we'll develop the code needed to generate the changes for our current italicized page errors.
The instructions above are intentionally vague, because I want you to learn the important task of how to adapt a framework (like that of step2a) to a new situation.
But of course I'll provide more specific hints if you get stuck with this setup step.
I'm your fan.
Dear Jim, To be honest, I spent some hours trying to find a solution; I probably went down the wrong path and just wasted time. So I really need your hints. Here are my ideas.
First of all, our file change2b.txt could have this structure:
; -------------------------------------
; {%[Page210-2]%}
85319 old {%[Page210-2]%}
;
85319 new [Page210-2]
Since we don't need any extra data, I suppose our program to generate the changes can be invoked by:
python make_change_pc2b.py temp_md_0.txt change_2b.txt
I have partly modified our program file (make_change_pc_2b.py). As far as I know, in the generate_changes function we should use a regular expression. The easiest option is re.search (please see this test example). But when I run our program I get this error message:
x = re.search(r"\{\%\[Page", entry)
File "C:\Users\Rybakova\AppData\Local\Programs\Python\Python38\lib\re.py", line 201, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
Maybe I am doing something completely wrong. Please help me find the correct way.
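A note on the traceback above: `re.search` accepts only a string (or bytes) as its second argument, so this TypeError is what Python raises when something else, such as an object, is passed. The sketch below reproduces the error with a hypothetical stand-in class (`FakeEntry` is not from digentry.py; it only assumes that an entry is an object holding a list of text lines) and shows the search working once it is applied to an individual line.

```python
import re

class FakeEntry:
    """Hypothetical stand-in for an entry: an object, not a string."""
    def __init__(self, datalines):
        self.datalines = datalines

entry = FakeEntry(["{%[Page210-2]%}", "some other text"])
pattern = r"\{%\[Page"

try:
    re.search(pattern, entry)      # passing the object itself, not a line
    error_message = None
except TypeError as err:
    error_message = str(err)       # the same kind of message as in the traceback

# Searching each line (a str) works as intended.
matching = [line for line in entry.datalines if re.search(pattern, line)]

print(error_message)
print(matching)
```

If this is indeed the cause, the fix is to apply the pattern to each line of the entry rather than to the entry itself.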
@AnnaRybakovaT Good progress. I especially like the way you show a mock-up of what the output should look like.
Please do a pull, as I added a 'show_entry.py' program (see discussion below). Two minor quibbles:
def generate_changes(entries):
 changes = [] # computed by this function
 for entry in entries:
  pass
 print(len(changes),'lines that may need changes')
 return changes
This will return an empty list, and the program will run, but the output will only contain the title.
You need to know how an 'entry' object corresponds to a set of lines in md.txt.
Here's an illustration that may help.
An entry object corresponds to the sequence of lines in md.txt starting with a metaline (line beginning with <L>) and ending with the line beginning with <LEND>.
entries[0] corresponds to the first entry, entries[1] to the second entry, etc.
Here is an illustration of the correspondence for the first entry (entries[0]). First, the lines in md.txt with the line-numbers
28 <L>1<pc>001-1<k1>a<k2>a<h>1
29 {#a#}¦a, {%pn.%} {%root used in the inflexion of%} idam
30 {%and in some particles%}: a-tra, a-tha.
31 <LEND>
Now, entries[0] corresponds to these lines. Here's how (with entry = entries[0]):
entry.metaline <-> <L>1<pc>001-1<k1>a<k2>a<h>1
entry.datalines[0] <-> {#a#}¦a, {%pn.%} {%root used in the inflexion of%} idam
entry.datalines[1] <-> {%and in some particles%}: a-tra, a-tha.
entry.lend <-> <LEND>
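The correspondence above can be sketched in a few lines of Python. This is only an illustration: `Entry` and `parse_entries` here are simplified, hypothetical stand-ins, not the actual code in digentry.py, and they assume only the three attributes named in the illustration.

```python
class Entry:
    """Simplified stand-in for an entry (assumed layout, not digentry.py itself)."""
    def __init__(self, lines):
        self.metaline = lines[0]       # the line beginning with <L>
        self.datalines = lines[1:-1]   # everything between metaline and <LEND>
        self.lend = lines[-1]          # the line beginning with <LEND>

def parse_entries(lines):
    """Group lines into entries; lines outside <L> ... <LEND> are skipped."""
    entries = []
    current = None                     # lines of the entry being collected
    for line in lines:
        if line.startswith("<L>"):
            current = [line]
        elif current is not None:
            current.append(line)
            if line.startswith("<LEND>"):
                entries.append(Entry(current))
                current = None
    return entries

# The four lines of md.txt shown in the illustration.
sample = [
    "<L>1<pc>001-1<k1>a<k2>a<h>1",
    "{#a#}¦a, {%pn.%} {%root used in the inflexion of%} idam",
    "{%and in some particles%}: a-tra, a-tha.",
    "<LEND>",
]
entries = parse_entries(sample)
entry = entries[0]
print(entry.metaline)
for line in entry.datalines:
    print(line)
print(entry.lend)
```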
Please take a look at show_entry.py, and try
python show_entry.py 0 temp_md_0.txt temp_entry.txt
and look at the output.
We want to generate changes only for those lines like {%[PageX]%}.
In generate_changes, we need to do a loop over the datalines (like the loop in show_entry).
If, in the loop, 'line' matches our pattern, then we need to generate the new line, generate a change object:
change = Change(metaline,lnum,line,newline)
and append it to the 'changes' array. (The 'metaline' is for a comment string when we write the change.)
The init function in Change class will need to be changed.
And then the write_changes function will have to be changed to properly write each change.
This concludes the first set of hints/suggestions.
Dear Jim, Many thanks for this explanation. I partly understood this from the file digentry.py; now it is absolutely clear. Just one question regarding lines which are not included in entries and are located between them, like the line [Page104b-1]:
<L>8237<pc>104-a3<k1>wulla<k2>wulla
{#wulla#}¦ṭulla, {%m. N.%}.
<LEND>
<H>{#Wa#} ṬH.
[Page104b-1]
<L>8238<pc>104-a3<k1>WakAra<k2>WakAra
{#WakAra#}¦ṭha-kāra, {%m.%} the letter {%th.%}.
<LEND>
Do we also treat those lines like datalines?
I have corrected our program file. Could you kindly check the updated make_change_pc_2b.py and give me one more set of hints?
No, we do not treat lines between an <LEND> and the next <L>... as datalines.
When we construct the list of entries via entries = digentry.init(filein), that init function in digentry.py ignores the in-between lines.
These lines of xxx.txt between entries currently serve no visible purpose either in
If for some reason we did want to change some in-between line, we would have to construct a program differently, or do the change manually by editing xxx.txt.
You're quite close to a solution, but are having problems which maybe could be characterized as 'scope' problems.
In generate_changes, we only want to add a change for lines that start with {%[Page.
  if line.startswith('{%[Page'):
   newline = re.sub(r"(\{\%)|(\%\})", '', line) # I am not sure
   lnum = linenum1 + iline + 1
  # we should mention metaline, but I don't know how WRONG INDENTATION
  change = Change(metaline,lnum,line,newline) # DITTO
  changes.append(change) # DITTO
Your newline and lnum lines are correctly indented under the 'if' to only be executed for lines satisfying the 'if' condition, but the next three lines also should be indented one more space so they also will be executed only for lines satisfying the 'if' condition.
In the lines above, you are uncertain how to mention metaline.
When the program executes Change(metaline,lnum,line,newline), values must have been set for the arguments of Change (i.e. a value must be set for metaline, for lnum, for line, and for newline). So where does the value for 'metaline' come from?
Well, these lines occur within the 'for iline,...' loop, which is also within the 'for entry' loop. So in particular, we can use the 'entry' object to get a value. The 'entry' object has a 'metaline' attribute (refer to the Entry class init method in digentry.py). So entry.metaline is available, and we can set the value of the local variable 'metaline' to be the same as 'entry.metaline':
for entry in entries:
 for iline,line in enumerate(entry.datalines):
  #lnum = linenum1 + iline + 1
  if line.startswith('{%[Page'):
   newline = re.sub(r"(\{\%)|(\%\})", '', line) # I am not sure
   lnum = linenum1 + iline + 1
   # we should mention metaline, but I don't know how
   metaline = entry.metaline # <<< set value for local metaline variable
   change = Change(metaline,lnum,line,newline)
   changes.append(change)
If you run the program as above, then you will see an error message:
lnum = linenum1 + iline + 1
NameError: name 'linenum1' is not defined
This tells you that the local variable linenum1 does not have a value. Do you see where to get a value for linenum1?
With the changes above, you probably have a correctly functioning generate_changes program, which returns a changes list (which in this case has 13 Change objects).
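For reference, here is one way the pieces discussed so far might fit together. It is a sketch under stated assumptions, not the program in the repository: `Entry` and `Change` are hypothetical namedtuple stand-ins, the attribute name `linenum1` (the file line number of the metaline) is assumed, and the sample metaline and its line number are invented so that the page line lands on line 85319 as in the mock-up.

```python
import re
from collections import namedtuple

# Hypothetical stand-ins for the real Entry and Change classes.
Entry = namedtuple("Entry", "metaline datalines lend linenum1")
Change = namedtuple("Change", "metaline lnum line newline")

def generate_changes(entries):
    changes = []
    for entry in entries:
        for iline, line in enumerate(entry.datalines):
            if not line.startswith("{%[Page"):
                continue
            # Drop the italic markers {% and %}, keeping the [Page...] text.
            newline = re.sub(r"\{%|%\}", "", line)
            # Assumption: entry.linenum1 is the line number of the metaline,
            # so datalines[0] sits on the line right after it.
            lnum = entry.linenum1 + iline + 1
            changes.append(Change(entry.metaline, lnum, line, newline))
    print(len(changes), "lines that may need changes")
    return changes

# Invented sample entry: metaline on line 85317, page line on line 85319.
demo_entry = Entry(
    metaline="<L>0<pc>210-1<k1>demo<k2>demo",
    datalines=["{#demo#}¦demo, {%m.%} example.", "{%[Page210-2]%}"],
    lend="<LEND>",
    linenum1=85317,
)
changes = generate_changes([demo_entry])
```

Note that the first dataline keeps its {%m.%} markup, because only lines starting with {%[Page are considered.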
Then the main program calls write_changes(fileout,changes,title).
The only values which you can use in this function are those appearing in the parameter list of the function, namely the value for fileout, the value for changes, and the value for title.
write_changes does a for loop over the changes list: for change in changes:
In this loop, you can make full use of the attribute values of the 'change' object:
i.e., you can use change.metaline, change.lnum, change.line, and change.newline.
But note that there is no value for 'entry' --- This isn't a problem, because the only values we need to generate the lines of outarr for this change are in those four change attribute values.
I'll let you work to modify write_changes.
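As a rough guide, a write_changes along these lines would produce output in the shape of the mock-up earlier in this thread. Everything beyond the parameter list and the four change attributes is an assumption: the `Change` namedtuple is a hypothetical stand-in, the title is assumed to be written as a leading comment line, the metaline is assumed to serve as the per-change comment, and the demo metaline is invented.

```python
import os
import tempfile
from collections import namedtuple

Change = namedtuple("Change", "metaline lnum line newline")  # hypothetical stand-in

def write_changes(fileout, changes, title):
    outarr = ["; " + title]            # assumed: title as a comment line
    for change in changes:
        # Only the attributes of 'change' are needed here; no 'entry' object.
        outarr.append("; -------------------------------------")
        outarr.append("; " + change.metaline)
        outarr.append("%s old %s" % (change.lnum, change.line))
        outarr.append(";")
        outarr.append("%s new %s" % (change.lnum, change.newline))
    with open(fileout, "w", encoding="utf-8") as f:
        for out in outarr:
            f.write(out + "\n")
    print(len(changes), "records written to", fileout)

demo = [Change("<L>0<pc>210-1<k1>demo<k2>demo", 85319,
               "{%[Page210-2]%}", "[Page210-2]")]
outpath = os.path.join(tempfile.gettempdir(), "change_2b_demo.txt")
write_changes(outpath, demo, "demo title")
```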
Dear Jim, Thank you very much for your detailed explanations; they help me understand the core of the process. Some of the solutions I had were done just by using similar cases from previous programs, but now, step by step, I feel more confident (of course, more confident in comparison with zero level).
Now the program works. You can see change_2b.txt
Not only handsome, but smart - what else can we ask for? Thanks, Anna. Thanks, Jim.
@AnnaRybakovaT Excellent! Everything looks fine, and I have installed the changes into csl-orig.
This task is now complete.
Some of the page-break codings in md.txt are marked as italic. For example, under headword 'om':
{%[Page60-1]%}
https://sanskrit-lexicon.uni-koeln.de/simple/md/om
(In the Cologne digitizations, {%X%} is the usual way to indicate that the text X is to be rendered as italic.)
These should be changed by removing the italic markup, e.g.
{%[Page60-1]%} --> [Page60-1]
The objective is to create a change file that does all the corrections.
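The correction itself can be sketched as a single regular-expression substitution. This is a minimal illustration, assuming page codes always have the form [Page...] with no nested brackets; the helper name `strip_page_italics` is hypothetical.

```python
import re

# Matches {%[Page...]%} and captures the [Page...] part.
PAGE_ITALIC = re.compile(r"\{%(\[Page[^\]]*\])%\}")

def strip_page_italics(line):
    """Remove italic markup around a page-break code; leave other italics alone."""
    return PAGE_ITALIC.sub(r"\1", line)

fixed = strip_page_italics("{%[Page60-1]%}")
untouched = strip_page_italics("{#a#}¦a, {%pn.%} idam")
print(fixed)
print(untouched)
```

Restricting the pattern to [Page...] keeps ordinary italic text such as {%pn.%} unchanged.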