Italicized page errors in md.txt

funderburkjim commented 2 years ago

Some of the page-break codings in md.txt are marked as italic. For example, under headword 'om':
{%[Page60-1]%} https://sanskrit-lexicon.uni-koeln.de/simple/md/om

(In the Cologne digitizations, {%X%} is the usual way to indicate that the text X is to be rendered as italic.)

These should be changed by removing the italic markup, e.g. {%[Page60-1]%} --> [Page60-1]

The objective is to create a change file that does all the corrections.
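
A minimal sketch of the intended substitution, assuming the italicized page markers always have the form {%[Page...]%}. The function name `strip_page_italics` is hypothetical and is not part of the repository. The pattern deliberately targets only page markers, so other italic spans on the same line are left alone.

```python
import re

# Matches an italicized page marker such as {%[Page60-1]%} and captures
# the bare marker [Page60-1]. Other {%...%} italic spans are not matched.
PAGE_ITALIC = re.compile(r"\{%(\[Page[^\]]*\])%\}")

def strip_page_italics(line):
    """Return line with italic markup removed from page markers only."""
    return PAGE_ITALIC.sub(r"\1", line)

print(strip_page_italics("{%[Page60-1]%}"))
print(strip_page_italics("{%m.%} text {%[Page60-1]%}"))
```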

funderburkjim commented 2 years ago

Set up

The framework used in the 'pcerror' corrections (#7) can be adapted to this page error situation.

@AnnaRybakovaT The first step for you is to create a new 'step2b' directory, similar to the step2a directory. Everything will be done in this step2b directory. Any needed step2b files can be adapted from the corresponding step2a files.

First, modify readme.txt and in particular decide how a program to generate the changes should be invoked. In this case, we don't need an analog of the 'pcerrors.txt' file.

Then, get a fresh copy of the latest md digitization, as you did with step2a.

Then make your python step2b program, which will be an adaptation of the test_make_change_pc.py program you developed in step2a.

Remove code that pertains to the pcerrors.txt file, since we won't use that.
Let your generate_changes function just return an empty list.
Try running your modified program until it just results in 0 records written to ....

When you've got it running, we'll develop the code needed to generate the changes for our current italicized page errors.

The instructions above are intentionally vague, because I want you to learn the important task of how to adapt a framework (like that of step2a) to a new situation.

But of course I'll provide more specific hints if you get stuck with this setup step.

gasyoun commented 2 years ago

to learn the important task of how to adapt a framework (like that of step2a) to a new situation.

I'm your fan.

AnnaRybakovaT commented 2 years ago

But of course I'll provide more specific hints if you get stuck with this setup step.

Dear Jim, To be honest, I spent some hours trying to find a solution; probably I went down the wrong path and just wasted time. So I really need your hints. Here are my ideas.

First of all, our file change2b.txt can have a structure like this:
; -------------------------------------
; {%[Page210-2]%}
85319 old {%[Page210-2]%}
;
85319 new [Page210-2]
Since we don't need any extra data, I suppose our program to generate the changes can be invoked by: python make_change_pc2b.py temp_md_0.txt change_2b.txt
I have partly modified our program file (make_change_pc_2b.py). As far as I know, in the generate_changes function we should use RegEx. The easiest option is re.search (please see this test example). But when I run our program I get an error message:
```
x = re.search(r"\{\%\[Page", entry)
File "C:\Users\Rybakova\AppData\Local\Programs\Python\Python38\lib\re.py", line 201, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
```
Maybe I am doing something completely wrong. Please help me find the correct way.
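
The TypeError in the traceback is what re.search raises when its second argument is not a string or bytes object. A small sketch of that situation, assuming `entry` is an object that holds a list of strings in `datalines`. The `FakeEntry` class below is a hypothetical stand-in, not the class from digentry.py, and the datalines are illustrative.

```python
import re

class FakeEntry:
    """Hypothetical stand-in for an entry object: it is not a string."""
    def __init__(self, metaline, datalines):
        self.metaline = metaline
        self.datalines = datalines

# Illustrative data only.
entry = FakeEntry("<L>1<pc>001-1<k1>a<k2>a<h>1", ["{%[Page60-1]%}", "plain text"])

# Passing the object itself reproduces the error from the traceback.
try:
    re.search(r"\{%\[Page", entry)
    raised = False
except TypeError:
    raised = True

# Searching each string in entry.datalines works.
matches = [line for line in entry.datalines if re.search(r"\{%\[Page", line)]
print(raised, matches)
```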

funderburkjim commented 2 years ago

@AnnaRybakovaT Good progress. I especially like the way you show a mock-up of what the output should look like.

Please do a pull, as I added a 'show_entry.py' program (see discussion below). Two minor quibbles:

I think your latest program is 'make_change_pc_2b.py', but the spelling in readme.txt is '...pc2b.py'.
Your 'get_title' function is fine, but you should delete the 'pcrecs' argument, both in the function and where it is called in main.

A dummy generate_changes

def generate_changes(entries):
 changes = [] # computed by this function
 for entry in entries:
  pass
 print(len(changes),'lines that may need changes')
 return changes

This will return an empty list, and the program will run, but the output will only contain the title.

The relation between 'entry' and md.txt

You need to know how an 'entry' object corresponds to a set of lines in md.txt. Here's an illustration that may help.
An entry object corresponds to the sequence of lines in md.txt starting with a metaline (line beginning with <L>) and ending with the line beginning with <LEND>.

entries[0] corresponds to the first entry, entries[1] to the second entry, etc.

Here is an illustration of the correspondence for the first entry (entries[0]). First, the lines in md.txt with the line-numbers

28 <L>1<pc>001-1<k1>a<k2>a<h>1
29 {#a#}¦a, {%pn.%} {%root used in the inflexion of%} idam 
30 {%and in some particles%}: a-tra, a-tha.
31 <LEND>

Now, entries[0] corresponds to these lines. Here's how (with entry = entries[0]):

entry.metaline      <-> <L>1<pc>001-1<k1>a<k2>a<h>1
entry.datalines[0]  <-> {#a#}¦a, {%pn.%} {%root used in the inflexion of%} idam 
entry.datalines[1]  <-> {%and in some particles%}: a-tra, a-tha.
entry.lend          <-> <LEND>

Please take a look at show_entry.py, and try python show_entry.py 0 temp_md_0.txt temp_entry.txt and look at the output.
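
The correspondence above can be sketched as a small parser. This is only an illustration under assumptions: the `Entry` class and `parse_entries` function below are hypothetical simplifications, not the code in digentry.py. Only the attribute names metaline, datalines, and lend are taken from the description above.

```python
class Entry:
    """Simplified, hypothetical model of one dictionary entry."""
    def __init__(self, lines):
        self.metaline = lines[0]      # line beginning with <L>
        self.datalines = lines[1:-1]  # everything between <L> and <LEND>
        self.lend = lines[-1]         # line beginning with <LEND>

def parse_entries(lines):
    """Group lines into Entry objects; lines outside entries are skipped."""
    entries = []
    current = None  # lines of the entry being collected, or None
    for line in lines:
        if line.startswith("<L>"):
            current = [line]
        elif current is not None:
            current.append(line)
            if line.startswith("<LEND>"):
                entries.append(Entry(current))
                current = None
    return entries

sample = [
    "<L>1<pc>001-1<k1>a<k2>a<h>1",
    "{#a#}¦a, {%pn.%} {%root used in the inflexion of%} idam ",
    "{%and in some particles%}: a-tra, a-tha.",
    "<LEND>",
]
entries = parse_entries(sample)
print(entries[0].metaline)
print(len(entries[0].datalines))
```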

funderburkjim commented 2 years ago

We want to generate changes only for lines like {%[PageX]%}. In generate_changes, we need to loop over the datalines (like the loop in show_entry). If, in the loop, 'line' matches our pattern, then we need to generate the new line, create a change object: change = Change(metaline,lnum,line,newline), and append it to the 'changes' list. (The 'metaline' is for a comment string when we write the change.) The init function in the Change class will need to be changed.

And then the write_changes function will have to be changed to properly write each change.
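
One possible shape for the revised Change class, assuming it only needs to store the four values passed in the call above. The real class in the step2a program may differ, and the metaline used in the example is a made-up placeholder.

```python
class Change:
    """Holds what is needed to write one change record (a sketch)."""
    def __init__(self, metaline, lnum, line, newline):
        self.metaline = metaline  # used only as a comment in the output
        self.lnum = lnum          # line number of `line` in md.txt
        self.line = line          # the old text
        self.newline = newline    # the corrected text

# Illustrative values only; the metaline here is a placeholder.
change = Change("<L>0<pc>000-0<k1>x<k2>x", 85319, "{%[Page210-2]%}", "[Page210-2]")
print(change.lnum, change.line, "->", change.newline)
```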

This concludes the first set of hints/suggestions.

AnnaRybakovaT commented 2 years ago

An entry object corresponds to the sequence of lines in md.txt starting with a metaline (line beginning with <L>) and ending with the line beginning with <LEND>.

Dear Jim, Many thanks for this explanation. I partly understood this from the file digentry.py; now it is absolutely clear. Just one question regarding lines which are not included in entries and are located between them, like the line [Page104b-1]:

<L>8237<pc>104-a3<k1>wulla<k2>wulla
{#wulla#}¦ṭulla, {%m. N.%}. 
<LEND>
<H>{#Wa#} ṬH.
[Page104b-1]

<L>8238<pc>104-a3<k1>WakAra<k2>WakAra
{#WakAra#}¦ṭha-kāra, {%m.%} the letter {%th.%}.
<LEND>

Do we also treat those lines like datalines?

I have corrected our program file. Could you kindly check the updated make_change_pc_2b.py and give me one more set of hints?

funderburkjim commented 2 years ago

lines not in an entry

No, we do not treat lines between an <LEND> and the next <L>... as datalines.

When we construct the list of entries via entries = digentry.init(filein), that init function in digentry.py ignores the in-between lines.

These lines of xxx.txt between entries currently serve no visible purpose either in

the construction of an xml file (xxx.xml) from xxx.txt
the displays of the dictionary (which are based on xxx.xml).

If for some reason we did want to change some in-between line, we would have to construct a program differently, or do the change manually by editing xxx.txt.
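
As an illustration of what "a program constructed differently" might look like, here is a hedged sketch that tracks whether each line lies inside an entry and reports the non-blank lines that do not. The function name `lines_between_entries` is hypothetical, and the sketch does not use digentry.py.

```python
def lines_between_entries(lines):
    """Return (line_number, text) for non-blank lines outside any entry.

    Line numbers are 1-based, as in a text editor.
    """
    outside = []
    in_entry = False
    for lnum, line in enumerate(lines, start=1):
        if line.startswith("<L>"):
            in_entry = True
        elif line.startswith("<LEND>"):
            in_entry = False
        elif not in_entry and line.strip():
            outside.append((lnum, line))
    return outside

sample = [
    "<L>8237<pc>104-a3<k1>wulla<k2>wulla",
    "{#wulla#}¦ṭulla, {%m. N.%}. ",
    "<LEND>",
    "<H>{#Wa#} ṬH.",
    "[Page104b-1]",
    "",
    "<L>8238<pc>104-a3<k1>WakAra<k2>WakAra",
    "{#WakAra#}¦ṭha-kāra, {%m.%} the letter {%th.%}.",
    "<LEND>",
]
print(lines_between_entries(sample))
```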

funderburkjim commented 2 years ago

You're quite close to a solution, but are having problems which maybe could be characterized as 'scope' problems.

'if' scope

In generate_changes, we only want to add a change for lines that start with {%[Page.

   if line.startswith('{%[Page'):
    newline = re.sub(r"(\{\%)|(\%\})", '', line) # I am not sure
    lnum = linenum1 + iline + 1
   # we should mention metaline, but I don't know how  WRONG INDENTATION
   change = Change(metaline,lnum,line,newline)  # DITTO
   changes.append(change) # DITTO

Your newline and lnum lines are correctly indented under the 'if' to only be executed for lines satisfying the 'if' condition, but the next three lines also should be indented one more space so they also will be executed only for lines satisfying the 'if' condition.

funderburkjim commented 2 years ago

'for' scope

In the lines above, you are uncertain how to mention metaline. When the program executes Change(metaline,lnum,line,newline), values must have been set for the arguments of Change (e.g. value must be set for metaline, for lnum, for line, and for newline). So where does the value for 'metaline' come from?

Well, these lines occur within the 'for iline,...' loop, which is itself within the 'for entry' loop. So in particular, we can use the 'entry' object to get a value. The 'entry' object has a 'metaline' attribute (refer to the Entry class init method in digentry.py), so entry.metaline is available. We can therefore set the value of the local variable 'metaline' to be the same as 'entry.metaline':

 for entry in entries:
  for iline,line in enumerate(entry.datalines):
   #lnum = linenum1 + iline + 1 
   if line.startswith('{%[Page'):
    newline = re.sub(r"(\{\%)|(\%\})", '', line) # I am not sure
    lnum = linenum1 + iline + 1
    # we should mention metaline, but I don't know how
    metaline = entry.metaline   # <<< set value for local metaline variable
    change = Change(metaline,lnum,line,newline)
    changes.append(change)

linenum1 problem

If you run the program as above, then you will see an error message:

    lnum = linenum1 + iline + 1
NameError: name 'linenum1' is not defined

This tells you that the local variable linenum1 does not have a value. Do you see where to get a value for linenum1?
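
A worked check of the line-number arithmetic, under the assumption that linenum1 is the 1-based line number of the entry's metaline in md.txt. It is written here as a plain variable; where the real program obtains it is the question posed above. The data is the first entry shown earlier, whose metaline is on line 28.

```python
# Assumption: linenum1 is the 1-based line number of the entry's metaline.
linenum1 = 28  # metaline of the first entry, per the listing earlier
datalines = [
    "{#a#}¦a, {%pn.%} {%root used in the inflexion of%} idam ",
    "{%and in some particles%}: a-tra, a-tha.",
]

# iline counts datalines from 0, so the first dataline is one line
# below the metaline: linenum1 + 0 + 1.
lnums = [linenum1 + iline + 1 for iline, _ in enumerate(datalines)]
print(lnums)
```

The computed numbers agree with the numbered listing of the first entry, where the two datalines sit on lines 29 and 30.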

funderburkjim commented 2 years ago

function scope

With the changes above, you probably have a correctly functioning generate_changes function, which returns a changes list (which in this case has 13 Change objects).

Then the main program calls write_changes(fileout,changes,title). The only values which you can use in this function are those appearing in the parameter list of the function -- namely the value for fileout, the value for changes, and the value for title.

write_changes does a for loop over the changes list (for change in changes:). In this loop, you can make full use of the attribute values of the 'change' object: i.e., you can use change.metaline, change.lnum, change.line, and change.newline.

But note that there is no value for 'entry'. This isn't a problem, because the only values we need to generate the lines of outarr for this change are in those four change attribute values.

I'll let you work to modify write_changes.
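
A sketch of one way write_changes could be organized, assuming a record layout like the mock-up earlier in the thread (a separator comment, a comment line, then the old and new lines), with the metaline used as the comment. The `Change` class here is a minimal stand-in, the helper `change_lines` is hypothetical, and the way the title is written is a guess; the real program may differ.

```python
class Change:
    """Minimal stand-in with the four attributes named above."""
    def __init__(self, metaline, lnum, line, newline):
        self.metaline = metaline
        self.lnum = lnum
        self.line = line
        self.newline = newline

def change_lines(change):
    """Build the output lines for one change (assumed layout)."""
    return [
        "; -------------------------------------",
        "; %s" % change.metaline,
        "%s old %s" % (change.lnum, change.line),
        ";",
        "%s new %s" % (change.lnum, change.newline),
    ]

def write_changes(fileout, changes, title):
    """Write the title, then one record per change."""
    outarr = [title]
    for change in changes:
        # Only the change object is needed here; no 'entry' is in scope.
        outarr.extend(change_lines(change))
    with open(fileout, "w", encoding="utf-8") as f:
        for out in outarr:
            f.write(out + "\n")
    print(len(changes), "records written to", fileout)
```

Splitting the per-record formatting into its own helper keeps the function dependent only on its three parameters, which is the point about function scope made above.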

AnnaRybakovaT commented 2 years ago

In this loop, you can make full use of the attribute values of the 'change' object: i.e., you can use change.metaline, change.lnum, change.line, and change.newline.

Dear Jim, Thank you very much for your detailed explanations; they help me to understand the core of the process. Some solutions which I had were done just by using similar cases from previous programs, but now, step by step, I feel more confident (of course, more confident in comparison with zero level).

Now the program works. You can see change_2b.txt

gasyoun commented 2 years ago

Now the program works. You can see change_2b.txt

Not only handsome, but smart - what else can we ask for? Thanks, Anna. Thanks, Jim.

funderburkjim commented 2 years ago

@AnnaRybakovaT Excellent! Everything looks fine, and I have installed the changes into csl-orig.

This task is now complete.
