Closed funderburkjim closed 7 months ago
Working directory for this issue: pwkissues/issue102.
temp_pw_17a.txt -- 4 changes from temp_pw_ab_17.txt above. see changes_17a.txt.
@Andhrabharati Do you have a revision for pwkvn that I should apply to the cdsl version? I should incorporate your pwkvn changes before attempting to merge pwkvn into pwk.
Do you have a revision for pwkvn that I should apply to the cdsl version?
Yes I do, @funderburkjim ! I did some work, esp. to bring the pwkvn to the same format as the main pw.txt (apart from many other points).
BTW, I see that the transcoder file is now giving some errors on the revised file (and outputs a file only up to, but not including, the first metaline!!), which I wanted to use for "proofing" the pwkvn file once--
C:\pw-transcode> python pw_transcode.py slp1 deva .\pwkvn.txt .\pwkvn_deva.txt
Traceback (most recent call last):
File "C:\pw-transcode\pw_transcode.py", line 149, in <module>
lineout = convert_metaline(line,tranin,tranout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\pw-transcode\pw_transcode.py", line 71, in convert_metaline
k1a = transcode(k1,tranin,tranout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\pw-transcode\pw_transcode.py", line 56, in transcode
y = transcoder.transcoder_processString(x,tranin,tranout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\pw-transcode\transcoder.py", line 257, in transcoder_processString
transcoder_fsm(from1,to)
File "C:\pw-transcode\transcoder.py", line 74, in transcoder_fsm
tree = ET.parse(filein)
^^^^^^^^^^^^^^^^
File "C:\Program Files\Python312\Lib\xml\etree\ElementTree.py", line 1203, in parse
tree.parse(source, parser)
File "C:\Program Files\Python312\Lib\xml\etree\ElementTree.py", line 568, in parse
self._root = parser._parse_whole(source)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
PS C:\pw-transcode>
Could you pl. tell why the problem is occurring? Same problem occurs with the pw_AB file as well.
Anyway, here is the pwkvn file to "adopt" for cdsl usage-- pwkvn_AB v.1.zip
Now, coming to my pw_AB v.2 file, I had already mentioned about it earlier.
I see that the original typed text also has the volume-page notation, as seen at the text given by Thomas recently, while commenting about the das. abbreviation.
The dot between volume & page got missed in the present pw.txt, and probably Jim might not mind bringing it back.
Otherwise, I shall revert the correction in my v.2 file before giving it out (to start the next phase of corrections in cdsl pw.txt).
the transcoder file is giving some errors now on the revised file
I do not find this problem when I run locally for pwkvn.
This may be a python version problem. My local version is 3.9.1
And your error message shows Python312. (version 3.12).
Can you use version 3.9 of python?
Also note -- While my local conversion of pwkvn gave no error, I DID find an 'invertibility' problem -- as for instance {#SruteratiTI-<lb/>kftA#} -- the problem is with the <lb/> within {#X#}.
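A minimal sketch of how such cases could be listed, assuming the only concern is markup such as <lb/> sitting inside a {#...#} span. The function name `spans_with_markup` and the two sample lines are illustrative, not part of pw_transcode.py.

```python
import re

# One {#...#} span; "." does not cross line ends, so spans stay on one line.
SPAN = re.compile(r"\{#(.*?)#\}")
# Any XML-like tag, for example <lb/>.
TAG = re.compile(r"<[^>]+>")


def spans_with_markup(lines):
    """Yield (line_number, span) for every {#...#} span that contains a tag."""
    for lnum, line in enumerate(lines, start=1):
        for m in SPAN.finditer(line):
            if TAG.search(m.group(1)):
                yield lnum, m.group(0)


# Illustrative input: the first form has the <lb/>, the second does not.
sample = [
    "{#SruteratiTI-<lb/>kftA#}",
    "{#SruteratiTIkftA#}",
]
found = list(spans_with_markup(sample))
print(found)
```

Run over the lines of pwkvn.txt, this would report each span that is likely to break the round trip, together with its line number.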
The dot between volume & page got missed in the present pw.txt, and probably Jim might not mind bringing it back.
{%niederwerfen…,…niederhauen%} , [Page4.013-3] (FROM THOMAS COMMENT)
I think that '.' in [Page4.013-3] was made by Thomas for his convenience (he was confused by [Page4013-3] since there is no page 4013 in the pdf.)
The '.' is not part of pw.txt, nor of the display of pw. Nor has it been previously.
So there is nothing to 'bring back'.
Even I got confused with this at times; so looked around and thought of changing the pc as v-p-c, as in other cdsl works.
Even if not to "bring back", would you mind changing it @funderburkjim ? [Of course, as I had already mentioned in my above posting, it might need changes at many places, not a single (and easy) task!]
@funderburkjim
Uninstalled Python 3.12 and installed Python 3.9; but still the same error appears for me--
PS C:\pw-transcode> python pw_transcode.py slp1 deva .\pwkvn.txt .\pwkvn_deva.txt
Traceback (most recent call last):
File "C:\pw-transcode\pw_transcode.py", line 149, in <module>
lineout = convert_metaline(line,tranin,tranout)
File "C:\pw-transcode\pw_transcode.py", line 71, in convert_metaline
k1a = transcode(k1,tranin,tranout)
File "C:\pw-transcode\pw_transcode.py", line 56, in transcode
y = transcoder.transcoder_processString(x,tranin,tranout)
File "C:\pw-transcode\transcoder.py", line 257, in transcoder_processString
transcoder_fsm(from1,to)
File "C:\pw-transcode\transcoder.py", line 74, in transcoder_fsm
tree = ET.parse(filein)
File "C:\Program Files\Python39\lib\xml\etree\ElementTree.py", line 1224, in parse
tree.parse(source, parser)
File "C:\Program Files\Python39\lib\xml\etree\ElementTree.py", line 580, in parse
self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
PS C:\pw-transcode>
Would you pl. give me the converted pwkvn_deva file for now, so that I can start proofing the same?
BTW, where did you find {#SruteratiTI-<lb/>kftA#}? My file has only {#SruteratiTIkftA#}!!
I see that the 'pwkvn' file also uses the 'v-page-col' form for 'pc'.
When I get to the task of integrating pwkvn into pw, then maybe that will be the time to change 'pc' in pw.txt from 'vpage-c' to 'v-page-col'.
{#SruteratiTI-<lb/>kftA#} appears in the current csl-orig pwkvn.txt at line 1094.
I see that the 'pwkvn' file also uses the 'v-page-col' form for 'pc'.
When I get to the task of integrating pwkvn into pw, then maybe that will be the time to change 'pc' in pw.txt from 'vpage-c' to 'v-page-col'.
Good to hear this!
So, shall I post my v.2 file as is now? [along with the steps involved in converting ab_17 to that form] Or, should wait till your perusal of my pwkvn work is over?
Version confusion!
In the comments above, we've mentioned both pw and pwkvn. and we've mentioned both AB.V1. and AB.V2. You have uploaded a pwkvn_AB_v.1.txt. You have requested pwkvn_deva from me. Should this be a conversion of
You asked: shall I post my v.2 file as is now? [along with the steps involved in converting ab_17 to that form].
My copy pwkvn_AB_v.1.txt
I had made v.2 (long back, over 2 months ago) from my v.1 file that was posted initially for the abbr. work.
And I have been updating the same with your successive steps from 1 to 16, so far.
Shall post the file tomorrow, as I had just shutdown my system and on my mobile now.
post the file tomorrow
-- Sounds good.
The Python errors above are occurring at line 74 of transcoder.py, tree = ET.parse(filein), where filein is the name of one of the transcoder files (in the transcoder directory).
The et_example folder contains a published simple example of using ET.parse.
@Andhrabharati If you try this example on your local system, does it work?
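Since the failure is inside ET.parse, one way to narrow it down is to parse each transcoder file on its own and look at the first bytes of any file that fails. The following is a rough diagnostic sketch, not part of pw_transcode.py; the helper name `check_xml_files`, the `*.xml` glob, and the two demo files are assumptions made for illustration.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def check_xml_files(directory, pattern="*.xml"):
    """Parse each matching file; map file name -> None (ok) or an error note."""
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        try:
            ET.parse(str(path))
            results[path.name] = None
        except ET.ParseError as err:
            # Show the first bytes so whatever sits at "line 1, column 0"
            # of the failing file becomes visible.
            head = path.read_bytes()[:40]
            results[path.name] = f"{err} | first bytes: {head!r}"
    return results


# Self-contained demonstration with made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "good.xml").write_text("<fsm><e>ok</e></fsm>", encoding="utf-8")
    Path(tmp, "bad.xml").write_bytes(b"this is not xml")
    report = check_xml_files(tmp)

for name, problem in report.items():
    print(name, "OK" if problem is None else problem)
```

On the local system the call would presumably be `check_xml_files("transcoder")`, run from the folder that holds pw_transcode.py.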
Here is the result--
PS C:\pw-transcode\test> python test1.py
data {}
Liechtenstein 1
Singapore 4
Panama 68
PS C:\pw-transcode\test>
Step-1: merging the separate [Pagexxx] lines "into" the other lines.
(a) <LEND>\n[
-> <LEND> [
;; 682113 -> 679088 (-3025)
(b) \[Page(.*?)\]\n<LEND>
-> <LEND> \[Page\1\]
;; 679088 -> 679068 (-20)
(c) ([^\n])\n\[Page(.*?)\]\n
-> \1 \[Page\2\]
;; 679068 -> 673636 (-5432)
(d) ] <div n=
-> ]\n<div n=
;; 673636 -> 674062 (426)
Now we have equal line counts (674062) in pw_ab_17a and pw (AB v.2), facilitating comparison.
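The four Step-1 substitutions can be sketched in Python as below. The patterns and replacements are transcribed as they appear in the notes above (with the editor-style `\[` in the replacements written as a plain `[`); any whitespace that did not survive in the notes is not reconstructed. The sample text and the name `apply_step1` are made up for illustration.

```python
import re

# (label, search pattern, replacement), in the order given in Step-1.
STEP1 = [
    ("a", r"<LEND>\n\[", "<LEND> ["),
    ("b", r"\[Page(.*?)\]\n<LEND>", r"<LEND> [Page\1]"),
    ("c", r"([^\n])\n\[Page(.*?)\]\n", r"\1 [Page\2]"),
    ("d", r"\] <div n=", "]\n<div n="),
]


def apply_step1(text):
    """Apply (a)-(d) in order; also return (label, lines_before, lines_after)."""
    counts = []
    for label, pattern, repl in STEP1:
        before = text.count("\n")
        text = re.sub(pattern, repl, text)
        counts.append((label, before, text.count("\n")))
    return text, counts


# Made-up miniature input in the old layout ([Page...] on separate lines).
sample = (
    "<L>1<pc>1001-1<k1>a\n"
    "body one\n"
    "[Page1001-2]\n"
    "body two\n"
    "<LEND>\n"
    "[Page1001-3]\n"
    "<L>2<pc>1001-3<k1>b\n"
    "body three\n"
    "[Page1002-1]\n"
    "<LEND>\n"
)
result, counts = apply_step1(sample)
for label, before, after in counts:
    print(f"({label}) {before} -> {after} ({after - before:+d})")
```

As transcribed, pattern (c) also consumes the newline after the page marker, so the following line is joined on directly; if the original replacement ended in a space or newline, the replacement string would need adjusting.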
Step-2: merging consecutive <ls n="Chr.
</ls>. <ls n="Chr.(.*?)">
-> '. '
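A small sketch of the Step-2 substitution, assuming it was run as a regular expression. Note that in the pattern as written each '.' is a wildcard, so it also matches `</ls>) <ls n="Chr.">`; applied to the yajus fragment (entry 89441, quoted later in this thread) it reproduces the unintended `Bibl. ind.. 1,3` form. The escaped variant is a suggestion, not something taken from the notes.

```python
import re

# Pattern exactly as written in Step-2: each "." is a regex wildcard.
AS_WRITTEN = re.compile(r'</ls>. <ls n="Chr.(.*?)">')
# Stricter variant (an assumption, not from the thread): dots are literal.
ESCAPED = re.compile(r'</ls>\. <ls n="Chr\.(.*?)">')


def merge_ls(text, pattern=ESCAPED):
    """Fold '</ls>. <ls n="Chr...">' into '. ' so both references share one <ls>."""
    return pattern.sub(". ", text)


# Made-up pair of consecutive references.
pair = '<ls n="Chr.">1,2</ls>. <ls n="Chr.">3,4</ls>'
merged = merge_ls(pair)

# Fragment of entry 89441 (yajus) in its old form.
yajus = '(in der <ls>Bibl. ind.</ls>) <ls n="Chr.">1,3</ls> (zweimal).'
loose = merge_ls(yajus, AS_WRITTEN)   # wildcard dot also matches ")"
strict = merge_ls(yajus, ESCAPED)     # literal dot leaves the line alone

print(merged)
print(loose)
print(strict)
```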
Step-3: removing the italic terminations around [Pagexxx]
%} \[Page(.*?)\] {%
-> ' [Page\1] '
Step-4: Changing the page & column numbers after <pc> and [Page
(a) Insert a '-' after the first (volume) digit.
(b) Change the ending (column) digit -[123] to a letter -[abc] resp.
After these changes, we have the two files differing in about 7000+ lines.
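Steps 3 and 4 can be sketched as two small substitutions. This assumes the old form is always one volume digit followed directly by the page digits and a column digit 1-3 (as in [Page4013-3]); the old value `5116-1` in the sample is inferred from the new form `5-116-a`, and the function names are illustrative.

```python
import re

# Step-3: "%} [Page...] {%" closes and reopens italics around a page marker.
ITALIC_BREAK = re.compile(r"%\} \[Page(.*?)\] \{%")
# Step-4: volume digit, page digits, column digit, after <pc> or [Page.
PC = re.compile(r"(<pc>|\[Page)(\d)(\d+)-([123])")
COLUMN = {"1": "a", "2": "b", "3": "c"}


def step3(text):
    """Drop the italic close/open around [Page...], keeping the marker."""
    return ITALIC_BREAK.sub(r" [Page\1] ", text)


def step4(text):
    """Rewrite 'vppp-n' as 'v-ppp-x' (n in 1..3 becomes x in a..c)."""
    def repl(m):
        prefix, vol, page, col = m.groups()
        return f"{prefix}{vol}-{page}-{COLUMN[col]}"
    return PC.sub(repl, text)


meta = step4("<L>89441<pc>5116-1<k1>yajus")
body = step4(step3("{%foo%} [Page4013-3] {%bar%}"))
print(meta)
print(body)
```

Lines already in the new form are left unchanged by `step4`, because the pattern requires a second digit directly after the volume digit.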
temp_pw_ab_17a.zip and pw (AB v2).zip
The majority of changes are--
(a) fetching more 'grouped' HWs in the file
(b) clubbing of ls-entities together
(c) punctuation
@Andhrabharati
Have been able to programmatically reproduce your temp_pw_ab_17a.txt from temp_pw_17a.txt. Work in issue102/step1.
Your description of these changes was of great help!
See change_diff_4.txt for 21 additional corrections you made but did not mention.
I'll switch to pwkvn now (#103) before further investigation of your changes in temp_pw_AB_v2.txt
@funderburkjim
All the 20 "[Pagexxx]" changes mentioned in your addl. corrections file above were "included" in the Step-4 in my notes.
And then there is one mistake (!?) in my file, which you have noted at <L>89441<pc>5-116-a<k1>yajus<k2>ya/jus<e>100
<L>89441<pc>5-116-a<k1>yajus<k2>ya/jus<e>100
440422 old <div n="2">— d〉 <ab>Bez.</ab> {%eines <ab>best.</ab> Spruches%} <ls>NṚS. TĀP. UP.</ls> (in der <ls>Bibl. ind.</ls>) <ls n="Chr.">1,3</ls> (zweimal).
;
440422 new <div n="2">— d〉 <ab>Bez.</ab> {%eines <ab>best.</ab> Spruches%} <ls>NṚS. TĀP. UP.</ls> (in der <ls>Bibl. ind.. 1,3</ls> (zweimal).
;; AB note
to change the complex
old <ls>NṚS. TĀP. UP.</ls> (in der <ls>Bibl. ind.</ls>) <ls n="Chr.">1,3</ls>
from
new <ls>NṚS. TĀP. UP.</ls> (in der <ls>Bibl. ind.. 1,3</ls>
to the simpler
new <ls>NṚS. TĀP. UP. 1,3</ls> (in der <ls>Bibl. ind.</ls>)
similar to the 475321 content <ls>NṚS. UP. 1,3</ls> in der <ls>Bibl. ind.</ls>
[as done in my v.2 file]
The work is done in the issue102/step2 directory.
The first small step was to remove the ';;' comments in AB version and make corresponding changes in cdsl version, resulting in temp_pw_v1_0.txt and temp_pw_v2_0.txt (v1=cdsl, v2=AB).
See diff_AB_v2_v2_0.txt for diff temp_pw_AB_v2.txt temp_pw_v2_0.txt.
change_v2_1.txt (39 changes) documents the further changes to AB version temp_pw_v2_0.txt.
change_v1_1.txt (7252 changes) documents the further changes to cdsl version temp_pw_v1_0.txt.
Respectively applying these changes yields temp_pw_v1_1.txt and temp_pw_v2_1.txt.
diff temp_pw_v1_1.txt temp_pw_v2_1.txt | wc -l
0 diffs. These files are identical, so all differences are resolved.
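Where `diff` and `wc` are not at hand (for example in a plain Windows shell), a comparable check can be sketched in Python. This is a positional line-by-line comparison, which is not the same algorithm as `diff`, but it is sufficient once both files have the same number of lines. The function name and the demo files are made up.

```python
import tempfile
from itertools import zip_longest
from pathlib import Path


def differing_lines(path1, path2):
    """Return the 1-based line numbers at which the two files differ."""
    with open(path1, encoding="utf-8") as f1, open(path2, encoding="utf-8") as f2:
        return [
            n
            for n, (a, b) in enumerate(zip_longest(f1, f2), start=1)
            if a != b  # a missing line (None) also counts as a difference
        ]


# Demonstration on two tiny made-up files.
with tempfile.TemporaryDirectory() as tmp:
    p1, p2, p3 = (Path(tmp, name) for name in ("v1.txt", "v2.txt", "v3.txt"))
    p1.write_text("a\nb\nc\n", encoding="utf-8")
    p2.write_text("a\nb\nc\n", encoding="utf-8")
    p3.write_text("a\nX\nc\nd\n", encoding="utf-8")
    same = differing_lines(p1, p2)
    changed = differing_lines(p1, p3)

print(len(same), "diffs")
print(changed)
```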
This is the version pushed to csl-orig repository at the commit mentioned in above comment.
Other repositories also required some change for xml-validation and proper behavior of the displays. Notably, the new [Page v-ppp-c] format is now used for page links, as requested in a previous comment.
@Andhrabharati I think this issue may now be closed. Agree?
@funderburkjim
Out of the 32 Misc. changes done in the AB version, I've noticed 5 corrections (at 36207, 36215, 324234, 339882 and 426957)-- corrections.txt
You may correct these in the cdsl text also.
The rest are mostly related to italic marking, to which I deliberately didn't pay much attention earlier (having thought of doing a full text reading once; and I would probably take this up quite soon).
Glad that the CDSL and AB versions are now tallying!!
I have just returned home from a long journey (and too tired), and shall look at the rest of the actions (that you had taken) tomorrow.
@Andhrabharati @maltenth has been working on two types of corrections
<bot>
tags. Let's defer your further work (including your small 'corrections' file) until this work with Thomas is finished. Thus, I'm closing this issue.
BTW, I am doubtful of the 'print change' suggestions of your corrections file, but we can discuss further in another issue.
This issue continues the revisions of PW digitization at #88, based upon work done by @Andhrabharati. We start with AB's temp_pw_ab_17.zip