sanskrit-lexicon / PWK

Sanskrit-Wörterbuch in kürzerer Fassung, 7 Bände Petersburg 1879-1889

pw revisions based on AB version(s), continued #102

Closed funderburkjim closed 7 months ago

funderburkjim commented 7 months ago

This issue continues the revisions of PW digitization at #88, based upon work done by @Andhrabharati. We start with AB's temp_pw_ab_17.zip

funderburkjim commented 7 months ago

Working directory for this issue: pwkissues/issue102.

temp_pw_17a.txt -- 4 changes from temp_pw_ab_17.txt above. see changes_17a.txt.

funderburkjim commented 7 months ago

@Andhrabharati Do you have a revision for pwkvn that I should apply to the cdsl version? I should incorporate your pwkvn changes before attempting to merge pwkvn into pwk.

Andhrabharati commented 7 months ago

Do you have a revision for pwkvn that I should apply to the cdsl version?

Yes I do, @funderburkjim ! I did some work, esp. to bring the pwkvn to the same format as the main pw.txt (apart from many other points).

BTW, I see that the transcoder file is giving some errors now on the revised file (and outputs a file only up to, but not including, the first metaline), which I wanted to use for "proofing" the pwkvn file once--

C:\pw-transcode> python pw_transcode.py slp1 deva .\pwkvn.txt .\pwkvn_deva.txt
Traceback (most recent call last):
  File "C:\pw-transcode\pw_transcode.py", line 149, in <module>
    lineout = convert_metaline(line,tranin,tranout)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\pw-transcode\pw_transcode.py", line 71, in convert_metaline
    k1a = transcode(k1,tranin,tranout)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\pw-transcode\pw_transcode.py", line 56, in transcode
    y = transcoder.transcoder_processString(x,tranin,tranout)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\pw-transcode\transcoder.py", line 257, in transcoder_processString
    transcoder_fsm(from1,to)
  File "C:\pw-transcode\transcoder.py", line 74, in transcoder_fsm
    tree = ET.parse(filein)
           ^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\xml\etree\ElementTree.py", line 1203, in parse
    tree.parse(source, parser)
  File "C:\Program Files\Python312\Lib\xml\etree\ElementTree.py", line 568, in parse
    self._root = parser._parse_whole(source)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
PS C:\pw-transcode>

Could you pl. tell why the problem is occurring? Same problem occurs with the pw_AB file as well.

Andhrabharati commented 7 months ago

Anyway, here is the pwkvn file to "adopt" for cdsl usage-- pwkvn_AB v.1.zip

Andhrabharati commented 7 months ago

Now, coming to my pw_AB v.2 file, I had already mentioned about it earlier.

I see that the original typed text also has the volume-page notation, as seen at the text given by Thomas recently, while commenting about the das. abbreviation. image

The dot between volume & page got missed in the present pw.txt, and probably Jim might not mind bringing it back.

Otherwise, I shall revert the correction in my v.2 file, for giving it out (to start the next phase of corrections in cdsl pw.text)

funderburkjim commented 7 months ago

the transcoder file is giving some errors now on the revised file

I do not find this problem when I run locally for pwkvn.

This may be a python version problem. My local version is 3.9.1

And your error message shows Python312. (version 3.12).

Can you use version 3.9 of python?

Also note -- While my local conversion of pwkvn gave no error, I DID find an 'invertibility' problem -- as for instance {#SruteratiTI-<lb/>kftA#} -- the problem is with the <lb/> within {#X#}.
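Spans of the kind behind this invertibility problem can be located with a short scan. The following is a minimal sketch and is not part of pw_transcode.py: the function name find_markup_in_slp1 is hypothetical, and it assumes that {#...#} spans do not nest and do not cross line boundaries.

```python
import re

# Matches one {#...#} span; assumes spans do not nest or cross lines.
SLP1_SPAN = re.compile(r"\{#(.*?)#\}")
# Any XML-style tag such as <lb/> inside the span text.
TAG = re.compile(r"<[^>]+>")


def find_markup_in_slp1(lines):
    """Return (line_number, span) pairs for {#...#} spans that contain a tag."""
    hits = []
    for lnum, line in enumerate(lines, start=1):
        for match in SLP1_SPAN.finditer(line):
            if TAG.search(match.group(1)):
                hits.append((lnum, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = ["{#SruteratiTI-<lb/>kftA#}", "{#SruteratiTIkftA#}"]
    for lnum, span in find_markup_in_slp1(sample):
        print(lnum, span)
```

Run over the lines of pwkvn.txt, this would list each line number where a tag sits inside a {#X#} span, which is where a round trip through the transcoder is most likely to fail.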

funderburkjim commented 7 months ago

The dot between volume & page got missed in the present pw.txt, and probably Jim might not mind bringing it back.

{%niederwerfen…,…niederhauen%} , [Page4.013-3]  (FROM THOMAS COMMENT)

I think that '.' in [Page4.013-3] was made by Thomas for his convenience (he was confused by [Page4013-3] since there is no page 4013 in the pdf.)

The '.' is not part of pw.txt, nor of the display of pw. Nor has it been previously.

So there is nothing to 'bring back'.

Andhrabharati commented 7 months ago

Even I got confused with this at times; so looked around and thought of changing the pc as v-p-c, as in other cdsl works.

Even if not to "bring back", would you mind changing it @funderburkjim ? [Of course, as I had already mentioned in my above posting, it might need changes at many places, not a single (and easy) task!]

Andhrabharati commented 7 months ago

@funderburkjim

Uninstalled Python 3.12 and installed Python 3.9; but still the same error appears for me--

PS C:\pw-transcode> python pw_transcode.py slp1 deva .\pwkvn.txt .\pwkvn_deva.txt
Traceback (most recent call last):
  File "C:\pw-transcode\pw_transcode.py", line 149, in <module>
    lineout = convert_metaline(line,tranin,tranout)
  File "C:\pw-transcode\pw_transcode.py", line 71, in convert_metaline
    k1a = transcode(k1,tranin,tranout)
  File "C:\pw-transcode\pw_transcode.py", line 56, in transcode
    y = transcoder.transcoder_processString(x,tranin,tranout)
  File "C:\pw-transcode\transcoder.py", line 257, in transcoder_processString
    transcoder_fsm(from1,to)
  File "C:\pw-transcode\transcoder.py", line 74, in transcoder_fsm
    tree = ET.parse(filein)
  File "C:\Program Files\Python39\lib\xml\etree\ElementTree.py", line 1224, in parse
    tree.parse(source, parser)
  File "C:\Program Files\Python39\lib\xml\etree\ElementTree.py", line 580, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
PS C:\pw-transcode>

Would you pl. give me the converted pwkvn_deva file for now, so that I can start proofing the same?

Andhrabharati commented 7 months ago

BTW, where did you find {#SruteratiTI-<lb/>kftA#}? My file has only {#SruteratiTIkftA#}!!

funderburkjim commented 7 months ago

I see that the 'pwkvn' file also uses the 'v-page-col' form for 'pc'.

When I get to the task of integrating pwkvn into pw, then maybe that will be the time to change 'pc' in pw.txt from 'vpage-c' to 'v-page-col'.

funderburkjim commented 7 months ago

{#SruteratiTI-<lb/>kftA#} appears in the current csl-orig pwkvn.txt at line 1094

Andhrabharati commented 7 months ago

I see that the 'pwkvn' file also uses the 'v-page-col' form for 'pc'.

When I get to the task of integrating pwkvn into pw, then maybe that will be the time to change 'pc' in pw.txt from 'vpage-c' to 'v-page-col'.

Good to hear this!

So, shall I post my v.2 file as is now? [along with the steps involved in converting ab_17 to that form] Or, should wait till your perusal of my pwkvn work is over?

funderburkjim commented 7 months ago

Version confusion!

In the comments above, we've mentioned both pw and pwkvn. and we've mentioned both AB.V1. and AB.V2. You have uploaded a pwkvn_AB_v.1.txt. You have requested pwkvn_deva from me. Should this be a conversion of

You asked shall I post my v.2 file as is now? [along with the steps involved in converting ab_17 to that form].

Andhrabharati commented 7 months ago

My copy pwkvn_AB_v.1.txt

Andhrabharati commented 7 months ago

I had made v.2 (long back, over 2 months ago) from my v.1 file that was posted initially for the abbr. work.

And I have been updating the same with your successive steps from 1 to 16, so far.

I shall post the file tomorrow, as I have just shut down my system and am on my mobile now.

funderburkjim commented 7 months ago

post the file tomorrow -- Sounds good.

funderburkjim commented 7 months ago

why the conversion problem?

The Python errors above occur at line 74 of transcoder.py, namely tree = ET.parse(filein), where filein is the name of one of the transcoder files (in the transcoder directory).

The et_example folder contains a published simple example of using ET.parse.

@Andhrabharati If you try this example on your local system, Does it work?
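Since "line 1, column 0" points at the very start of the file being parsed, another quick check is to look at the first bytes of each transcoder XML file. The sketch below is only a diagnostic aid and makes no claim about the actual cause: the helper name check_xml_file is hypothetical, the folder name "transcoder" is taken from the comment above, and the "*.xml" file pattern is an assumption.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def check_xml_file(path):
    """Try to parse one file; return (ok, first_bytes, message)."""
    first_bytes = Path(path).read_bytes()[:40]
    try:
        ET.parse(str(path))
    except ET.ParseError as err:
        return False, first_bytes, str(err)
    return True, first_bytes, "ok"


if __name__ == "__main__":
    # Folder name from the comment above; the "*.xml" pattern is an assumption.
    folder = Path("transcoder")
    for xml_path in sorted(folder.glob("*.xml")):
        ok, head, msg = check_xml_file(xml_path)
        print(xml_path.name, "OK" if ok else "FAIL", repr(head), msg)
```

A file that fails here and whose first bytes do not look like XML (for example, they do not begin with "<") would suggest the local copy of that file differs from the one in the repository.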

Andhrabharati commented 7 months ago

Here is the result--

PS C:\pw-transcode\test> python test1.py
data
{}
Liechtenstein 1
Singapore 4
Panama 68
PS C:\pw-transcode\test>

Andhrabharati commented 7 months ago

ab_17 to ab_17a (adjustments)

Step-1: merging the separate [Pagexxx] lines "into" the other lines.

(a) <LEND>\n[ -> <LEND> [ ;; 682113 -> 679088 (-3025)
(b) \[Page(.*?)\]\n<LEND> -> <LEND> \[Page\1\] ;; 679088 -> 679068 (-20)
(c) ([^\n])\n\[Page(.*?)\]\n -> \1 \[Page\2\] ;; 679068 -> 673636 (-5432)
(d) ] <div n= -> ]\n<div n= ;; 673636 -> 674062 (426)

Now we have equal line numbers (674062) in pw_ab_17a and pw (AB v.2), facilitating comparison.
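The four Step-1 substitutions could be scripted roughly as follows. This is a sketch under stated assumptions, not the script actually used in issue102/step1: the names STEP1_RULES and apply_step1 are hypothetical, the whole file is assumed to be read into one string, and the replacement in (c) is assumed to end with a space so that rule (d) can find "] <div n=".

```python
import re

# (pattern, replacement) pairs in the order (a)-(d) given above.
STEP1_RULES = [
    (re.compile(r"<LEND>\n\["), "<LEND> ["),
    (re.compile(r"\[Page(.*?)\]\n<LEND>"), r"<LEND> [Page\1]"),
    # Assumption: the replacement ends with a space (see lead-in).
    (re.compile(r"([^\n])\n\[Page(.*?)\]\n"), r"\1 [Page\2] "),
    (re.compile(r"\] <div n="), "]\n<div n="),
]


def apply_step1(text):
    """Apply the Step-1 rules in sequence to the full file text."""
    for pattern, replacement in STEP1_RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = 'abc\n[Page1002-1]\n<div n="2">def\n'
    print(apply_step1(sample))
```

Counting "\n" in the text before and after each rule would give line-count deltas to compare with the figures quoted after each ";;".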

Step-2: merging consecutive <ls n="Chr.

</ls>. <ls n="Chr.(.*?)"> -> '. '

Step-3: removing the italic terminations around [Pagexxx]

%} \[Page(.*?)\] {% -> ' [Page\1] '
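Steps 2 and 3 are each a single substitution and might be sketched as below. The function names are hypothetical, the sample strings are made up for illustration, and the dots in the Step-2 pattern are escaped here on the assumption that they are meant as literal periods.

```python
import re

# Step-2: merge a following <ls n="Chr."> element into the preceding one.
# Assumption: the dots in the pattern quoted above are literal periods.
STEP2 = re.compile(r'</ls>\. <ls n="Chr\.(.*?)">')
# Step-3: drop the italic close/open pair around a [Page...] marker.
STEP3 = re.compile(r"%\} \[Page(.*?)\] \{%")


def apply_step2(text):
    return STEP2.sub(". ", text)


def apply_step3(text):
    return STEP3.sub(r" [Page\1] ", text)


if __name__ == "__main__":
    # Made-up samples, for illustration only.
    print(apply_step2('<ls>Chr. 1,2</ls>. <ls n="Chr.">3,4</ls>'))
    print(apply_step3("{%foo%} [Page1001-2] {%bar%}"))
```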

Step-4: Changing the page & column numbers after <pc> and [Page

(a) Insert a '-' after the first (volume) digit.
(b) Change the ending (column) digit -[123] to a letter -[abc] resp.

After these changes, we have the two files differing in about 7000+ lines.
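Step-4 amounts to one regular-expression rewrite of the page reference. A minimal sketch, assuming the old form is a single volume digit followed directly by the page digits and then "-" plus a column digit (as in [Page4013-3] mentioned earlier in the thread); the name convert_pc is hypothetical.

```python
import re

COLUMN_LETTER = {"1": "a", "2": "b", "3": "c"}
# Old form after <pc> or [Page: volume digit, page digits, '-', column digit.
PC_OLD = re.compile(r"(<pc>|\[Page)(\d)(\d+)-([123])")


def convert_pc(text):
    """Rewrite 'vppp-c' page references as 'v-ppp-x' with a column letter."""
    def repl(match):
        prefix, volume, page, column = match.groups()
        return f"{prefix}{volume}-{page}-{COLUMN_LETTER[column]}"
    return PC_OLD.sub(repl, text)


if __name__ == "__main__":
    print(convert_pc("[Page4013-3]"))
    print(convert_pc("<pc>5116-1<k1>yajus"))
```

Text already in the new form is left unchanged, because the pattern requires at least two digits immediately after the prefix.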

temp_pw_ab_17a.zip and pw (AB v2).zip

The majority of changes are--
(a) fetching more 'grouped' HWs in the file
(b) clubbing of ls-entities together
(c) punctuation

funderburkjim commented 7 months ago

@Andhrabharati I have been able to programmatically reproduce your temp_pw_ab_17a.txt from temp_pw_17a.txt. The work is in issue102/step1. Your description of these changes was of great help!
See change_diff_4.txt for 21 additional corrections you made but did not mention.

I'll switch to pwkvn now (#103) before further investigation of your changes in temp_pw_AB_v2.txt

Andhrabharati commented 7 months ago

@funderburkjim

All the 20 "[Pagexxx]" changes mentioned in your addl. corrections file above were "included" in the Step-4 in my notes.

And then there is one mistake (!?) in my file, which you have noted at <L>89441<pc>5-116-a<k1>yajus<k2>ya/jus<e>100

<L>89441<pc>5-116-a<k1>yajus<k2>ya/jus<e>100
440422 old <div n="2">— d〉 <ab>Bez.</ab> {%eines <ab>best.</ab> Spruches%} <ls>NṚS. TĀP. UP.</ls> (in der <ls>Bibl. ind.</ls>) <ls n="Chr.">1,3</ls> (zweimal).
;
440422 new <div n="2">— d〉 <ab>Bez.</ab> {%eines <ab>best.</ab> Spruches%} <ls>NṚS. TĀP. UP.</ls> (in der <ls>Bibl. ind.. 1,3</ls> (zweimal).

;; AB note to change the complex
old <ls>NṚS. TĀP. UP.</ls> (in der <ls>Bibl. ind.</ls>) <ls n="Chr.">1,3</ls>
from
new <ls>NṚS. TĀP. UP.</ls> (in der <ls>Bibl. ind.. 1,3</ls>
to the simpler
new <ls>NṚS. TĀP. UP. 1,3</ls> (in der <ls>Bibl. ind.</ls>)
similar to the 475321 content
<ls>NṚS. UP. 1,3</ls> in der <ls>Bibl. ind.</ls>
[as done in my v.2 file]
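One possible mechanical explanation for the doubled period, offered as an observation and not as a confirmed account of how the v.2 file was made: if the Step-2 pattern is run as a regular expression exactly as written, its unescaped "." also matches ")", and the substitution then produces the "new" text quoted above. The sketch compares that pattern with a variant whose dots are escaped; the names AS_WRITTEN, ESCAPED and merge are hypothetical.

```python
import re

# The 'old' text of line 440422, starting at the first <ls> element.
OLD = ('<ls>NṚS. TĀP. UP.</ls> (in der <ls>Bibl. ind.</ls>) '
       '<ls n="Chr.">1,3</ls> (zweimal).')

# Step-2 pattern as written (dots unescaped) and a literal-dot variant.
AS_WRITTEN = re.compile(r'</ls>. <ls n="Chr.(.*?)">')
ESCAPED = re.compile(r'</ls>\. <ls n="Chr\.(.*?)">')


def merge(pattern, text):
    """Apply the Step-2 replacement '. ' with the given pattern."""
    return pattern.sub(". ", text)


if __name__ == "__main__":
    print(merge(AS_WRITTEN, OLD))
    print(merge(ESCAPED, OLD))
```

With the escaped variant the line is left untouched, so a case like this one would be left for manual handling.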

funderburkjim commented 7 months ago

Resolve the 7000 differences

The work is done in the issue102/step2 directory. The first small step was to remove the ';;' comments in AB version and make corresponding changes in cdsl version, resulting in temp_pw_v1_0.txt and temp_pw_v2_0.txt (v1=cdsl, v2=AB). See diff_AB_v2_v2_0.txt for diff temp_pw_AB_v2.txt temp_pw_v2_0.txt.

change_v2_1.txt (39 changes) documents the further changes to AB version temp_pw_v2_0.txt.

change_v1_1.txt (7252 changes) documents the further changes to cdsl version temp_pw_v1_0.txt.

Respectively applying these changes yields temp_pw_v1_1.txt and temp_pw_v2_1.txt.

diff temp_pw_v1_1.txt temp_pw_v2_1.txt | wc -l
0 diffs. These files are identical, so all differences are resolved.
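Where diff and wc are not at hand (for instance in a Windows PowerShell session), a similar identity check can be sketched in Python. The function name count_differing_lines is hypothetical and UTF-8 encoding is assumed. It compares the files position by position, so a nonzero result will not in general equal the line count of diff output, but a result of 0 means the same thing: the files are identical.

```python
from itertools import zip_longest


def count_differing_lines(path1, path2, encoding="utf-8"):
    """Count positions at which the two files have different lines.

    A line present in only one file (unequal lengths) counts as a difference.
    """
    with open(path1, encoding=encoding) as f1, open(path2, encoding=encoding) as f2:
        return sum(1 for a, b in zip_longest(f1, f2) if a != b)


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 3:
        print(count_differing_lines(sys.argv[1], sys.argv[2]))
```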

This is the version pushed to csl-orig repository at the commit mentioned in above comment.

Other repositories also required some change for xml-validation and proper behavior of the displays. Notably, the new [Page v-ppp-c] format is now used for page links, as requested in a previous comment.

funderburkjim commented 7 months ago

@Andhrabharati I think this issue may now be closed. Agree?

Andhrabharati commented 7 months ago

@funderburkjim

Out of the 32 Misc. changes done in the AB version, I've noticed 5 corrections (at 36207, 36215, 324234, 339882 and 426957)-- corrections.txt

You may correct these in the cdsl text also.

The rest are mostly related to italic marking, to which I deliberately didn't pay much attention earlier (having thought of doing a full text reading once; and I would probably take this up quite soon).

Glad that the CDSL and AB versions are now tallying!!

I have just returned home from a long journey (and too tired), and shall look at the rest of the actions (that you had taken) tomorrow.

funderburkjim commented 7 months ago

@Andhrabharati @maltenth has been working on two types of corrections

Let's defer your further work (including your small 'corrections' file) until this work with Thomas is finished. Thus, I'm closing this issue.

BTW, I am doubtful of the 'print change' suggestions of your corrections file, but we can discuss further in another issue.