sanskrit-lexicon / csl-pywork

A template for creating pywork repository for each dictionary.

PW / PWG xml errors while regeneration #27

Closed drdhaval2785 closed 2 years ago

drdhaval2785 commented 3 years ago

Almost all the records of PW and PWG give errors during regeneration. Maybe something specific to PW / PWG is missing in the dtd file.

Last line of PWG is pasted for reference.

<!-- xml error #101504: L = 122729, hw = hastagrABa-->
datalines = 
{#hastagrABa/#}¦ <lex>m.</lex> so v. a. {#hastagrAha#} 
<div n="1"> 2) 
<ls>ṚV. 10, 18, 8.</ls>
xmlstring=
<H1><h><key1>hastagrABa</key1><key2>hastagrABa/</key2></h><body><s>hastagrABa/</s>  <lex>m.</lex> so v. a. <s>hastagrAha</s>  <div n="1"> 2)  <ls>ṚV. 10, 18, 8.</ls></div></body><tail><L>122729</L><pc>7-1822</pc></tail></H1>
gasyoun commented 3 years ago

Only @funderburkjim can tell.

funderburkjim commented 3 years ago

I had the same problem at Cologne for both PW and PWG.

It turned out to be a problem with the Python version used in redo_xml.sh, and in particular the python running make_xml.py.

Within a script at Cologne, the default python is currently 2.7.5. With this version (but not with the 2.6.6 version previously at Cologne), the ET module of python reported that a generated xml record was somehow invalid, and printed errors.

The change made AT COLOGNE was to replace 'python make_xml.py' with 'python3 make_xml.py'. At Cologne, python3 is version 3.6.8. This solves the problem at Cologne.

This change is made only for execution at Cologne (via a mako variable). In a non-Cologne local installation, the make_xml.sh script cannot reliably be changed in the same way, because we don't know whether 'python3' makes sense locally.
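One way a local script might decide which interpreter to call is to probe the candidates on PATH and keep the first one that reports major version 3. This is only a sketch under that assumption: the helper names python_major and find_python3 are hypothetical and are not part of csl-pywork.

```python
import shutil
import subprocess
import sys


def python_major(executable):
    """Ask the given interpreter for its major version (2 or 3)."""
    out = subprocess.run(
        [executable, "-c", "import sys; print(sys.version_info[0])"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())


def find_python3(candidates=("python3", "python")):
    """Return the path of the first candidate on PATH that is Python 3, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path is None:
            continue
        try:
            if python_major(path) == 3:
                return path
        except (subprocess.CalledProcessError, OSError, ValueError):
            # The candidate exists but could not be run or gave odd output.
            continue
    return None


if __name__ == "__main__":
    print(find_python3() or "no Python 3 interpreter found on PATH")
```

The sketch itself needs Python 3.7 or later, because of the capture_output argument.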

The python on my local machine is version 3.7.0, and 'python make_xml.py' works fine.

To summarize, the problem is solved at Cologne.

Incidentally, one thing I tried at Cologne was to remove the error messages at the ET test in make_xml.py, and to run the program with python2 (= version 2.7.5 at Cologne). It turned out that the resulting pw.xml (or pwg.xml) was IDENTICAL to the one constructed with python3. In other words, the only problem was the erroneous error messages from the ET module in version 2.7.5.
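For readers who have not looked at make_xml.py, the kind of per-record ET test described above can be sketched as follows. This is an illustration only, not the actual code in make_xml.py: the function name unparsed_records is hypothetical, the first sample record is the xmlstring quoted earlier in this issue, and the second is a deliberately broken record made up for the example.

```python
import xml.etree.ElementTree as ET


def unparsed_records(records):
    """Return (index, error message) for each record that ET cannot parse."""
    failures = []
    for i, rec in enumerate(records):
        try:
            ET.fromstring(rec)
        except ET.ParseError as err:
            failures.append((i, str(err)))
    return failures


# The PWG record quoted in this issue (L = 122729).
good = (
    '<H1><h><key1>hastagrABa</key1><key2>hastagrABa/</key2></h>'
    '<body><s>hastagrABa/</s>  <lex>m.</lex> so v. a. <s>hastagrAha</s>  '
    '<div n="1"> 2)  <ls>ṚV. 10, 18, 8.</ls></div></body>'
    '<tail><L>122729</L><pc>7-1822</pc></tail></H1>'
)
# A made-up record with a mismatched tag, to show a real failure.
bad = "<H1><body><s>unclosed</body></H1>"

if __name__ == "__main__":
    failures = unparsed_records([good, bad])
    if failures:
        print("WARNING: %d records not parsed by ET" % len(failures))
    else:
        print("All records parsed by ET")
```

Under Python 3 the quoted record parses cleanly and only the made-up record is reported.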

How do the above comments relate to your problem? Any suggestions?

funderburkjim commented 3 years ago

Made a change in the make_xml.py template to avoid printing out the ET errors.

Then regenerated pw at Cologne in the normal (python3) way. There is one informational message: All records parsed by ET. The file created by make_xml.py is scans/PWScan/2020/pywork/pw0.xml.

Then, did a regeneration using python2 at Cologne, with a warning message:

python2 make_xml.py ../orig/pw.txt pwhw.txt pw0_py2.xml
WARNING: make_xml.py: 79804 records records not parsed by ET
diff pw0.xml pw0_py2.xml

The diff showed that the python2 output file is identical to the python3 output file.

So all those 79804 ET parsing errors were bogus!
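Where the diff command is not at hand, the same identity check could be done from Python. A minimal sketch, assuming both output files are in the current directory; the function name same_output is hypothetical.

```python
import filecmp


def same_output(path_a, path_b):
    """Return True if the two files have identical contents."""
    # shallow=False forces a content comparison instead of
    # trusting matching size and modification time.
    return filecmp.cmp(path_a, path_b, shallow=False)


if __name__ == "__main__":
    # File names taken from the commands above.
    print(same_output("pw0.xml", "pw0_py2.xml"))
```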

If there really is some XML error, then this will be detected at Cologne by the xmllint command which checks that pw.xml conforms to pw.dtd.
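As a rough sketch of that check, xmllint can be called from Python when it is on PATH. The exact invocation used at Cologne is not shown in this thread; --noout and --dtdvalid are standard xmllint options, and the wrapper name validate_with_xmllint is hypothetical.

```python
import shutil
import subprocess


def validate_with_xmllint(xml_path, dtd_path, tool="xmllint"):
    """Validate xml_path against dtd_path using xmllint.

    Returns True if valid, False if xmllint reports a problem,
    and None if the tool is not installed.
    """
    exe = shutil.which(tool)
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--noout", "--dtdvalid", dtd_path, xml_path],
        capture_output=True, text=True,
    )
    # xmllint exits with a non-zero status on validation or parse errors.
    return result.returncode == 0


if __name__ == "__main__":
    print(validate_with_xmllint("pw.xml", "pw.dtd"))
```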

Locally, xmllint may not be available (e.g. on Windows). I have a simple workaround program (xmlvalidate.py) in my local 'cologne' directory that I use to check the validity of xml locally when I have any doubt. In cologne/pw/pywork I would run 'python ../../xmlvalidate.py pw.xml pw.dtd' and expect to get 'OK' if validation is confirmed, or some kind of error message otherwise.

I suggest that Dhaval pull the latest csl-pywork and see if that solves his local installation problem. Incidentally, what is the python version in the local installation?

gasyoun commented 3 years ago

"erroneous error message from ET module in version 2.7.5."

So funny is this whole Python 2 vs. Python 3 thing. Had fun with it outside Cologne scripts lately as well.

"So all those 79804 ET parsing errors were bogus!"

))

drdhaval2785 commented 2 years ago

Local installation worked without errors. Closing the issue.