sanskrit-lexicon / csl-pywork

A template for creating the pywork repository for each dictionary.

xml regeneration error (python2 / python3) #26

Closed drdhaval2785 closed 2 years ago

drdhaval2785 commented 3 years ago

Today, when I was regenerating my local version from the csl-orig and csl-pywork repositories, I ran into errors in quite a few dictionaries. Attaching a log for the AP90 dictionary.

dhaval@dhaval-Aspire-5750:/var/www/html/cologne/ap90/pywork$ sh 
construct ap90.xml...
xml error: n=5,m line=
<H1><h><key1>a</key1><key2>a</key2></h><body><s>a</s>  The first letter of the Nāgarī <lb/>Alphabet. <s>--aH</s> [<s>avati, atati sAta-</s> <lb/><s>tvena tizWatIti vA; av-at vA, qa</s> Tv.] <b>1</b> N. <lb/>of Viṣṇu, the first of the three <lb/>sounds constituting the sacred <lb/>syllable <s>om; akAro vizRuruddizwa ukAra-</s> <lb/><s>stu maheSvaraH . makArastu smfto brahmA praRavastu</s> <lb/><s>trayAtmakaH ..;</s> for more explanation of <lb/>the three syllables <s>a, u, m</s> see <s>om</s>. <b>--2</b> <lb/>N. of Śiva, Brahmā, Vāyu, or Vaiśvā- <lb/>nara. <i>--ind.</i> <b>1</b> A prefix corresponding <lb/>to Latin <i>in,</i> Eng. <i>in</i> or <i>un,</i> Gr. <i>a</i> or <lb/><i>an,</i> and joined to nouns, adjectives, <lb/>indeclinables (or even to verbs) as <lb/>a substitute for the negative parti- <lb/>cle <s>naY,</s> and changed to <s>an</s> before <lb/>vowels except in the word <s>a-fRin</s>. <lb/>The senses of <s>na</s> usually enumerat- <lb/>ed are six- (<i>a</i>) <s>sAdfSya</s> ‘likeness’ or <lb/>‘resemblance’; <s>abrAhmaRaH</s> one like a <lb/>Brāhmaṇa (wearing the sacred thread <lb/>&amp;c.), but not a Brāhmaṇa, but a <lb/>Kṣatriya, or Vaiśya; <s>anikzu:</s> a reed <lb/>appearing like <s>ikzu,</s> but not a true <s>ikzu</s>. <lb/>(<i>b</i>) <s>aBAva</s> ‘absence’, ‘negation’, ‘want’, <lb/>‘privation’; <s>ajYAnaM</s> absence of know- <lb/>ledge, ignorance; <s>akroDaH, anaMgaH, akaMwakaH,</s> <lb/><s>aGawaH</s> &amp;c. (<i>c</i>) <s>Beda</s> ‘difference’ or ‘dis- <lb/>tinction’; <s>apawaH</s> not a cloth, some- <lb/>thing different from, or other than, <lb/>a cloth. (<i>d</i>) <s>alpatA</s> ‘smallness’, <lb/>‘diminution’, used as a diminutive <lb/>particle; <s>anudarA</s> having a slender <lb/>waist (<s>kfSodarI</s> or <s>tanumaDyamA</s>). (<i>e</i>) <lb/><s>aprASastya</s> ‘badness,’ ‘unfitness,’ <lb/>having a depreciative sense; <s>akAlaH</s> <lb/>wrong or improper time; <s>akAryaM</s> not <lb/>fit to be done, improper, unworthy, <lb/>bad act. 
(<i>f</i>) <s>viroDa</s> ‘opposition’, ‘con- <lb/>trariety’; <s>anItiH</s> the opposite of <lb/>morality, immorality; <s>asita</s> not  [Page0001-b+ 45] <lb/>white, black; <s>asura</s> not a god, a <lb/>demon &amp;c. These senses are put to- <lb/>gether in the following verse:-- <lb/><s>tatsAdfSyamaBAvaSca tadanyatvaM tadalpatA . aprA-</s> <lb/><s>SastyaM viroDaSca naYarTAH zaw prakIrtitAH ..</s> See <lb/><s>na</s> also. With verbal derivatives, such <lb/>as gerunds, infinitives, participles, it <lb/>has usually the sense of ‘not’; <s>adagDvA</s> <lb/>not having burnt: <s>apaSyan</s> not seeing; <lb/>so <s>asakft</s> not once; <s>amfzA, akasmAt</s> <lb/>&amp;c. Sometimes <s>a</s> does not affect the <lb/>sense of the second member; <s>a-paScima</s> <lb/>that which has no last, <i>i. e.</i> last; <lb/><s>anuttama</s> having no superior, unsur- <lb/>passed, most excellent; for examples <lb/>see the words. <b>--2</b> An interjection of <lb/>(<i>a</i>) Pity (<i>ah!</i>) <s>a avadyaM</s> P. I. 1. 14 <lb/>Sk. (<i>b</i>) Reproach, censure (fie, <lb/>shame); <s>apacasi tvaM jAlma</s> P. VI. 3. 73 <lb/>Vārt. See <s>akaraRi, ajIvani</s> also. <lb/>(<i>c</i>) Used in addressing; <s>a anaMta</s>. <lb/>(<i>d</i>) It is also used as a particle of <lb/>prohibition. <b>--3</b> The augment pre- <lb/>fixed to the root in the formation of <lb/>the Imperfect, Aorist and Condi- <lb/>tional Tenses. <P/><i>N-B.</i> --The application of this priva- <lb/>tive prefix is practically unlimited; to <lb/>give every possible case would almost <lb/>amount to a dictionary itself. 
No at- <lb/>tempt will, therefore, be made to give <lb/><i>every possible</i> combination of this prefix <lb/>with a following word; only such words <lb/>as require a special explanation, or such <lb/>as most frequently occur in the liter- <lb/>ature and enter into compounds, with <lb/>other words, will be given; others will <lb/>be found self-explaining when the English <lb/>‘in,’ ‘un,’ or ‘not,’ is substituted for <s>a</s> <lb/>or <s>an</s> before the meaning of the second <lb/>word, or the sense may be expressed by <lb/>‘less,’ ‘free from,’ ‘devoid or destitute <lb/>of’ &amp;c; <s>akaTya</s> unspeakable; <s>adarpa</s> with- <lb/>out pride, or freedom from pride; <s>apraga-</s> <lb/><s>lBa</s> not bold; <s>aBaga</s> unfortunate; <s>avitta</s>  [Page0001-c+ 42] <lb/>destitute of wealth &amp;c. &amp;c. In many <lb/>cases such compounds will be found ex- <lb/>plained under the second member. Most <lb/>compounds beginning with <s>a</s> or <s>an</s> are <lb/>either Tatpuruṣa or Bahuvrīhi (to be <lb/>determined by the sense) and should be <lb/>so dissolved.</body><tail><L>1</L><pc>0001-a</pc></tail></H1>

xmllint on ap90.xml...
ap90.xml:6: parser error : Premature end of data in tag ap90 line 4

remaking input.txt...
5 lines from ../ap90.xml
remaking sqlite table...
Error: near line 1: no such table: ap90
0       key     VARCHAR(100)    1               0
1       lnum    DECIMAL(10,2)   0               0
2       data    TEXT    1               0
moving ap90.sqlite to web/sqlite/
1 records read from ../ap90.xml
0 records written to query_dump.txt

drdhaval2785 commented 3 years ago

This issue seems to run deeper. Almost all of my XMLs had only 3-4 entries and abruptly ended.

gasyoun commented 3 years ago

only 3-4 entries and abruptly ended.

@funderburkjim ever seen before?

funderburkjim commented 3 years ago

This does not happen for me.

Terminal session
Jim@Jim-Dell MINGW64 /c/xampp/htdocs/cologne/csl-pywork/v02 (master)
$ sh ap90  ../../ap90
updating for websanlexicon for dictionary ap90 to /c/xampp/htdocs/cologne/ap90
regenerate ap90 headwords
construct xxxhw.txt
437 extra headwords from hwextra/ap90_hwextra.txt
267150 lines read from ../orig/ap90.txt
31751 entries found
32188 lines written to ap90hw.txt
construct xxxhw2.txt
construct xxxhw0.txt
regenerate ap90.xml and postxml files
construct ap90.xml...
xmllint on ap90.xml... line 5: xmllint: command not found
remaking ap90.sqlite from ../ap90.xml with python... dictionary code= ap90
create_index takes 0.22 seconds
32193 lines read from ../ap90.xml
32188 rows written to ap90.sqlite
5.51 seconds for batch size 10000
moving ap90.sqlite to web/sqlite/
32188 records read from ../ap90.xml
31984 records written to query_dump.txt
regenerate downloads
BEGIN: downloads/
remove old
copying files from ../pywork to txt/
create new
BEGIN: downloads/
remove old
copying files from ../pywork to xml/
create new
BEGIN: downloads/
remove old

My python version is 3.7.0 (windows)

The repositories: csl-pywork, csl-websanlexicon, csl-orig are all up to date with Github.

drdhaval2785 commented 3 years ago

I did a bit of analysis. It is due to python2 and python3 differences. I ran the following command with both python2 and python3. python3 did not give any error; python2 gave the error I mentioned earlier.

dhaval@dhaval-Aspire-5750:/var/www/html/cologne/acc/pywork$ python --version
Python 2.7.17
dhaval@dhaval-Aspire-5750:/var/www/html/cologne/acc/pywork$ python ../orig/acc.txt acchw.txt acc.xml # > redoxml_log.txt
xml error: n=8,m line=
<H1><h><key1>aMSumadBedasaMgraha</key1><key2>aMSumadBedasaMgraha</key2></h><body><s>aMSumadBedasaMgraha</s>  vedānta, ascribed to Kaśyapa. Oppert 5875.</body><tail><L>4</L><pc>1-001,1</pc></tail></H1>

dhaval@dhaval-Aspire-5750:/var/www/html/cologne/acc/pywork$ python3 --version
Python 3.6.9
dhaval@dhaval-Aspire-5750:/var/www/html/cologne/acc/pywork$ python3 ../orig/acc.txt acchw.txt acc.xml # > redoxml_log.txt

drdhaval2785 commented 3 years ago

The reason seems to be that python2 did not have native unicode support. So, till the text encountered a diacritical mark Kaśyapa, everything went fine. But then it broke. In python3, unicode is natively supported, so no error is reported.
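
The suspected mechanism can be illustrated with a minimal sketch. It assumes, without confirmation from this thread, that the entry was treated as ASCII at some point during regeneration. `check_entry` and `ascii_safe` are hypothetical helpers, not functions from csl-pywork; the sample line is the ACC entry from the log above.

```python
import xml.etree.ElementTree as ET

# Entry copied from the error log above; it contains non-ASCII letters.
ENTRY = (u'<H1><h><key1>aMSumadBedasaMgraha</key1>'
         u'<key2>aMSumadBedasaMgraha</key2></h>'
         u'<body><s>aMSumadBedasaMgraha</s>  vedānta, ascribed to Kaśyapa. '
         u'Oppert 5875.</body><tail><L>4</L><pc>1-001,1</pc></tail></H1>')


def check_entry(line):
    """Hypothetical helper: True if `line` parses as well-formed XML."""
    try:
        # Explicit UTF-8 bytes avoid any implicit ASCII conversion.
        ET.fromstring(line.encode('utf-8'))
    except ET.ParseError:
        return False
    return True


def ascii_safe(line):
    """Hypothetical helper: True if `line` can be encoded as plain ASCII."""
    try:
        line.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True


if __name__ == '__main__':
    print('well-formed XML:', check_entry(ENTRY))
    print('ASCII only:', ascii_safe(ENTRY))
```

If this guess is right, the entry itself is fine and only the way its text is encoded on the way to the XML parser differs between setups.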

drdhaval2785 commented 3 years ago

Suggested solution

Explicitly call python3 in the scripts. Python2 has reached end of life. So we should not bother much about python2 support.
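
One way to make the requirement explicit inside a Python script is a version guard at the top. This is only a sketch: `require_python3` is a hypothetical name, and nothing in this thread says the csl-pywork scripts contain such a check.

```python
import sys


def require_python3(version_info=None):
    """Hypothetical guard: stop with a clear message unless run under Python 3."""
    if version_info is None:
        version_info = sys.version_info
    parts = tuple(version_info)
    if parts[:1] < (3,):
        found = '.'.join(str(part) for part in parts[:3])
        raise SystemExit('Python 3 is required; found Python %s' % found)


if __name__ == '__main__':
    require_python3()
    print('running under Python %d.%d' % sys.version_info[:2])
```

With a guard like this, someone running the script under python2 would see a one-line message instead of an xml error further down the pipeline.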

gasyoun commented 3 years ago

python2 did not have native unicode support. So, till the text encountered a diacritical mark Kaśyapa, everything went fine. But then it broke.

Oh, simple.

Python2 has reached end of life. So we should not bother much about python2 support.

But we can mention that it will not work. So nobody else has the trouble you had.

funderburkjim commented 3 years ago

test 1, at Cologne

At Cologne, within scripts, 'python' refers to python version 2.7.5. Consider, at Cologne, this script (in ACCScan/2020/pywork):

python --version
python ../orig/acc.txt acchw.txt tempacc.xml

Then 'sh' works fine: output is just

> sh
Python 2.7.5

(There is no output when tempacc.xml is created.) Also, the files acc.xml and tempacc.xml are identical, as expected.

NOTE: In my ~/.bashrc at Cologne, there is an alias:

> grep 'python' ~/.bashrc
alias python='/usr/bin/python3'

Thus, at the terminal command-line, 'python --version' shows Python 3.6.8. Not sure whether this alias is something I added. At any rate, aliases are not honored inside a script run with 'sh' as above, so our make scripts at Cologne are running under Python 2.7.5.
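
Since the interpreter seen at the terminal and the one used inside a script can differ, a script can report which one is actually running. A small sketch follows; `describe_interpreter` is a hypothetical helper, not part of our scripts.

```python
import sys


def describe_interpreter():
    """Hypothetical helper: report the version and path of the running interpreter."""
    version = '.'.join(str(part) for part in sys.version_info[:3])
    return 'Python %s at %s' % (version, sys.executable)


if __name__ == '__main__':
    print(describe_interpreter())
```

Printing this line at the start of a regeneration log would show at a glance whether an alias or the system default was picked up.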

test 2, local, 2.7.10

On local Windows PC, my default 'python' is version 3.7.0, but I also have a python 2.7.10 version. Running the above with this 2.7.10 python also gives no error on local machine.

test 3: local 2.7.17

I downloaded 2.7.17 and installed it in C:\python2717. Then I ran /c/Python2717/python ../orig/acc.txt acchw.txt tempacc.xml. Also no problem: tempacc.xml is the same as acc.xml.

Dhaval's code problem looks to be something very specific to his python2.7.17.

I noticed a 2.7.18 version of Python.

Don't yet see a need to adjust our scripts.

drdhaval2785 commented 2 years ago

The issue did not come up with my latest python settings. Regeneration was also quite fast, even though my computer has the same configuration as before. It seems to have something to do with the batches of 10000 words being added, or maybe with the python update. I have not run profilers, but the speed improvement in local regeneration is significant.
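
For reference, batched inserts into sqlite can be sketched as below. This is an illustration only: the actual sqlite script is not shown in this thread, so `make_table` and `insert_in_batches` are hypothetical names. The column layout follows the table description printed in the AP90 log above, and the default batch size follows the "batch size 10000" line in Jim's log.

```python
import sqlite3


def make_table(conn):
    """Create a table with the three columns shown in the AP90 log."""
    conn.execute(
        'CREATE TABLE ap90 ('
        'key VARCHAR(100) NOT NULL, '
        'lnum DECIMAL(10,2), '
        'data TEXT NOT NULL)'
    )


def insert_in_batches(conn, rows, batch_size=10000):
    """Insert (key, lnum, data) rows, committing once per batch.

    Returns the number of rows written.
    """
    sql = 'INSERT INTO ap90 VALUES (?, ?, ?)'
    batch = []
    written = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(sql, batch)
            conn.commit()
            written += len(batch)
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        conn.executemany(sql, batch)
        conn.commit()
        written += len(batch)
    return written


if __name__ == '__main__':
    connection = sqlite3.connect(':memory:')
    make_table(connection)
    sample = [('k%d' % i, float(i), '<H1/>') for i in range(25)]
    print(insert_in_batches(connection, sample, batch_size=10), 'rows written')
```

Committing once per batch rather than once per row is usually much faster in sqlite, which would be consistent with the speed-up described above, though only profiling would confirm it.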