Closed drdhaval2785 closed 2 years ago
This issue seems to run deeper. Almost all of my XMLs had only 3-4 entries and abruptly ended.
@funderburkjim ever seen before?
This does not happen for me.
Jim@Jim-Dell MINGW64 /c/xampp/htdocs/cologne/csl-pywork/v02 (master)
$ sh generate_dict.sh ap90 ../../ap90
updating for websanlexicon for dictionary ap90 to /c/xampp/htdocs/cologne/ap90
regenerate ap90 headwords
BEGIN redo_hw.sh
construct xxxhw.txt
437 extra headwords from hwextra/ap90_hwextra.txt
267150 lines read from ../orig/ap90.txt
31751 entries found
32188 lines written to ap90hw.txt
construct xxxhw2.txt
construct xxxhw0.txt
DONE redo_hw.sh
regenerate ap90.xml and postxml files
BEGIN redo_xml.sh
construct ap90.xml...
xmllint on ap90.xml...
redo_xml.sh: line 5: xmllint: command not found
ap90.sqlite...
remaking ap90.sqlite from ../ap90.xml with python...
sqlite.py: dictionary code= ap90
create_index takes 0.22 seconds
32193 lines read from ../ap90.xml
32188 rows written to ap90.sqlite
5.51 seconds for batch size 10000
moving ap90.sqlite to web/sqlite/
32188 records read from ../ap90.xml
31984 records written to query_dump.txt
END redo_xml.sh
regenerate downloads
BEGIN: downloads/redo_txt.sh
remove old ap90txt.zip
copying files from ../pywork to txt/
create new ap90txt.zip
BEGIN: downloads/redo_xml.sh
remove old ap90xml.zip
copying files from ../pywork to xml/
create new ap90xml.zip
BEGIN: downloads/redo_web1.sh
remove old ap90web1.zip
My python version is 3.7.0 (windows)
The repositories: csl-pywork, csl-websanlexicon, csl-orig are all up to date with Github.
I did a bit of analysis. It is due to python2 and python3 differences.
I tested the following code from redo_xml.sh with python2 and python3.
python3 did not give any error. python2 gave the error I mentioned earlier.
dhaval@dhaval-Aspire-5750:/var/www/html/cologne/acc/pywork$ python --version
Python 2.7.17
dhaval@dhaval-Aspire-5750:/var/www/html/cologne/acc/pywork$ python make_xml.py ../orig/acc.txt acchw.txt acc.xml # > redoxml_log.txt
xml error: n=8,m line=
<H1><h><key1>aMSumadBedasaMgraha</key1><key2>aMSumadBedasaMgraha</key2></h><body><s>aMSumadBedasaMgraha</s> vedānta, ascribed to Kaśyapa. Oppert 5875.</body><tail><L>4</L><pc>1-001,1</pc></tail></H1>
dhaval@dhaval-Aspire-5750:/var/www/html/cologne/acc/pywork$ python3 --version
Python 3.6.9
dhaval@dhaval-Aspire-5750:/var/www/html/cologne/acc/pywork$ python3 make_xml.py ../orig/acc.txt acchw.txt acc.xml # > redoxml_log.txt
dhaval@dhaval-Aspire-5750:/var/www/html/cologne/acc/pywork$
The reason seems to be that python2 did not have native unicode support. So, until the text encountered a diacritical mark (Kaśyapa), everything went fine. But then it broke.
In python3, unicode is natively supported. Therefore, there is no error reported.
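The kind of failure being hypothesized here can be illustrated with a small sketch. This is not taken from make_xml.py and is not a confirmed diagnosis of it; it only shows, under the assumption that the text is handled as UTF-8 bytes and decoded with the ASCII codec (which is what Python 2 does implicitly when byte strings and unicode strings are mixed), that processing succeeds up to the first non-ASCII character and then fails. The sample text is from the entry in the error message above; the variable names are hypothetical.

```python
# -*- coding: utf-8 -*-
# Sketch only: mimics an implicit ASCII decode of UTF-8 bytes.
# It does not reproduce the actual code path in make_xml.py.

text = u"vedānta, ascribed to Kaśyapa."

# The bytes as they would be stored in a UTF-8 encoded file.
raw = text.encode("utf-8")

try:
    raw.decode("ascii")
    error_position = None  # no non-ASCII bytes were found
except UnicodeDecodeError as err:
    # err.start is the index of the first byte the ASCII codec rejects.
    error_position = err.start

# Everything before the first non-ASCII character decodes without trouble.
ascii_prefix = raw[:error_position].decode("ascii")
print("ASCII decode failed at byte %s, after %r" % (error_position, ascii_prefix))

# Decoding with the correct codec restores the full text.
assert raw.decode("utf-8") == text
```

In this sample the first character that trips the ASCII codec is the ā in vedānta, a little before Kaśyapa.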
Explicitly mention python3 in the redo_hw.sh, redo_xml.sh, and redo_postxml.sh scripts.
Python2 has reached end of life. So we should not bother much about python2 support.
python2 did not have native unicode support. So, till the text encountered a diacritical mark Kaśyapa, everything went fine. But then it broke.
Oh, simple.
Python2 has reached end of life. So we should not bother much about python2 support.
But we can mention that it will not work. So nobody else has the trouble you had.
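One possible way to "mention that it will not work" is a version check at the top of a script. The following is a minimal sketch under that assumption; the function name check_python_version and the message wording are hypothetical, and nothing in the thread says such a check exists in csl-pywork.

```python
import sys


def check_python_version(version_info=None):
    """Return a warning message when run under Python 2, else None.

    version_info is any (major, minor, ...) sequence; it defaults to the
    running interpreter's sys.version_info.
    """
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    if major < 3:
        return ("This script is expected to run under python3; "
                "found python %d.%d" % (major, minor))
    return None


# Hypothetical usage at the top of a script: stop early with a clear message.
message = check_python_version()
if message:
    sys.exit(message)
```

Passing a message string to sys.exit prints it to stderr and exits with status 1, so a calling shell script would see the failure.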
At Cologne, within scripts, 'python' refers to python version 2.7.5. Consider, at Cologne, this temp.sh script (in ACCScan/2020/pywork):
python --version
python make_xml.py ../orig/acc.txt acchw.txt tempacc.xml
Then 'sh temp.sh' works fine: output is just
> sh temp.sh
Python 2.7.5
(There is no output when tempacc.xml is created.) Also, the files acc.xml and tempacc.xml are identical, as expected.
NOTE: In my ~/.bashrc at Cologne, there is an alias:
> grep 'python' ~/.bashrc
alias python='/usr/bin/python3'
Thus, at the terminal command-line, 'python --version' shows Python 3.6.8. Not sure whether this alias is something I added. At any rate, aliases are not honored inside a script run as in 'sh temp.sh' above, so our make scripts at Cologne are running under Python 2.7.5.
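Because the interactive shell and a script can resolve 'python' differently, it may help to have the Python program itself report which interpreter is running it. A minimal sketch, assuming one is free to add a line of logging; describe_interpreter is a hypothetical helper, not part of the existing scripts.

```python
import platform
import sys


def describe_interpreter():
    """Return a one-line description of the interpreter running this code."""
    return "Python %s at %s" % (platform.python_version(), sys.executable)


# Printing this at start-up records the interpreter actually used,
# regardless of shell aliases or PATH differences.
banner = describe_interpreter()
print(banner)
```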
On my local Windows PC, the default 'python' is version 3.7.0, but I also have a python 2.7.10 version. Running the above with this 2.7.10 python also gives no error on the local machine.
I downloaded 2.7.17 from https://www.python.org/downloads/windows/ and installed it in C:\python2717.
Then ran
/c/Python2717/python make_xml.py ../orig/acc.txt acchw.txt tempacc.xml
Also, no problem. tempacc.xml same as acc.xml.
Dhaval's problem looks to be something very specific to his python 2.7.17 setup.
I noticed a 2.7.18 version of Python.
Don't yet see a need to adjust our scripts.
The issue did not come up with my latest python settings. Regeneration was also quite fast, with my computer in the same configuration as before. It seems to have something to do with the batches of 10000 words being added, or maybe the python update. I have not run profilers, but the speed improvement in local regeneration is significant.
Today, when I was regenerating my local version from the csl-orig and csl-pywork repositories, I ran into errors in quite a few dictionaries. Attaching a log of redo_xml.sh for the AP90 dictionary.