funderburkjim opened this issue 4 years ago
While similar in some ways to the relation between the make_xml.py template (see #5) and the individual versions, this relation is somewhat different.
In the case of make_xml.py template, the template is used to generate, for each dictionary xxx, a version of make_xml.py for that dictionary that is functionally the same as the prior distinct version; namely the generated and distinct versions create the same xxx.xml file.
The xxx.dtd generated by one.dtd is also functionally similar to the previous distinct xxx.dtd, in that the xml file xxx.xml is judged valid by both.
However, the xxx.dtd generated by one.dtd is quite different in form from the previous distinct xxx.dtd. The way the one.dtd template was developed is described in readme_dtd.txt. In brief, one.dtd started out as a copy of the previous distinct acc.dtd; then one.dtd was adjusted, one dictionary at a time, to accommodate each of the other dictionaries.
For dictionary xxx, the xml root of xxx.xml is xxx. In other words, the xml structure of xxx.xml is
<xxx>
<!-- many other elements used in the xml form of the dictionary entries -->
</xxx>
So, at least, the differences in root elements dictate that one.dtd must use xxx as a template variable.
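The template-variable mechanism can be pictured with a minimal Python sketch; the placeholder spelling `__xxx__` below is hypothetical, not the actual convention used in one.dtd.

```python
def generate_dtd(template_text, dict_code):
    """Substitute the dictionary code (e.g. 'mw', 'ap') for the
    placeholder, so the root element declared in the generated
    xxx.dtd matches the root element of xxx.xml."""
    return template_text.replace("__xxx__", dict_code)

# A toy template with a hypothetical placeholder spelling:
template = '<!ELEMENT __xxx__ (H1)*>\n<!ELEMENT H1 (#PCDATA)>\n'
print(generate_dtd(template, "mw"))
# first line becomes: <!ELEMENT mw (H1)*>
```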
Otherwise, there are only two places where template variables are used.
<div n="?">...
; that is, the div element has an
n attribute with value ?. One way attribute values can be specified in a dtd is by an
a list of possible values. In all other dictionary dtds, the possible values of the n attribute of div
element is given by such a list. But, according to the very strict rules of dtd formation, there are
restrictions regarding the character set which is allowed to be used in the specification of a possible
attribute value in such a list; and the ?
character is not allowed.
Thus, in ap.dtd, we must specify the n attribute of div by the more general CDATA specification, which does validate the <div n="?"> usage. For other dictionaries, the only child type of the root element is H1.
But in the case of mw, the children of the root element are of 20 types, which can be described by the regular expression H[1-4][ABCE]?. Currently, the one.dtd template generates different values for the children of the root.

We remove the template distinction for the AP dictionary as follows.
<div n="?">
form is introduced by make_xml.py. We can change this to
<div n="Q">
[Q
is not used elsewhere as a value of 'n' for 'div'; and Q for Question].<div n="?">
in ap dictionary ; we see (in function sthndl_div for 'ap') that any value of 'n'
other than '2' or '3' is simply a line break; so no change required in basicdisplay.php.With all the dtds represented in one.dtd, we can now examine one.dtd with an eye towards simplification.
<div n="lb">
. It would be simpler to choose one or the other as standard, and
then change the non-conforming dictionaries. This would probably involve selective changes
to make_xml.py in csl-pywork/v02, and basicdisplay.php (in both csl-websanlexicon/v02 and
apidev).

In investigating such simplifications as above, some additional software tools will probably be needed. One that comes to mind, and that is already written, is:
- check_xml_tags.py (currently exists in MWScan/2014/pywork/). This program reads a text file and lists the <...> tags it contains. This is useful for determining the tags, attributes, and attribute values actually occurring in a particular xxx.xml.
- Write a bash shell script to run check_xml_tags on all dictionaries, and examine the results for the <lang> tag. Use the results as a guide to developing changes to various xxx.txt digitizations. Finally, when all dictionaries are changed, modify one.dtd to remove the now unused attribute values of the <lang> n attribute.
@funderburkjim

> But in case of mw, the children of the root element are of 20 types which can be understood by the regular expression H[1-4][ABCE]?.

What tool would you use to count how many of each of the 20 are there? Can we use something like https://github.com/teeshop/rexgen ?

[root@localhost rexgen]# rexgen H[1-4][ABCE] | wc -l
16
[root@localhost rexgen]# rexgen H[1-4][ABCE]
H1A H2A H3A H4A H1B H2B H3B H4B H1C H2C H3C H4C H1E H2E H3E H4E
> Some additional tools needed.
> - check_xml_tags.py (currently exists in MWScan/2014/pywork/) This program reads a text file

Where in https://github.com/sanskrit-lexicon can I find that MWScan/2014/pywork/? None of the CSL* repos have check_xml_tags.py. Neither csl-orig nor csl-pywork has 2014 (2020 only).
Please advise, I'd like to work on this bash tool. Thank you!
I've added a v02/utilities/ folder to this repository and put check_xml_tags.py there. It is an analytical tool, not used in the dictionary generation.
Thank you! I see it there.
There are often occasions where I want to do some kind of analysis; an example might be to try check_xml_tags.py. BUT I don't want to add material to what is tracked by git. The .gitignore has a 'temp*' line in it. Thus I can add a 'tempxyz' directory at any convenient place in the local copy of csl-pywork, and put anything in there.
We might benefit from another branch for this repo. One could switch between branches, using either the default one for the generic dictionary use, or the other branch for some analytics use, if that's convenient for the team.
> What tool would you use to count how many of each 20 are there?

A one-line variation of check_xml_tags.py does the trick. Change line 10 to:
tags = re.findall(r'<H.*?>',line)
Call the new program, for example, v02/utilities/temp.py, and run it with
python temp.py ../../../mw/pywork/mw.xml temp.txt
Then temp.txt contains the list of 20, with counts. For example: 009468 <H1A>
> We might benefit from another branch for this repo
My understanding of git does not yet extend to how to make use of branches. If you have something specific in mind, go ahead and give it a try. Let's take it in baby steps until we all understand how to make use of branches. If you do this, be careful as to size of files added to the repository. Currently, the repository tracks just fairly small program files.
> Currently, the repository tracks just fairly small program files.

Sure. I think we can benefit from Yevgeniy's experience.
I am from the GitLab world, but GitHub should have it also, as that's part of regular Git functionality. Git has the -b option (git checkout -b) for creating a branch.
> A one-line variation of check_xml_tags.py does the trick. Change line 10 to:

That's probably the safest way, in case the "rexgen" tool has differences in the regex engine it uses, as some metasymbols might be interpreted slightly differently depending on the parser (there's an entire book on those regex-engine subtleties on Safari; apologies for the sidetrack).
I hope that the regex used in the DTD is parsed the same way python parses it.
> - Write a bash shell to run check_xml_tags on all dictionaries,

This is what I see parsing all the dictionaries:
for i in $(ls ../../../ | egrep -v csl) ; do python check_xml_tags.py ../../../${i}/pywork/${i}.xml ${i}.txt ; done
[root@localhost utilities]# for i in $(ls *.txt); do grep '<lang' $i | awk -F'<' '{print $2}'; done | sort | uniq
lang n="arabic">
lang n="greek">
lang n="Greek">
lang n="meter">
lang n="Old-Church-Slavonic">
lang n="oldhebrew">
lang n="russian">
lang n="Russian">
lang n="slavic">
lang script="Arabic" n="Arabic">
lang script="Arabic" n="Hindustani">
lang script="Arabic" n="Persian">
lang script="Arabic" n="Turkish">
We need to unify Greek/greek (* vs pwg capitalized) and Russian/russian (pw vs pwg capitalized); besides that, everything else seems unique. It's simpler to make those two cases lowercase in pwg, rather than modifying the several dictionaries that use lowercase greek.
[root@localhost utilities]# grep 'lang n="russian"' *
pw.txt:000001 <lang n="russian">
[root@localhost utilities]# grep 'lang n="Russian"' *
pwg.txt:000023 <lang n="Russian">
[root@localhost utilities]# grep 'lang n="greek"' *
ben.txt:001490 <lang n="greek">
bhs.txt:000003 <lang n="greek">
bop.txt:001701 <lang n="greek">
bur.txt:000677 <lang n="greek">
cae.txt:000003 <lang n="greek">
gra.txt:000229 <lang n="greek">
gst.txt:000013 <lang n="greek">
inm.txt:010778 <lang n="greek">
md.txt:000008 <lang n="greek">
mw72.txt:001665 <lang n="greek">
mw.txt:001157 <lang n="greek">
pwg.txt:000397 <lang n="greek">
pw.txt:000186 <lang n="greek">
snp.txt:000001 <lang n="greek">
stc.txt:000001 <lang n="greek">
vei.txt:000147 <lang n="greek">
wil.txt:000023 <lang n="greek">
[root@localhost utilities]# grep 'lang n="Greek"' *
pwg.txt:000001 <lang n="Greek">
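The unification check can also be prototyped in Python: group the observed lang n="..." values case-insensitively and report any that differ only in capitalization. The helper name and sample values below are illustrative, modeled on the grep results above:

```python
from collections import defaultdict

def find_case_variants(values):
    """Group values case-insensitively and return only the groups
    that contain more than one distinct capitalization."""
    groups = defaultdict(set)
    for v in values:
        groups[v.lower()].add(v)
    return {k: sorted(g) for k, g in groups.items() if len(g) > 1}

# Sample values modeled on the grep results above (illustrative):
seen = ["greek", "Greek", "russian", "Russian", "arabic", "meter"]
print(find_case_variants(seen))
# {'greek': ['Greek', 'greek'], 'russian': ['Russian', 'russian']}
```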
I tried to commit the shell script, but it seems I don't have that permission:
create mode 100755 v02/utilities/find_lang_unique.sh
[root@localhost utilities]# git push
Username for 'https://github.com': YevgenJohn
Password for 'https://YevgenJohn@github.com':
remote: Permission to sanskrit-lexicon/csl-pywork.git denied to YevgenJohn.
fatal: unable to access 'https://github.com/sanskrit-lexicon/csl-pywork.git/': The requested URL returned error: 403
Basically, the results above could have been done using one shell script:
#!/bin/bash
for i in $(ls ../../../ | egrep -v csl) ; do python check_xml_tags.py ../../../${i}/pywork/${i}.xml ${i}.txt ;done
for i in $(ls *.txt); do grep '<lang' $i | awk -F'<' '{print $2}'; done | sort | uniq
for s in `for i in $(ls *.txt); do grep '<lang' $i | awk -F'<' '{print $2}'; done | sort | uniq | awk -F'n=' '{print $2}' | uniq -ic | egrep -v ' 1 ' | awk '{printf("n=\"%s%s\nn=%s\n",toupper(substr($2,2,1)),substr($2,3),$2)}'`; do grep "lang $s" * ; done
rm -f *.txt
which would give the same results, so I am not sure if you need a script in the repo, as this seems to be a one-time search:
pwg.txt:000001 <lang n="Greek">
ben.txt:001490 <lang n="greek">
bhs.txt:000003 <lang n="greek">
bop.txt:001701 <lang n="greek">
bur.txt:000677 <lang n="greek">
cae.txt:000003 <lang n="greek">
gra.txt:000229 <lang n="greek">
gst.txt:000013 <lang n="greek">
inm.txt:010778 <lang n="greek">
md.txt:000008 <lang n="greek">
mw72.txt:001665 <lang n="greek">
mw.txt:001157 <lang n="greek">
pwg.txt:000397 <lang n="greek">
pw.txt:000186 <lang n="greek">
snp.txt:000001 <lang n="greek">
stc.txt:000001 <lang n="greek">
vei.txt:000147 <lang n="greek">
wil.txt:000023 <lang n="greek">
pwg.txt:000023 <lang n="Russian">
pw.txt:000001 <lang n="russian">
I am surprised that even scanned images fit there.
No, actually the scanned images are NOT part of any repository.
Currently, the logic involved in displaying scanned images (this logic is part of csl-websanlexicon) looks for a local copy of the images (in the web/pdfpages directory). But if it fails to find images there, it gets the images from Cologne server.
The images are also available from an AWS-S3 bucket, but using that source of images is not currently built into csl-websanlexicon code.
It is precisely for size reasons that the scanned images are not in a repository -- I think their total size would be about 50-60GB.
If we want to give Ubuntu (and other local) installations the option to have local copies of the images, we need to develop some way to do this, and add this to the installation instructions.
If you want to work on this, I can provide some further details.
> I hope that the regex used in DTD is the same parser python uses.

The check_xml_tags.py program actually is not using a python xml parser. It is just reading the xml file as lines of text and then looking for <...> tags.
On a local XAMPP system, it is hard to get the xmllint xml-validator; xmllint is used in the redo_xml.sh script to check that a given dictionary validates according to its dtd.
As a substitute, I have written a (simple) xml validator in python, based on the lxml python library. In recent work, I found that the python validator and the xmllint validator seemed always to give the same results, so I feel comfortable using the python validator locally. However, the xmllint validator is often much faster at detecting errors than the python validator.
> actually the scanned images are NOT part of any repository. If you want to work on this, I can provide some further details.
Absolutely, I would like to work on this: in case the Cologne server is not available, or the VM runs offline, we need an option to stock the VM with local images. I suspect some discrepancies between the digital version and the scanned pages are inevitable for a project of this size, so it's important to have the images alongside the digital version of each dictionary.
I don't know whether GitHub charges for 50-60GB of images, which would be accessed read-only, or whether that's cheaper compared to an AWS-S3 bucket.
Please advise what to take a look at (I guess image fetching is part of the php code), so we can give that option to the standalone builds.
Thank you!
@funderburkjim I'm impressed by https://github.com/sanskrit-lexicon/csl-pywork/tree/master/v02
In the previous revision of csl-pywork, the dictionary dtds (xxx.dtd) were in 'distinctfiles'. That is, when reconstructing a 2020 dictionary, csl-pywork used a separate version of xxx.dtd for each dictionary. Now, csl-pywork uses a single one.dtd template to create the different versions.
This is an improvement, because now we can see all the variations in one place.