sanskrit-lexicon / csl-pywork

A template for creating pywork repository for each dictionary.

xxx.dtd a template #6

Open funderburkjim opened 4 years ago

funderburkjim commented 4 years ago

In the previous revision of csl-pywork, the dictionary dtds (xxx.dtd) were in 'distinctfiles'. That is, when reconstructing a 2020 dictionary, csl-pywork used a separate version of xxx.dtd for each dictionary. Now, csl-pywork uses a single one.dtd template to create the different versions.

This is an improvement, because now we can see all the variations in one place.

funderburkjim commented 4 years ago

how one.dtd was constructed.

While similar in some ways to the relation between the make_xml.py template (see #5) and the individual versions, this relation is somewhat different.

In the case of make_xml.py template, the template is used to generate, for each dictionary xxx, a version of make_xml.py for that dictionary that is functionally the same as the prior distinct version; namely the generated and distinct versions create the same xxx.xml file.

The xxx.dtd generated by one.dtd is also functionally similar to the previous distinct xxx.dtd, in that the xml file xxx.xml is judged valid by both.

However, the xxx.dtd generated by one.dtd is quite different from the previous distinct xxx.dtd. The way the one.dtd template was developed is described in readme_dtd.txt. In brief, one.dtd started out as a copy of the previous distinct acc.dtd. Then, one.dtd was adjusted one dictionary at a time by

funderburkjim commented 4 years ago

Must one.dtd be a template?

For dictionary xxx, the xml root of xxx.xml is xxx. In other words, the xml structure of xxx.xml is

<xxx>
<!-- many other elements used in the xml form of the dictionary entries -->
</xxx>

So, at the least, the difference in root elements dictates that one.dtd must use xxx as a template variable.
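A minimal sketch of why the root element forces a template variable. It uses Python's string.Template; the DTD fragment, the `${dictlo}` placeholder name and the `render_dtd` helper are illustrative assumptions, not the actual one.dtd syntax or the csl-pywork code:

```python
from string import Template

# Hypothetical fragment; the real one.dtd declares many more elements.
DTD_TEMPLATE = Template(
    "<!ELEMENT ${dictlo} (H1)*>\n"
    "<!ELEMENT H1 (#PCDATA)>\n"
)


def render_dtd(dict_code: str) -> str:
    """Return the DTD text for one dictionary code, e.g. 'mw' or 'acc'."""
    return DTD_TEMPLATE.substitute(dictlo=dict_code)


if __name__ == "__main__":
    # Only the root element name changes between dictionaries here.
    print(render_dtd("acc"))
```

Everything except the root element name is shared text in this sketch, which is the situation a single template is meant to capture.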

Otherwise, there are only two places where template variables are used.

funderburkjim commented 4 years ago

removal of template logic for AP

We remove the template distinction for AP dictionary as follows.

funderburkjim commented 4 years ago

Suggestions for improvement

With all the dtds represented in one.dtd, we can now examine one.dtd with an eye towards simplification.

Some additional tools needed.

In investigating such simplifications as above, some additional software tools will probably be needed. One that comes to mind and that is already written is:

gasyoun commented 4 years ago

@funderburkjim

But in case of mw, the children of the root element are of 20 types which can be understood by the regular expression H[1-4][ABCE]?.

What tool would you use to count how many of each 20 are there?

YevgenJohn commented 4 years ago

@funderburkjim

But in case of mw, the children of the root element are of 20 types which can be understood by the regular expression H[1-4][ABCE]?.

What tool would you use to count how many of each 20 are there?

Can we use something like https://github.com/teeshop/rexgen ?

[root@localhost rexgen]# rexgen H[1-4][ABCE] | wc -l
16
[root@localhost rexgen]# rexgen H[1-4][ABCE]
H1A
H2A
H3A
H4A
H1B
H2B
H3B
H4B
H1C
H2C
H3C
H4C
H1E
H2E
H3E
H4E
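The rexgen run above omits the trailing `?`, which is why it yields 16 names rather than 20. A small sketch in plain Python (the function name is made up for illustration) that treats the suffix as optional, giving 4 digits x 5 suffix choices = 20 names:

```python
from itertools import product


def expand_tag_names():
    """Enumerate every string matched by H[1-4][ABCE]? (suffix optional)."""
    digits = "1234"
    suffixes = ["", "A", "B", "C", "E"]  # "" covers the optional '?' case
    return ["H" + d + s for d, s in product(digits, suffixes)]


if __name__ == "__main__":
    names = expand_tag_names()
    print(len(names))
    print(" ".join(names))
```

This only enumerates the possible tag names; it does not count how often each one occurs in mw.xml.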

YevgenJohn commented 4 years ago

Some additional tools needed.

  • check_xml_tags.py (currently exists in MWScan/2014/pywork/) This program reads a text file

Where in https://github.com/sanskrit-lexicon can I find MWScan/2014/pywork/? None of the CSL* repos has check_xml_tags.py, and neither csl-orig nor csl-pywork has anything for 2014 (only 2020).

Please advise, I'd like to work on this bash tool. Thank you!

funderburkjim commented 4 years ago

I've added a 'v02/utilities/' folder to this repository, and put check_xml_tags.py there. It is an analytical tool, not used in the dictionary generation.

YevgenJohn commented 4 years ago

Thank you! I see it there.

funderburkjim commented 4 years ago

Note on .gitignore

There are often occasions where I want to do some kind of analysis; an example might be to try check_xml_tags.py. BUT I don't want to add material to what is tracked by git. The .gitignore has a 'temp*' line in it. Thus I can add a 'tempxyz' directory any convenient place in the local copy of csl-pywork, and put anything in there.

YevgenJohn commented 4 years ago

We might benefit from another branch for this repo. One could switch between branches, using either the default one for the generic dictionary use, or the other branch for some analytics use, if that is convenient for the team.

funderburkjim commented 4 years ago

What tool would you use to count how many of each 20 are there?

A one-line variation of check_xml_tags.py does the trick. Change line 10 to: tags = re.findall(r'<H.*?>',line)

Call the new program, for example, v02/utilities/temp.py. And run it with python temp.py ../../../mw/pywork/mw.xml temp.txt .

Then temp.txt contains the list of 20, with counts. For example 009468 <H1A>.
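For readers without the utility at hand, here is a rough standalone sketch of the same idea: read lines, collect `<H...>` tags with the regex quoted above, and print zero-padded counts. It is not the actual check_xml_tags.py; the function names, the sort order and the sample lines are assumptions made for illustration.

```python
import re
from collections import Counter


def count_tags(lines, pattern=r"<H.*?>"):
    """Count each distinct tag matching `pattern` across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(pattern, line))
    return counts


def format_counts(counts):
    """Render counts as 'NNNNNN <tag>' lines, sorted by tag text."""
    return [f"{n:06d} {tag}" for tag, n in sorted(counts.items())]


if __name__ == "__main__":
    # Made-up sample lines standing in for mw.xml.
    sample = [
        "<H1><h>..</h></H1>",
        "<H1A><h>..</h></H1A>",
        "<H1><h>..</h></H1>",
    ]
    for row in format_counts(count_tags(sample)):
        print(row)
```

Closing tags such as `</H1>` are not counted, because the pattern requires `H` immediately after `<`.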

funderburkjim commented 4 years ago

We might benefit from another branch for this repo

My understanding of git does not yet extend to how to make use of branches. If you have something specific in mind, go ahead and give it a try. Let's take it in baby steps until we all understand how to make use of branches. If you do this, be careful as to size of files added to the repository. Currently, the repository tracks just fairly small program files.

gasyoun commented 4 years ago

Currently, the repository tracks just fairly small program files.

Sure. I think we can benefit from Yevgeniy's experience.

YevgenJohn commented 4 years ago

I am from the GitLab world, but GitHub should have it also, as that's part of regular Git functionality. Git has a -b option for clone, checkout, etc.: https://stackoverflow.com/questions/1911109/how-do-i-clone-a-specific-git-branch

These are the Git docs (which I consider the best): https://git-scm.com/book/en/v2/Git-Branching-Basic-Branching-and-Merging

Please let me try to add a branch to the repo with my credentials. A branch can always be removed when it is not needed. At work we make personal branches, per person, per task, etc., so we eventually end up merging some of them and removing others.

Understood about the size; as we are dealing with scripts, it shouldn't be an issue. I am surprised that even scanned images fit there. One of my Git projects currently has over 5,000 branches and is doing well, so Git has the capacity. It was made for the Linux kernel, with thousands of participants who often each have their own branch, since a merge request must come from a branch in order to be added to the master branch. It is a powerful mechanism that lets code exist in parallel yet stay linked to the same repository.

Here's how GitHub manages it: https://help.github.com/en/articles/creating-and-deleting-branches-within-your-repository

When I try it with my credentials, it doesn't show 'Create branch', so I must not have that permission. The repo owner has that option, so the "analytics" branch (for example) could be created. All we need to do is use '-b analytics' when we work with that branch.

YevgenJohn commented 4 years ago

A one-line variation of check_xml_tags.py does the trick. Change line 10 to:

That's probably the safest way, in case the "rexgen" tool differs in the regex engine it uses, as some metacharacters might be interpreted slightly differently depending on the parser (there's an entire book on those regex engine subtleties on Safari; apologies for the sidetrack).

I hope that the regex used in DTD is the same parser python uses.

YevgenJohn commented 4 years ago

  • Write a bash shell to run check_xml_tags on all dictionaries,

This is what I see parsing all the dictionaries:

for i in $(ls ../../../ | egrep -v csl) ; do python check_xml_tags.py ../../../${i}/pywork/${i}.xml ${i}.txt ;done
[root@localhost utilities]# for i in $(ls *.txt); do grep '<lang' $i | awk -F'<' '{print $2}'; done | sort | uniq
lang n="arabic">
lang n="greek">
lang n="Greek">
lang n="meter">
lang n="Old-Church-Slavonic">
lang n="oldhebrew">
lang n="russian">
lang n="Russian">
lang n="slavic">
lang script="Arabic" n="Arabic">
lang script="Arabic" n="Hindustani">
lang script="Arabic" n="Persian">
lang script="Arabic" n="Turkish">

We need to unify Greek/greek (all other dictionaries vs. pwg, which also has the capitalized form) and Russian/russian (pw vs. pwg, which capitalizes); everything else seems unique. It's simpler to lowercase those two cases in pwg than to modify the many dictionaries that use lowercase greek.
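A hedged sketch of how such case-only duplicates could be detected automatically. The helper name is hypothetical, and the input is just the plain n="..." values from the listing above (the script="Arabic" forms are left out, since they are a different tag shape):

```python
from collections import defaultdict


def case_variants(values):
    """Group values that differ only by letter case; keep groups with more than one spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[value.lower()].add(value)
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}


if __name__ == "__main__":
    observed = [
        "arabic", "greek", "Greek", "meter", "Old-Church-Slavonic",
        "oldhebrew", "russian", "Russian", "slavic",
    ]
    for key, spellings in case_variants(observed).items():
        print(key, spellings)
```

On this input only the greek and russian groups are reported.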

[root@localhost utilities]# grep 'lang n="russian"' *
pw.txt:000001 <lang n="russian">
[root@localhost utilities]# grep 'lang n="Russian"' *
pwg.txt:000023 <lang n="Russian">
[root@localhost utilities]# grep 'lang n="greek"' *
ben.txt:001490 <lang n="greek">
bhs.txt:000003 <lang n="greek">
bop.txt:001701 <lang n="greek">
bur.txt:000677 <lang n="greek">
cae.txt:000003 <lang n="greek">
gra.txt:000229 <lang n="greek">
gst.txt:000013 <lang n="greek">
inm.txt:010778 <lang n="greek">
md.txt:000008 <lang n="greek">
mw72.txt:001665 <lang n="greek">
mw.txt:001157 <lang n="greek">
pwg.txt:000397 <lang n="greek">
pw.txt:000186 <lang n="greek">
snp.txt:000001 <lang n="greek">
stc.txt:000001 <lang n="greek">
vei.txt:000147 <lang n="greek">
wil.txt:000023 <lang n="greek">
[root@localhost utilities]# grep 'lang n="Greek"' *
pwg.txt:000001 <lang n="Greek">

YevgenJohn commented 4 years ago

I tried to commit the shell script, but it seems I don't have that permission:

 create mode 100755 v02/utilities/find_lang_unique.sh
[root@localhost utilities]# git push
Username for 'https://github.com': YevgenJohn
Password for 'https://YevgenJohn@github.com':
remote: Permission to sanskrit-lexicon/csl-pywork.git denied to YevgenJohn.
fatal: unable to access 'https://github.com/sanskrit-lexicon/csl-pywork.git/': The requested URL returned error: 403

Basically, the results above could have been done using one shell script:

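#!/bin/bash
# Run check_xml_tags.py on every dictionary (skipping csl-* repos); writes <dict>.txt here.
for i in $(ls ../../../ | egrep -v csl) ; do python check_xml_tags.py ../../../${i}/pywork/${i}.xml ${i}.txt ;done
# List the distinct <lang ...> tags found across all dictionaries.
for i in $(ls *.txt); do grep '<lang' $i | awk -F'<' '{print $2}'; done | sort | uniq
# For lang names that differ only by case, show which reports contain each spelling.
for s in `for i in $(ls *.txt); do grep '<lang' $i | awk -F'<' '{print $2}'; done | sort | uniq | awk -F'n=' '{print $2}' | uniq -ic | egrep -v ' 1 ' | awk '{printf("n=\"%s%s\nn=%s\n",toupper(substr($2,2,1)),substr($2,3),$2)}'`; do grep "lang $s" * ; done
# Remove the generated reports (note: deletes every .txt file in this directory).
rm -f *.txt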

which would give the same results, so not sure if you need a script in the repo, as this seems to be one-time search:

pwg.txt:000001 <lang n="Greek">
ben.txt:001490 <lang n="greek">
bhs.txt:000003 <lang n="greek">
bop.txt:001701 <lang n="greek">
bur.txt:000677 <lang n="greek">
cae.txt:000003 <lang n="greek">
gra.txt:000229 <lang n="greek">
gst.txt:000013 <lang n="greek">
inm.txt:010778 <lang n="greek">
md.txt:000008 <lang n="greek">
mw72.txt:001665 <lang n="greek">
mw.txt:001157 <lang n="greek">
pwg.txt:000397 <lang n="greek">
pw.txt:000186 <lang n="greek">
snp.txt:000001 <lang n="greek">
stc.txt:000001 <lang n="greek">
vei.txt:000147 <lang n="greek">
wil.txt:000023 <lang n="greek">
pwg.txt:000023 <lang n="Russian">
pw.txt:000001 <lang n="russian">

funderburkjim commented 4 years ago

I am surprised that even scanned images fit there.

No, actually the scanned images are NOT part of any repository.

Currently, the logic involved in displaying scanned images (this logic is part of csl-websanlexicon) looks for a local copy of the images (in the web/pdfpages directory). But if it fails to find images there, it gets the images from Cologne server.
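The fallback can be pictured with a short sketch. The real logic lives in csl-websanlexicon and is not reproduced here; the function name, the file name and the remote base URL below are placeholders chosen purely for illustration.

```python
from pathlib import Path


def resolve_image(filename, local_dir, remote_base):
    """Prefer a local copy of a scanned page; otherwise point at the remote server."""
    local_path = Path(local_dir) / filename
    if local_path.is_file():
        return str(local_path)
    # No local copy found: fall back to the remote location.
    return remote_base.rstrip("/") + "/" + filename


if __name__ == "__main__":
    # 'web/pdfpages' is the local directory named above; the URL is a placeholder.
    print(resolve_image("pg_0001.pdf", "web/pdfpages", "https://example.org/scans/"))
```

Supporting the AWS-S3 source would presumably mean adding another fallback in the same place, but that is a guess about a design not described here.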

The images are also available from an AWS-S3 bucket, but using that source of images is not currently built into csl-websanlexicon code.

It is precisely for size reasons that the scanned images are not in a repository -- I think their total size would be about 50-60GB.

If we want to give Ubuntu (and other local) installations the option to have local copies of the images, we need to develop some way to do this, and add this to the installation instructions.

If you want to work on this, I can provide some further details.

funderburkjim commented 4 years ago

I hope that the regex used in DTD is the same parser python uses.

The check_xml_tags.py program actually is not using a python xml parser. It is just reading the xml file as lines of text and then looking for <...> tags.

aside on xml validators

On a local XAMPP system, it is hard to get the xmllint xml-validator. (xmllint is used in the redo_xml.sh script to check that a given dictionary validates according to its dtd.)
As a substitute, I have written a (simple) xml validator in python, based on the lxml python library. In recent work, I found that the python validator and the xmllint validator always seemed to give the same results, so I feel comfortable using the python validator locally. However, the xmllint validator is often much faster at detecting errors than the python validator.
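For orientation, DTD validation with lxml can look roughly like the sketch below. This is not the validator mentioned above; the function name and the tiny DTD and XML strings are assumptions, and lxml is a third-party package that has to be installed separately.

```python
import io

from lxml import etree  # third-party: pip install lxml


def validate_xml(xml_text, dtd_text):
    """Validate an XML string against a DTD string; return (ok, error messages)."""
    dtd = etree.DTD(io.StringIO(dtd_text))
    root = etree.fromstring(xml_text)
    ok = dtd.validate(root)
    errors = [str(entry) for entry in dtd.error_log.filter_from_errors()]
    return ok, errors


if __name__ == "__main__":
    # Toy DTD and document, far smaller than any real dictionary dtd.
    toy_dtd = "<!ELEMENT acc (H1)*>\n<!ELEMENT H1 (#PCDATA)>"
    print(validate_xml("<acc><H1>a</H1></acc>", toy_dtd))
    print(validate_xml("<acc><H2>a</H2></acc>", toy_dtd))
```

A real dictionary file is large, so a practical validator would more likely parse from a file path than from an in-memory string.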

YevgenJohn commented 4 years ago

actually the scanned images are NOT part of any repository. If you want to work on this, I can provide some further details.

Absolutely, I would like to work on this. In case the Cologne server is not available, or the VM runs offline, we need an option to stock the VM with local images. I suspect some discrepancies between the digital version and the pictures are inevitable for a project of this size, so it's important to have the pictures alongside the digital version of the dictionary.

I don't know whether GitHub charges for 50-60GB of pictures that would be accessed read-only, or whether that's cheaper compared to an AWS-S3 bucket.

Please advise what to take a look at (I guess image fetching is part of the php), so we can give that option to the standalone builds.

Thank you!

gasyoun commented 4 years ago

@funderburkjim I'm impressed by https://github.com/sanskrit-lexicon/csl-pywork/tree/master/v02