dtd study - Githubissues

funderburkjim commented 4 years ago

This pertains to https://github.com/sanskrit-lexicon/csl-pywork/issues/6#issuecomment-544297530.

As can be seen in the link, there are some minor spelling differences among dictionaries that should be removed, for the sake of uniformity and simplicity. For example <lang n="Greek"> should be changed to <lang n="greek"> in pwg, since most of the dictionaries use the lower case spelling.

A study is needed to prepare data that can serve as a basis for this.

The check_xml_tags.py program and the grep/awk bash script are close to what is needed, but not quite.

The point is that these listings can get cluttered with many thousands of lines due to attributes that can have any text value. For instance, python check_xml_tags.py MW temp_mw_tags.txt generates a file with 28000+ lines, which is too much data to be useful.

The solution is to refine the filter.

funderburkjim commented 4 years ago

The dtd will help

one.dtd can help.

Look at all the <ATTLIST elt attr VALUES> lines. The VALUES can have one of two forms: CDATA or an enumerated list of allowed values.
CDATA means (almost) any value, so these are the places that generate the thousands of lines.

A modified check_xml_tags program should aggregate just over the attribute name, rather than include the attribute value, when dealing with a CDATA attribute. On the other hand, for the non CDATA attributes, the aggregation should include the attribute value.

Another minor improvement would exclude the closing tags </xyz> -- no need to aggregate them.
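As a rough illustration of the CDATA versus enumerated distinction described above, the following sketch reads `<!ATTLIST ...>` declarations from DTD text and classifies each (element, attribute) pair. The regular expression, the function name `attlist_types`, and the sample DTD fragment are assumptions made for this example; they are not taken from one.dtd, and a declaration that lists several attributes at once would be only partly handled by this simple pattern.

```python
import re

# Matches one attribute definition per declaration, for example
#   <!ATTLIST lang n (greek | arabic) #REQUIRED>
#   <!ATTLIST info westergaard CDATA #IMPLIED>
ATTLIST_RE = re.compile(
    r'<!ATTLIST\s+(\S+)\s+(\S+)\s+(CDATA|\([^)]*\))', re.DOTALL)

def attlist_types(dtd_text):
    """Return {(element, attribute): 'CDATA' or list of allowed values}."""
    types = {}
    for elt, attr, values in ATTLIST_RE.findall(dtd_text):
        if values == 'CDATA':
            types[(elt, attr)] = 'CDATA'
        else:
            # strip the parentheses, then split the enumeration on '|'
            types[(elt, attr)] = [v.strip() for v in values[1:-1].split('|')]
    return types

# Hypothetical DTD fragment, not copied from one.dtd
sample = '''
<!ATTLIST lang n (greek | arabic |slavic) #REQUIRED>
<!ATTLIST info westergaard CDATA #IMPLIED>
'''
types = attlist_types(sample)
```

With such a mapping, an aggregation step could decide per attribute whether to keep the value (enumerated) or only the attribute name (CDATA).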

Probably a variant of check_xml_tags is the best way to do the above.

A loop over dictionaries could be done within the python program.

The format vei.txt:000147 <lang n="greek"> for output is probably good, where the sort is two-fold: first over tag (<xxx>), then over dictionary code.

@YevgenJohn -- do you want to prepare this? Once it is ready, we can begin to identify the places where the digitizations should be changed, like 'Greek' -> 'greek' above.

funderburkjim commented 4 years ago

Here is a Python list of dictionary codes:

["acc","ae","ap90","ben","bhs","bop","bor","bur","cae","ccs","gra","gst","ieg","inm","krm","mci","md","mw","mw72","mwe","pe","pgn","pui","pw","pwg","sch","shs","skd","snp","stc","vcp","vei","wil","yat"]

YevgenJohn commented 4 years ago

Absolutely, I would like to prepare this, working on it... Thank you

YevgenJohn commented 4 years ago

I created an 'analytics' branch in the repo. It has the script as well as the results from my local run: https://github.com/sanskrit-lexicon/csl-pywork/commit/f56294c23624bb8699e7ae6c86edd321f808eae0 Please advise if PrettyPrinter's result is easy to understand; I wasn't sure how we would process the results, so I made the simplest code possible. For example, for the ACC dictionary, the output looks like this:

--- ../../../acc/pywork/acc.xml ---
{   'F': {   '_cnt_': 33},
    'H': {   '_cnt_': 49},
    'H1': {   '_cnt_': 49822},
    'L': {   '_cnt_': 49822},
    'acc': {   '_cnt_': 1},
    'alt': {   '_cnt_': 1592},
    'body': {   '_cnt_': 49822},
    'br': {   '_cnt_': 36696},
    'div': {   '_cnt_': 37116, 'n': {   '2': 31678, '3': 5418, 'P': 20}},
    'h': {   '_cnt_': 49822},
    'hwtype': {   '_cnt_': 1592, 'n': {   'alt': 1592}, 'ref': 1592},
    'i': {   '_cnt_': 1373},
    'key1': {   '_cnt_': 49822},
    'key2': {   '_cnt_': 49822},
    'pc': {   '_cnt_': 49822},
    's': {   '_cnt_': 56786},
    'symbol': {   '_cnt_': 12375, 'n': 12375},
    'tail': {   '_cnt_': 49822}}

_cnt_ is the counter (named this way to distinguish it from the attributes). Attributes with CDATA have only a counter, while those with a list of values have a counter per value. Please advise what we would like to do with it further. Thank you!
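For readers who want to see how a structure like the one above could be produced, here is a minimal sketch using only the standard-library `xml.etree.ElementTree`. It is not the script from the analytics branch; the function name `count_tags`, the `attr_types` mapping, and the tiny sample document are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def count_tags(xml_text, attr_types):
    """Count elements; tally values only for enumerated attributes.

    attr_types maps (element, attribute) to 'CDATA' or a list of values
    (a hypothetical structure, as might be derived from a DTD).
    """
    counts = {}
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        entry = counts.setdefault(elem.tag, {'_cnt_': 0})
        entry['_cnt_'] += 1
        for attr, value in elem.attrib.items():
            if attr_types.get((elem.tag, attr), 'CDATA') == 'CDATA':
                # free-text attribute: count occurrences only
                entry[attr] = entry.get(attr, 0) + 1
            else:
                # enumerated attribute: count each value separately
                per_value = entry.setdefault(attr, {})
                per_value[value] = per_value.get(value, 0) + 1
    return counts

# Tiny made-up document, not real dictionary data
xml_text = ('<acc><div n="2">a</div><div n="2">b</div>'
            '<div n="P">c</div><symbol n="x"/></acc>')
attr_types = {('div', 'n'): ['2', '3', 'P'], ('symbol', 'n'): 'CDATA'}
counts = count_tags(xml_text, attr_types)
```

For the full dictionary files, an incremental parser such as `ET.iterparse` would probably be preferable to loading the whole document with `ET.fromstring`.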

drdhaval2785 commented 4 years ago

The candidates for further examination seem to be

acc
div.P
Whether alt and hw.alt are both needed.

YevgenJohn commented 4 years ago

The candidates for further examination seem to be

acc

This is the root tag with the dictionary name, as the file was acc/pywork/acc.xml. I should probably remove the root element from the output. Thank you!

For dictionary xxx, the xml root of xxx.xml is xxx. In other words, the xml structure of xxx.xml is
<xxx>
<!-- many other elements used in the xml form of the dictionary entries -->
</xxx>

funderburkjim commented 4 years ago

@YevgenJohn

Looks like the information needed is in your json output. Good going! But I would request another output format: 3 fields, separated by a space character.

col 1 : count 6 characters, right justified, 0 filled
col 2 : dictionary code, 4 characters, left justified
col 3 : tag
- <tag> if 'tag' appears with no attributes in xxx.xml
- <tag attr> if 'attr' is a CDATA attribute type in one.dtd
- <tag attr="value"> if 'attr' is enumerated attribute type in one.dtd

Examples:

000033 acc  <F>
...
031678 acc  <div n="2">
005418 acc  <div n="3">
000020 acc  <div n="P">
...
001361 mw   <info westergaard>

Request you to generate two output files in above format, with different sorts:

sort by col2 as major sort key, col3 as minor sort key
- with this sort, we'll see all the tags used in a particular dictionary
sort by col3 as major sort key, col2 as minor sort key
- with this sort, we'll see how a given tag is used in various dictionaries.
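The three-column format and the two sort orders requested above can be sketched as follows. The helper name `format_line` and the sample records are assumptions made for illustration; the counts are not real results.

```python
def format_line(count, dictcode, tag):
    """count: 6 digits, zero-filled; dictcode: 4 chars, left-justified."""
    return '%06d %-4s %s' % (count, dictcode, tag)

# Made-up records: (count, dictionary code, tag)
records = [
    (1361, 'mw', '<info westergaard>'),
    (31678, 'acc', '<div n="2">'),
    (12, 'pwg', '<F>'),
    (33, 'acc', '<F>'),
]

# Sort 1: dictionary code as major key, tag as minor key
by_dict = sorted(records, key=lambda r: (r[1], r[2]))
# Sort 2: tag as major key, dictionary code as minor key
by_tag = sorted(records, key=lambda r: (r[2], r[1]))

lines_by_dict = [format_line(*r) for r in by_dict]
lines_by_tag = [format_line(*r) for r in by_tag]
```

Because the dictionary code is padded to four characters, a two-letter code such as mw is followed by three spaces before the tag, while acc is followed by two.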

I agree that the root element (xxx) can be eliminated in the output, but on the other hand it's only 34 extra lines out of 700+ lines, so removing it is not too important if it takes much time to do.

Note: there might be some tags which appear sometimes with an attribute, and sometimes with no attribute. <div> is a possible case. I'm not sure if that distinction is in the json output.

Question: On my local copy of csl-pywork, what basic commands do I use to

pull down the 'analytics' branch from Github
push the 'analytics' branch to Github
switch to the branch from master
switch back to master branch from 'analytics' branch.

funderburkjim commented 4 years ago

comment on run.sh

Your bash script obviously runs on your system just fine. But, it derives the dictionary codes in a way that could be unreliable (e.g., suppose I had a cologne/tempacc folder). Also, the form would be hard to adapt to run on the Cologne server, where the location of pywork folder is slightly different:

scans
- ACCScan
  - 2020
    - pywork
- csl-pywork
- etc.

So here is another form of run.sh that would be more flexible:

for dictlo in  acc ae ap ap90 ben   bhs bop bor bur cae  ccs gra gst ieg inm  krm mci md mw mw72  mwe pd pe pgn pui    pw pwg sch shs skd snp stc vcp vei wil  yat
do
 f="../../../${dictlo}/pywork/${dictlo}.xml"
 python parse.py "$f" ../makotemplates/pywork/one.dtd
done

Then, the script could be run at command line by

sh run.sh > run.txt 2> temp_run_err.txt

Note the use of '.txt' file type for text output. This is a convention, and helps Windows OS identify how to open the file. Notice also the convention to name the output files in a way that relates them to the program. Also, the err file is a temp file, which keeps it from being tracked by Git (since 'temp' is in the top-level .gitignore).
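Since an earlier comment notes that the loop over dictionaries could also be done within the Python program, here is a minimal sketch of that alternative. The function name `xml_paths`, the `base` parameter, and the short list of codes are assumptions for illustration. The layout base/xxx/pywork/xxx.xml matches the local paths in run.sh; the Cologne layout (e.g. ACCScan/2020/pywork) would need a different path rule.

```python
import os

def xml_paths(dictcodes, base):
    """Yield (code, path) for each dictionary whose xml file exists.

    base is the directory containing the per-dictionary folders
    (assumed layout: base/xxx/pywork/xxx.xml).
    """
    for code in dictcodes:
        path = os.path.join(base, code, 'pywork', code + '.xml')
        if os.path.isfile(path):
            yield code, path

# A short subset of dictionary codes, for illustration only
codes = ['acc', 'mw', 'pwg']
found = list(xml_paths(codes, os.path.join('..', '..', '..')))
```

Listing the codes explicitly, as in the run.sh above, avoids picking up stray folders such as a hypothetical cologne/tempacc.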

I realize these are kind of minor, picky suggestions. Hope you don't mind.

YevgenJohn commented 4 years ago

Question: On my local copy of csl-pywork, what basic commands do I use to

pull down the 'analytics' branch from Github

push the 'analytics' branch to Github

switch to the branch from master

switch back to master branch from 'analytics' branch.

I will be working on the enhancement mentioned above.

While Git allows several ways to switch branches, I personally (for safety reasons) check out branches in different folders, so I don't have to switch between them within the same folder (which has obstacles when the local copy has non-committed files). So I'd create something like:

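mkdir master && cd master
git clone https://github.com/sanskrit-lexicon/csl-pywork.git
cd csl-pywork   # git commands must be run inside the cloned repository
git commit      # after editing and staging files with 'git add'
git push   # this will push to the master branch
cd ../..        # back to the starting folder

mkdir analytics && cd analytics
git clone -b analytics https://github.com/sanskrit-lexicon/csl-pywork.git
cd csl-pywork
git commit      # after editing and staging files with 'git add'
git push   # this will push to the analytics branch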

That way I can just go to the master or analytics folder and it will serve the appropriate branch.

The GitHub Gui shows the current branch which can easily be altered between them to check commits to the required branch.

YevgenJohn commented 4 years ago

So here is another form of run.sh that would be more flexible:


for dictlo in  acc ae ap ap90 ben   bhs bop bor bur cae  ccs gra gst ieg inm  krm mci md mw mw72  mwe pd pe pgn pui    pw pwg sch shs skd snp stc vcp vei wil  yat
do
I realize these are kind of minor, picky suggestions. Hope you don't mind.

Absolutely, I appreciate any help with conventions which I still need to absorb to make it compatible with the entire system design, thank you!

I made run.sh as a quick iterator over dictionaries (which caught some unnecessary files reflected in err), so, definitely, this version is a more universal one, I'll be making them that way, knowing that the file structure might be different between servers.

I'm still learning my way around the content, trying to understand the transformation the setup script is doing. I see it takes some textual dictionary files, replaces that markup with tags, and places the result into xml files. I suspect that the transformation is dictionary specific, and that the php code which displays it is also dictionary specific in how it interprets the tags, so the unification of those markup->tags and tags->UI steps is the task we perform here. Please correct me if I am wrong, I am trying to get the bigger picture of what's going on. Thank you!

funderburkjim commented 4 years ago

comment on Python

The Python version still in effect on Cologne server is 2.6, and from the 'print xxx' statement in your parse.py, I think you must also be running a Python 2. Dhaval and I have begun writing our code so it will run on either python 2 or 3.

One of the details is that in Python 2, you need to take care how to open files with utf-8 encoded unicode. The xml files are all utf-8 encoded Unicode, as are the xxx.txt digitizations. The following should be reliable for both Python 2 and Python 3:

import codecs
with codecs.open(filename, "r", "utf-8") as f:
    lines = [line.rstrip("\r\n") for line in f]  # read all lines as unicode
# similarly, use codecs.open(fileout, "w", "utf-8") when opening for writing
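As a slightly fuller sketch of the `codecs.open` pattern, the following writes a few unicode lines to a utf-8 file and reads them back. The helper names `read_lines` and `write_lines` and the temporary file name are assumptions for illustration; only `codecs.open` itself comes from the comment above.

```python
import codecs
import os
import tempfile

def read_lines(filename):
    """Read a utf-8 file into a list of unicode lines (no line endings)."""
    with codecs.open(filename, "r", "utf-8") as f:
        return [line.rstrip("\r\n") for line in f]

def write_lines(filename, lines):
    """Write unicode lines to a utf-8 file, one per line."""
    with codecs.open(filename, "w", "utf-8") as f:
        for line in lines:
            f.write(line + "\n")

# Round trip through a temporary file (the file name is arbitrary)
path = os.path.join(tempfile.mkdtemp(), "temp_demo.txt")
original = [u"<lang n=\"greek\">", u"\u0905\u0915\u094d\u0937"]
write_lines(path, original)
result = read_lines(path)
```

The `u"..."` literals are accepted by Python 2 and by Python 3.3 and later, so the same source should run under either.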

lxml

I note you are using lxml. I just discovered it is available under Python2 on Cologne, but not Python 3.4. None of the production code at Cologne uses lxml. And, due to difficulty in installation of lxml, I don't think production code should use lxml.

But since you're doing a one-off analysis which should run with python2 at Cologne, I guess it doesn't matter much. But do keep this in mind.

In general, I like to use only simple built-in Python modules; because it makes code more portable. @drdhaval2785 what's your opinion on this?

drdhaval2785 commented 4 years ago

I am with you @funderburkjim. @YevgenJohn, the server at Cologne does not allow us to install various libraries freely. We have to request the webmasters to do it for us. They have python2.6.6 and python3.4.10 I guess. Therefore my advice would be

to use code compatible with these two versions and not to use functionalities offered by newer versions.
Avoid using third party libraries as far as possible. e.g. etree would be preferred over lxml, even if a bit tedious.

funderburkjim commented 4 years ago

the transformation is dictionary specific

Exactly right. In brief, the starting point xxx.txt of a dictionary is in the csl-orig repository. The 'installation' of a dictionary copies xxx.txt from csl-orig to xxx/orig/xxx.txt. This is converted to xml by the xxx/pywork/redo_xml.sh script. But this script is generated from a template: csl-pywork/v02/makotemplates/pywork/redo_xml.sh. There is a template variable dictlo that is used in various template expressions (such as %if dictlo in ['pw','pwg']: ....) to generate different code for different dictionaries. So, for example, pwg/pywork/redo_xml.sh will be different from mw/pywork/redo_xml.sh.

There is another level of templating in the display generation (going from xxx.xml to html), which occurs in php files. For example xxx/web/webtc/basicdisplay.php. This php program is actually the same for all dictionaries, but there are many (if ($dict == 'mw') {...}) which in effect generate different html for different dictionaries.

The best starting point to understand how the cologne/xxx directories are generated is the csl-pywork/v02/generate_dict.sh script which does the generation of one dictionary.

One of the reasons for the xml tag analysis you've worked on is to help in identifying places where simplifications can be made in the templating -- e.g., where artificial differences in markup can be resolved.

drdhaval2785 commented 4 years ago

@YevgenJohn

The discussion on DTD was done some years back.

https://github.com/fxru/CDSL-DTD-comparison/blob/master/comparison_CDSL_DTDs.csv is the output generated by @fxru. There may have been some minor changes in DTD tags over time, but this should give you some idea of how things stood some years back.

drdhaval2785 commented 4 years ago

https://github.com/sanskrit-lexicon/Cologne/issues/87 was the discussion thread for creating a unified DTD and eliminating unwanted discrepancies across dictionaries.

YevgenJohn commented 4 years ago

Thank you for providing me with the context of this situation and the practical limits we have in place with the server. I will confine anything targeted at the production server to the limits imposed (I understand that, being on the sysadmin side myself and imposing similar things on my own infrastructure for many reasons, including security). In the meantime we can do analytics in a developer environment with the required tools for the fastest answers, and if any of the scripts need to be re-run on the production server they can be downgraded to the tools available there. I am working on modifying the output to the desired state while catching up on the existing infrastructure and the projects, which is the learning curve. Thank you!

funderburkjim commented 4 years ago

@YevgenJohn Thanks for suggestions on using branches. I like idea of opening analytics branch in separate local folder, to avoid possible confusion with master branch.

YevgenJohn commented 4 years ago

I updated the output:

sort by col2 as major sort key, col3 as minor sort key

https://github.com/sanskrit-lexicon/csl-pywork/blob/analytics/v02/utilities/res23.txt

sort by col3 as major sort key, col2 as minor sort key

https://github.com/sanskrit-lexicon/csl-pywork/blob/analytics/v02/utilities/res32.txt

YevgenJohn commented 4 years ago

The candidates for further examination seem to be

acc

div.P

Whether alt and hw.alt are both needed.

With these results collected in one place (res* files above), would you like me to proceed with some tag unification? I guess we can eliminate upper/lower-case differences to start with, and get an idea of where to go from there.

drdhaval2785 commented 4 years ago

Upper lower unification seems to be a good starting point.

A tip.

There are some tags in xxx.txt file. While converting it to xxx.xml, a few more are added.

So first check whether the tag you want to unify is in xxx.txt file or not.
If yes, make change to xxx.txt file in csl-orig repo.
If the tag is introduced only in xxx.xml and is absent in xxx.txt, the culprit is make_xml.py file in csl-pywork folder.
Then you need to modify make_xml.py file.

You DO NOT make changes to xxx.xml file. xxx.xml file is computed from xxx.txt file every time we rerun. It is a COMPUTED file.
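The decision rule in the tip above can be sketched as a small helper. The function name `where_to_fix` and the sample strings are assumptions made for illustration; a plain substring test is used here, which is only a first approximation for tags whose attribute order or spacing varies.

```python
def where_to_fix(tag, txt_text, xml_text):
    """Suggest which file to edit in order to change `tag`.

    Returns 'xxx.txt' if the tag is present in the digitization,
    'make_xml.py' if it appears only in the generated xml,
    and None if it appears in neither.
    """
    if tag in txt_text:
        return 'xxx.txt'       # change the digitization in csl-orig
    if tag in xml_text:
        return 'make_xml.py'   # the tag is introduced during conversion
    return None

# Made-up snippets standing in for the contents of xxx.txt and xxx.xml
txt = u'<L>1<pc>1,1<k1>a <lang n="Greek">logos'
xml = u'<H1><h><key1>a</key1></h><body><lang n="Greek">logos</body></H1>'

suggestion_lang = where_to_fix(u'<lang n="Greek">', txt, xml)
suggestion_key1 = where_to_fix(u'<key1>', txt, xml)
```

In neither case is xxx.xml itself edited, in line with the note that it is a computed file.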

YevgenJohn commented 4 years ago

Thank you for this information. Yes, I looked at the chain of scripts mentioned in the installation instructions and saw those transformations (without understanding many of the details on a first viewing). I will try to do the lower case unification to get up to speed with it and deepen my understanding of those internals. I will create a separate branch for that and will make a merge request later on, rather than committing directly to the code, so the team has a chance to review changes before they merge into the master branch (we use this approach at work to minimize impact on the stable version of the repo). Thank you!

funderburkjim commented 4 years ago

Problem with `<H>` tag in acc

Looking at res32 and res23, there is 000049 acc <H>.
But looking at acc.xml (via cologne/acc/pywork/acc.xml) with text editor, there is no <H> tag.

@YevgenJohn -- We need to find out what's going on here, correct and rerun.

YevgenJohn commented 4 years ago

I found the only instance where lang n=Greek appears. I created my own branch, fixed it there, and created a pull request which the team can review and click 'merge' so the change will be merged into the master branch: https://github.com/sanskrit-lexicon/csl-orig/pull/4

We might need to adjust csl-pywork/blob/master/v02/makotemplates/pywork/one.dtd to exclude Greek from the option list:

<!ATTLIST lang n (greek | arabic | meter |slavic|russian|Russian|Greek|oldhebrew|Old-Church-Slavonic|Arabic|Hindustani|Persian|Turkish) #REQUIRED>

The only thing I don't know is how to check that such a modification won't break the validation, i.e. how to repeat the validation process. Please review and advise (the first step is slow so that I make sure I don't break things; further changes will be faster, and I'll include more changes in one pull request, so I don't bother you with review and merge too often). Thank you

P.S. Please advise whether you'd prefer following the pull-request model or would like to use the previous way of committing directly into the master branch.
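As a side note, the enumeration in the ATTLIST quoted above can itself be scanned for values that differ only by letter case. The sketch below does that; the function name `case_variants` is an assumption for illustration, and the result only lists candidates in the DTD, not which dictionaries actually use them.

```python
from collections import defaultdict

def case_variants(values):
    """Group values that differ only by letter case."""
    groups = defaultdict(list)
    for v in values:
        groups[v.lower()].append(v)
    # keep only the groups with more than one spelling
    return dict((k, sorted(vs)) for k, vs in groups.items() if len(vs) > 1)

# Enumeration copied from the ATTLIST for attribute 'n' of element 'lang'
enum = ("greek | arabic | meter |slavic|russian|Russian|Greek|oldhebrew|"
        "Old-Church-Slavonic|Arabic|Hindustani|Persian|Turkish")
values = [v.strip() for v in enum.split('|')]
variants = case_variants(values)
```

Here `variants` groups the greek, arabic and russian spellings, which suggests that Russian and Arabic may be candidates for the same treatment as Greek.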

funderburkjim commented 4 years ago

How do I review the pull request?

YevgenJohn commented 4 years ago

That's the url: https://github.com/sanskrit-lexicon/csl-orig/pulls The UI shows it under the pull request tab for the repository. Thank you! I find that model convenient: I create my branch from the master one, make changes there, send a pull request for review and merge, and once the changes are merged into the master branch I remove my branch. I repeat the cycle for the next chunk of changes. This gives more control over master branch modification: the team has a chance to review and vote against including changes, in contrast to merging changes into the master branch without review, where unwanted changes need to be wiped out in subsequent commits.

funderburkjim commented 4 years ago

OK -- have examined pull request. Looks good. Have merged and updated csl-orig repositories locally and at Cologne.

In this case, since <lang n="Greek"> occurred only in one dictionary (from the res32 file), and since its replacement <lang n="greek"> occurs already in one.dtd, we would be safe to remove 'Greek' from the <!ATTLIST lang n enumeration in one.dtd.

So, go ahead and do that (modify one.dtd in csl-pywork/v02).

The change to csl-orig needs to flow through to the displays. To accomplish this, in csl-pywork/v02:

locally, sh generate_dict.sh mw72 ../../mw72
At Cologne, sh generate_dict.sh mw72 ../../MW72Scan/2020/ [I'll do this once one.dtd is updated].

Note: When this is done at Cologne, the 'xmllint' step would be the place that shows any non-validation of xxx.xml with xxx.dtd.
Note: Locally, if you have xmllint in your CentOS installation, running sh generate_dict.sh mw72 ../../mw72 would also confirm that all is ok. You could do this before the csl-pywork commit.

Let's continue with the pull-request model for csl-orig. For csl-pywork and the others, we can use pull-request or direct master-branch commits, whichever you prefer.

As a final word, suggest you rerun the res23,32 in analytics branch once the above is done. That way, we'll have one less thing in those lists to consider.

YevgenJohn commented 4 years ago

Understood, processing the steps, thank you!

funderburkjim commented 4 years ago

@YevgenJohn I went ahead and dropped 'Greek' from enumerated attribute values for attribute 'n' of element 'lang' (in csl-pywork/v02/makotemplates/pywork/one.dtd). Reason: was making another minor change to one.dtd. Namely, removing element 'g', which was used only once in dictionary 'yat'; replaced it there (in c_orig) by <lang n="greek">.

I hope this does not interfere with your work -- if so, we'll have to develop ways not to step on each other's toes.

YevgenJohn commented 4 years ago

I have converted the values to lowercase and updated the dtd file. I also rebuilt the analytics results: https://github.com/sanskrit-lexicon/csl-pywork/blob/analytics/v02/utilities/res32.txt

YevgenJohn commented 4 years ago

I hope this does not interfere with your work -- if so, we'll have to develop ways not to step on each other's toes.

That didn't interfere at all. I create my own sandbox branch right before I do a change, I manage it into pull request which shows me if it has any conflicts for merge. I do only merge it back when there is no conflict for changes, so I can detect it and redo if needed. Thank you

gasyoun commented 4 years ago

examined pull request. Looks good. Have merged and updated csl-orig repositories locally and at Cologne.

Although @drdhaval2785 said he will be busy now, can he try to approve a pull request next time as well, so we are all in the loop?

Let's continue with the pull-request model for csl-orig. For csl-pywork and the others, we can use pull-request or direct master-branch commits, whichever you prefer.

Great, that's a small, but important step. Thanks @YevgenJohn for teaching it.

I create my own sandbox branch right before I do a change, I manage it into pull request which shows me if it has any conflicts for merge. I do only merge it back when there is no conflict for changes, so I can detect it and redo if needed.

Sounds like you do have some experience. Like we should all try to live that way.

sanskrit-lexicon / csl-pywork

dtd study #9
