sanskrit-lexicon / PWK

Sanskrit-Wörterbuch in kürzerer Fassung, 7 Bände Petersburg 1879-1889

Request for pw.xml file #19

Closed drdhaval2785 closed 8 years ago

drdhaval2785 commented 8 years ago

@funderburkjim For this literary source work, we are mostly going to grep pw.xml for the data we need. I request that you put pw.xml in this repository and update it whenever corrections are made to it. For example, I ran the crefmatch.py code and it reported VET.(U.) as an entry in pwbib that was not found in cref. When I searched pw.xml, I found the entry as <ls>VET.(U.</ls>). I remember that we already corrected these mismatched-bracket issues some time back. I am getting this error only because my copy of pw.xml has not been kept up to date with those corrections.

If you put the latest version here and update it often, there would be fewer false positives.
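
To make the mismatched-bracket case concrete, here is a minimal Python sketch that flags <ls> elements whose parentheses do not balance, such as a closing ")" that ended up outside the tag. The function name `unbalanced_ls` and the sample line are made up for illustration; this is not part of crefmatch.py.

```python
import re

# One <ls>...</ls> element, plus a ")" that may have leaked past the closing tag.
LS_RE = re.compile(r"<ls>(.*?)</ls>(\)?)")

def unbalanced_ls(line):
    """Return the <ls> snippets in `line` whose parentheses do not balance."""
    bad = []
    for m in LS_RE.finditer(line):
        body = m.group(1)
        if body.count("(") != body.count(")"):
            bad.append(m.group(0))
    return bad

# Made-up sample line: the first element is broken, the second is correct.
sample = "siehe <ls>VET.(U.</ls>) und <ls>VET.(U.)</ls>"
print(unbalanced_ls(sample))
```

Running something like this over a stale pw.xml and over a fresh one would show whether a reported mismatch comes from the data or only from an outdated local copy.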

gasyoun commented 8 years ago

Only because my copy of pw.xml is not keeping updated with this data, I am getting this error - yeah, the need gets bigger after we implement so many changes on a regular basis. Wonder if it's technically possible for Jim.

funderburkjim commented 8 years ago

I would prefer NOT to put pw.xml in the PWK repository. The main reason is its size (37 MB). This is a large file for git to keep track of. I have run into situations (in some other directory, maybe MWlexnorm, which has several multi-megabyte files) where git may take 10-15 minutes (rough estimate) to create a commit.

However, in working with this 'ls' task, I've come up with a quite manageable way to get a fresh copy of pw.xml from Cologne that makeabbrv (or other pwk programs) can operate on. I described it in the pw_dhaval readme.md, but in case that explanation was confusing, here are a few more words on the subject.

  1. It is part of the correction installation process at Cologne to regenerate the downloads appearing on the downloads page. In particular, when corrections are made to PW, the 'pwxml.zip' file is regenerated. So this file is the starting point for regenerating a local copy that programs in the PWK repository can work with.
  2. The pwxml.zip file contains a directory called just 'xml'. Here is a quick review of what is in this 'xml' folder.
    • pw.xml
    • pwheader.xml (contains the license for usage of pw.xml)
    • several programs and data files involved in the conversion of pw.txt from pw_orig.txt. These are what I thought were the essential parts of pywork for general distribution, but they do not include all of pywork.
  3. Now, how to get the latest version of pwxml.zip into a convenient spot on the local machine? My solution is to put the 'xml' folder into the parent of the local copy of the PWK repository, and to rename that folder to 'pwxml'. A GitBash script using curl can do all of this. In more detail:
  4. The local GitHub structure looks like:
    • GitHub
      • PWK
      • CORRECTIONS
      • (other repositories)
      • pwxml (constructed by running pwxml_init.sh, as described in step 5)
      • pwxml_init.sh (get this from the readme.md file mentioned above)
  5. The pwxml_init.sh Bash script can be run on the local machine, in a GitBash terminal:

    • cd to the GitHub location. In my setup, GitHub is in the Documents folder, and GitBash opens in the Users/Jim folder, which contains the Documents folder. So the cd command is

      cd Documents/GitHub

    • sh pwxml_init.sh

      After this script runs, a fresh copy of pwxml is now in the local machine, in the position described in Step 4. pw.xml is a file in this pwxml directory.

    • Programs or scripts in PWK which need to use pw.xml now have a stable relative path to pw.xml. I modified 'abbrv.py' (in the pw_dhaval directory of PWK) to read the path to pw.xml as a command-line argument (sys.argv[1]), and I modified makeabbrv.sh to pass the appropriate path, which in this case is '../../../../pwxml/pw.xml'.
# Relative path to the local copy of pw.xml
PW=../../../../pwxml/pw.xml
# Stop early if pw.xml has not been downloaded yet
if [ ! -e "$PW" ]
 then
  echo "path to PW does not exist: $PW"
  echo "See pw_dhaval/readme.md for where to get pw.xml"
  exit 1
fi

python abbrv.py "$PW"
echo "Converting the Anglicized Sanskrit to IAST"
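
On the Python side, reading the path from the command line could look roughly like the sketch below. The helper name `resolve_pw_path` is hypothetical and the actual code in abbrv.py may differ; the sketch only shows sys.argv[1] being picked up together with the same existence check that the shell script makes.

```python
import os
import sys

def resolve_pw_path(argv):
    """Return the pw.xml path given as the first command-line argument.

    Exits with a message if the argument is missing or the path does not exist.
    """
    if len(argv) < 2:
        raise SystemExit("usage: python abbrv.py <path-to-pw.xml>")
    path = argv[1]
    if not os.path.exists(path):
        raise SystemExit("path to PW does not exist: %s" % path)
    return path

# Inside a script one would call it as:
#   pw_path = resolve_pw_path(sys.argv)
```

Keeping the path out of the program itself means the same script works for anyone whose folder layout matches the one described above, and the caller decides where pw.xml lives.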

So, that's the system. It boils down to two steps:

  1. rerun pwxml_init.sh to get a fresh copy of pw.xml
  2. write programs to use the relative path to pw.xml.

You only need to rerun pwxml_init.sh when there have been changes to pw since you last got a copy of pwxml.

This seems to work well for me.

@drdhaval2785 Does this work ok for you?

funderburkjim commented 8 years ago

For ease of reference, here is pwxml_init.sh:

echo "downloading pwxml.zip"
curl -o pwxml.zip http://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/2014/downloads/pwxml.zip
echo "unzipping pwxml.zip to folder xml"
unzip pwxml.zip
echo "renaming xml to pwxml"
# -f: no error on the first run, when there is no old pwxml folder yet
rm -rf pwxml
mv xml pwxml
echo "removing pwxml.zip"
rm pwxml.zip

Put this pwxml_init.sh file into the 'GitHub' directory on your local machine.
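
For anyone who would rather not depend on curl and unzip, the same steps can be sketched in Python using only the standard library. The function names `download_pwxml` and `install_pwxml` are made up for this illustration, and the URL is simply the one used in the script above, so treat this as an assumption-laden sketch and not a tested replacement for pwxml_init.sh.

```python
import os
import shutil
import urllib.request
import zipfile

PWXML_URL = ("http://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/"
             "2014/downloads/pwxml.zip")

def download_pwxml(zip_path="pwxml.zip", url=PWXML_URL):
    """Download pwxml.zip (the curl step)."""
    urllib.request.urlretrieve(url, zip_path)
    return zip_path

def install_pwxml(zip_path, parent_dir="."):
    """Unpack the zip's 'xml' folder and rename it to 'pwxml' in parent_dir."""
    target = os.path.join(parent_dir, "pwxml")
    unpacked = os.path.join(parent_dir, "xml")
    # Remove leftovers from an earlier run; ignore_errors covers the first run.
    shutil.rmtree(target, ignore_errors=True)
    shutil.rmtree(unpacked, ignore_errors=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(parent_dir)
    os.rename(unpacked, target)
    return os.path.join(target, "pw.xml")

# Typical use, run from the 'GitHub' directory:
#   zip_path = download_pwxml()
#   install_pwxml(zip_path)
#   os.remove(zip_path)
```

Splitting the download from the unpack-and-rename step makes it possible to try the second part on a zip file that is already on disk.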

gasyoun commented 8 years ago

Detailed enough, thanks. In this case the only question is to know when to rerun. So manual checking is still required.

funderburkjim commented 8 years ago

Regarding 'when to rerun':

One answer is: Rerun any time you need to be sure you have the latest copy of pw.xml.

If local bandwidth is good, then this answer would suffice.

If local bandwidth is not so good (i.e., if downloading pwxml.zip is expensive), then some kind of date-checking preliminary would probably need to be added to pwxml_init.sh. The idea would be to compare the date of the cologne 'pwxml.zip' file to the date of something from the prior download of pwxml.zip, and only download if the local version was determined to be out of date.

Exactly how to do this date comparison could be tricky, due to time-zone differences.

Perhaps when I create pwxml.zip at Cologne, I need to add a tiny file which would contain the version date. And then the pwxml_init.sh program could compare the contents of this tiny file to the contents of this tiny file in the local copy, and only do the full pwxml download if there is a difference.

I'm just 'thinking out loud' here.
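
A rough sketch of the 'tiny version file' idea, in Python. Nothing like this exists yet: the function name `needs_download`, the file name version.txt, and the dates in the usage comment are all hypothetical. The point is only that comparing the two version strings for equality avoids interpreting dates, so time zones stop mattering.

```python
import os

def needs_download(local_version_file, remote_version_text):
    """Return True when the local copy is missing or its version differs.

    local_version_file:  path to the version file kept from the last download
    remote_version_text: contents of the same tiny file fetched from the server
    """
    if not os.path.exists(local_version_file):
        return True
    with open(local_version_file, encoding="utf-8") as f:
        local = f.read().strip()
    return local != remote_version_text.strip()

# Hypothetical use, after fetching the tiny remote file as a string:
#   if needs_download("pwxml/version.txt", remote_text):
#       ... do the full pwxml.zip download ...
```

Only the tiny file would be downloaded on every check; the full zip is fetched when the two strings differ.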

gasyoun commented 8 years ago

Thinking makes sense. We should not make it difficult for ourselves.

drdhaval2785 commented 8 years ago

@funderburkjim Bandwidth is definitely an issue in India. After I use up about 4 GB of data, my internet speed drops to 256 kb/s, which is too slow to download 70 MB. So please keep that in mind. It is not that urgent, but it matters for me to do this work well.

As of now, I can see that you add the list of corrections to manualByLine02.txt and then rerun update.sh or some part of it. If you can make manualByLine02.txt part of this PWK repository, I may be able to keep my file updated, rerun the update script, and get as fresh a copy of pw.xml as possible.

funderburkjim commented 8 years ago

@drdhaval2785 Since you mention '70MB', I want to clarify one distinction, that should definitely help.

There are two downloads for PWK that have been mentioned recently:

  1. The download of the pw environment. On the Cologne server, I generate backups to s3 of the orig, pywork, and web directories for pw. The intention is to download these directories under /c/xampp/htdocs/cologne/pw . This is what is described here. This download is roughly 70 MB for pw/.
  2. The download of the pwxml.zip file from the Cologne downloads folder. This is a selection of material from the pywork directory. The download is about 12 MB. It contains the latest pw.xml, as well as the 'manualByLine' files used in updating. This is what I described above in connection with pwxml_init.sh, which puts pwxml in the GitHub folder.

    So, for your purposes in decreasing bandwidth, and since you are probably just wanting the latest pw.xml, 2 is probably completely adequate for your needs most of the time.

  3. Once you have a copy of 1, it would be possible for you to keep pw.xml up to date with even less bandwidth usage than 2. Namely, you could
    • download the appropriate manualByLine file (currently, manualByLine03.txt for pw)
    • move that manualByLine file to /c/xampp/htdocs/cologne/pw/pywork/
    • run the appropriate 'updateByLine.py' script (to update /pw/orig/pw.txt)
    • do redo_hw.sh (to update /pw/pywork/pwhw2.txt)
    • do redo_xml.sh (to update /pw/pywork/pw.xml)
    • copy /pw/pywork/pw.xml to /c/Documents/GitHub/pwxml/pw.xml

This third option is a little more complicated, so I don't think it is appropriate for a 'general' user. However, it might be the best way (in terms of minimizing bandwidth usage) for you to keep an up-to-date version of pw.xml on your system, as the download would only be of the manualByLine file.

I'm not sure which (2 or 3) you want to use.
If you want to use 2 (12 MB download), the pwxml_init.sh script already works. If you want to use 3 (estimated 1 MB download), I'll work with you to get that system working. Let me know which way to go.
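
To put the three sizes in perspective, here is a back-of-the-envelope Python calculation of download time at the throttled speed mentioned earlier. It assumes that '256 kb/s' means 256 kilobits per second and that 1 MB is 8000 kilobits; if the figure meant kilobytes, the times would be eight times shorter. The function name `download_minutes` is made up for this example.

```python
def download_minutes(size_mb, speed_kbit_per_s=256):
    """Approximate download time in minutes, taking 1 MB as 8000 kilobits."""
    return size_mb * 8000 / speed_kbit_per_s / 60

# Sizes in MB as quoted in this thread (the last one is an estimate).
for label, size_mb in [("1: pw environment", 70),
                       ("2: pwxml.zip", 12),
                       ("3: manualByLine", 1)]:
    print("%-18s %4d MB  about %.1f min" % (label, size_mb, download_minutes(size_mb)))
```

Under those assumptions the three options come to roughly 36 minutes, 6 minutes, and half a minute.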

gasyoun commented 8 years ago

@drdhaval2785 I guess 3 would be optimal, and it should be possible to write a command-line script that does what is needed on Windows.

drdhaval2785 commented 8 years ago

3 is best.

funderburkjim commented 8 years ago

Here is what you can do to 'sync' with pw.

This is only available for pw.

It is assumed that you have already downloaded (recently) the 'pw' environment, and put the 'orig,pywork,web' directories for pw into /c/xampp/htdocs/cologne/pw/, as discussed here

In your Bash terminal, do the following (you could probably put this into a script):

  1. cd /c/xampp/htdocs/cologne/pw/pywork
  2. curl -o pwsync.zip http://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/2014/pywork/pwsync.zip
  3. unzip pwsync.zip
  4. sh update_sync.sh

  • Step 1 gets you to the right file-system location.
  • Step 2 downloads the latest file(s) containing update transactions (here, manualByLine03) and a Bash script (update_sync.sh). The download size is only about 0.5 MB.
  • Step 3 is obvious.
  • Step 4 goes through the steps to update things for pw. It takes a minute or so to run.

I tried it and it seems to work.

If this proves to be a good system for you, I'll put the regeneration of pwsync.zip at Cologne into the update process for pw.

Note: It should do no harm to repeat those 4 steps (i.e., the whole process is 'idempotent').
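
If you want those four steps in one script, a Python wrapper might look like the sketch below. The function names and the dry_run option are invented for this example. The `-o` flag on unzip (overwrite without prompting) is an addition of mine so that repeated runs do not stop to ask; it is not in the steps as listed. The pywork directory is passed in by the caller, since a native Windows Python may need a path such as C:/xampp/htdocs/cologne/pw/pywork instead of the GitBash form.

```python
import subprocess

SYNC_URL = ("http://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/"
            "2014/pywork/pwsync.zip")

def sync_commands(url=SYNC_URL):
    """Return steps 2-4 as argument lists (step 1 is the working directory)."""
    return [
        ["curl", "-o", "pwsync.zip", url],
        ["unzip", "-o", "pwsync.zip"],   # -o: overwrite files from an earlier sync
        ["sh", "update_sync.sh"],
    ]

def run_sync(pywork, dry_run=False):
    """Run the sync steps inside the pywork directory, stopping on the first failure."""
    for cmd in sync_commands():
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, cwd=pywork, check=True)

# Example (dry run only prints the commands):
#   run_sync("C:/xampp/htdocs/cologne/pw/pywork", dry_run=True)
```

With check=True a failed download stops the run before update_sync.sh is attempted, so a partial download does not get applied.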

gasyoun commented 8 years ago

So will it later be possible to reproduce this for the rest of the dictionaries, Jim?

drdhaval2785 commented 8 years ago

I have followed the instructions. The next time we have corrections in PW, I will check whether the script does what it is supposed to do.

funderburkjim commented 8 years ago

@drdhaval2785 After corrections from #21, there is new 'sync' data for you to try.

drdhaval2785 commented 8 years ago

@funderburkjim It worked well. Happy. Please make it part of the correction-handling routine so that update_sync stays in sync.

funderburkjim commented 8 years ago

The cologne part of the sync (make_sync.sh) is now part of correction handling for PW. So, update_sync will be in sync.