Closed drdhaval2785 closed 8 years ago
Only because my copy of pw.xml is not being kept up to date with this data, I am getting this error - yes, the need grows as we make so many changes on a regular basis. I wonder if it's technically possible for Jim.
I would prefer NOT to put pw.xml in the PWK repository. The main reason regards its size (37MB). This is a large file for git to keep track of. I have run into the situation (in some other directory, maybe MWlexnorm which has several multi-megabyte files) where it may take 10-15 minutes (rough estimate) for git to create a commit.
However, in working with this 'ls' task, I've come up with a quite manageable way to get a fresh copy of pw.xml from Cologne that makeabbrv (or other pwk programs) can operate on. I described it in the pw_dhaval readme.md, but in case that explanation was confusing, here are a few more words on the subject.
The pwxml_init.sh Bash script can be run on the local machine, in a GitBash terminal:
cd to the GitHub location. In my setup, GitHub is in the Documents folder, and GitBash opens in the Users/Jim folder, which contains the Documents folder. So, to do this cd, the command is
cd Documents/GitHub
sh pwxml_init.sh
After this script runs, a fresh copy of pwxml is on the local machine, in the position described in Step 4. pw.xml is a file in this pwxml directory.
PW=../../../../pwxml/pw.xml
if [ ! -e "$PW" ]
then
echo "path to PW does not exist: $PW"
echo "See pw_dhaval/readme.md for where to get pw.xml"
exit 1
fi
python abbrv.py $PW
echo "Converting the Anglicized Sanskrit to IAST"
So, that's the system. It boils down to two steps:
You only need to rerun pwxml_init.sh when there have been changes to pw since you last got a copy of pwxml.
This seems to work well for me.
@drdhaval2785 Does this work ok for you?
For ease of reference, here is pwxml_init.sh:
echo "downloading pwxml.zip"
curl -o pwxml.zip http://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/2014/downloads/pwxml.zip
echo "unzipping pwxml.zip to folder xml"
unzip pwxml.zip
echo "renaming xml to pwxml"
rm -r pwxml
mv xml pwxml
echo "removing pwxml.zip"
rm pwxml.zip
Put this pwxml_init.sh file into the 'GitHub' directory on your local machine.
Detailed enough, thanks. In this case the only question is when to rerun. So manual checking is still required.
Regarding 'when to rerun':
One answer is: Rerun any time you need to be sure you have the latest copy of pw.xml.
If local bandwidth is good, then this answer would suffice.
If local bandwidth is not so good (i.e., if downloading pwxml.zip is expensive), then some kind of preliminary date check would probably need to be added to pwxml_init.sh. The idea would be to compare the date of the Cologne 'pwxml.zip' file to the date of something from the prior download of pwxml.zip, and only download if the local version is determined to be out of date.
Exactly how to do this date comparison could be tricky, due to time-zone differences.
Perhaps when I create pwxml.zip at Cologne, I need to add a tiny file which would contain the version date. Then the pwxml_init.sh program could compare the contents of this tiny file at Cologne with the contents of the same file in the local copy, and only do the full pwxml download if there is a difference.
I'm just 'thinking out loud' here.
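The version-file idea above could be sketched roughly as follows. This is only an illustration under assumptions: no such version file exists in pwxml.zip today, and the names needs_download, version.txt, version_remote.txt and VERSION_URL are hypothetical. Comparing file contents rather than timestamps sidesteps the time-zone question.

```shell
# needs_download LOCAL_VERSION_FILE REMOTE_VERSION_FILE
# Succeeds (exit status 0) when a full download of pwxml.zip is needed.
# Both arguments are paths to small text files holding a version date.
# Hypothetical helper: the version file is an assumption, not part of pwxml.zip.
needs_download() {
  local_version=$1
  remote_version=$2
  # No local version file yet: nothing to compare against, so download.
  [ -f "$local_version" ] || return 0
  # cmp -s prints nothing and exits non-zero when the contents differ;
  # the leading "!" turns "files differ" into success ("download needed").
  ! cmp -s "$local_version" "$remote_version"
}

# Possible use in a wrapper around pwxml_init.sh
# (VERSION_URL is a placeholder, not a real file at Cologne):
#   curl -s -o version_remote.txt "$VERSION_URL"
#   if needs_download pwxml/version.txt version_remote.txt; then
#     sh pwxml_init.sh
#   fi
```

With this shape, the only routine download is the tiny version file; the 12MB zip is fetched only when the two files differ or when there is no local copy yet.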
Thinking makes sense. We should not make it difficult for ourselves.
@funderburkjim Bandwidth is an issue in India for sure. After I exhaust some 4 GB of data usage, my internet speed drops to 256 kb/s. That is too slow to download 70 MB. So please keep my situation in mind. It is not that urgent, but it is important to help me do this well.
As of now, I can see you are adding a list of corrections to manualByLine02.txt and then rerunning update.sh or some part of it. If you can make manualByLine02.txt part of this PWK repository, I may be able to keep my file updated, rerun the update script and get as fresh a copy of pw.xml as possible.
@drdhaval2785 Since you mention '70MB', I want to clarify one distinction, that should definitely help.
There are two downloads for PWK that have been mentioned recently:
The download of the pwxml.zip file from the Cologne downloads folder. This is a selection of material from the pywork directory. The download here is about 12MB. It contains the latest pw.xml, as well as the 'manualByLine' files used in updating. This is what I described above in connection with pwxml_init.sh, which puts pwxml in the GitHub folder.
So, for your purposes in decreasing bandwidth, and since you are probably just wanting the latest pw.xml, 2 is probably completely adequate for your needs most of the time.
This third option is a little more complicated, so I don't think it is appropriate for a 'general' user. However, it might be the best way (in terms of minimizing bandwidth usage) for you to keep an up-to-date version of pw.xml on your system, as the download would only be of the manualByLine file.
I'm not sure which (2 or 3) you want to use.
If you want to use 2 (12MB download), the pwxml_init.sh script already works.
If you want to use 3 (estimated 1MB download), I'll work with you to get that system working.
Let me know which way to go.
@drdhaval2785 I guess 3 would be optimal, and it should be possible to make a command-line script to do what is needed on Windows.
3 is best.
Here is what you can do to 'sync' with pw.
This is only available for pw.
It is assumed that you have already downloaded (recently) the 'pw' environment, and put the 'orig,pywork,web' directories for pw into /c/xampp/htdocs/cologne/pw/, as discussed here
In your Bash terminal, do the following (you could probably put this into a script):
Step 1 gets you in the right file system location.
Step 2 downloads the latest file(s) containing update transactions (here, manualByLine03), and a bash script (update_sync.sh). The download size is only about 0.5MB.
Step 3 is obvious.
Step 4 goes through the steps to update things for pw. It takes a minute or so to run.
I tried it and it seems to work.
If this proves to be a good system for you, I'll put the regeneration of pwsync.zip at Cologne into the update process for pw.
Note: It should do no harm to repeat those 4 steps (i.e., the whole process is 'idempotent').
So will it be possible later to reproduce this for the rest, Jim?
I have followed the instructions. Next time we have some corrections in PW, I will check whether the script does what it is supposed to do.
@drdhaval2785 After corrections from #21, there is new 'sync' data for you to try.
@funderburkjim It worked well. Happy. Please make it a part of the correction handling routine so that update_sync stays in sync.
The cologne part of the sync (make_sync.sh) is now part of correction handling for PW. So, update_sync will be in sync.
@funderburkjim In this literary resource work, we are mostly going to use the pw.xml file for grepping the data we need. I would request you to put the pw.xml file in this repository and update it as soon as corrections are made to it. For example, I ran the crefmatch.py code and got VET.(U.) as an entry in pwbib not found in cref. When I searched pw.xml, I found <ls>VET.(U.</ls>) as the entry. I remember that we already corrected this mismatched-brackets issue some time back. I am getting this error only because my copy of pw.xml is not being kept up to date with this data. If you place the latest copy here and update it often, there would be fewer false positives.
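As an aside, mismatched brackets of this kind can be spotted in a local copy of pw.xml with a small grep and awk pipeline. This is a minimal sketch, not part of crefmatch.py or the PWK scripts: the function name find_unbalanced_ls is hypothetical, and it assumes each literary-source reference is written as a plain <ls>...</ls> element with no attributes or nested tags.

```shell
# find_unbalanced_ls FILE
# Print every <ls>...</ls> element in FILE whose text contains a different
# number of "(" and ")" characters, e.g. <ls>VET.(U.</ls>
# Hypothetical helper; assumes plain <ls> tags without attributes.
find_unbalanced_ls() {
  # grep -o prints each matching element on a line of its own.
  grep -o '<ls>[^<]*</ls>' "$1" | awk '{
    # gsub returns the number of replacements made; replacing each
    # parenthesis with itself counts them without changing the line.
    opens = gsub(/\(/, "(")
    closes = gsub(/\)/, ")")
    if (opens != closes) print
  }'
}

# Example use, with the path to a local copy of pw.xml as the argument:
#   find_unbalanced_ls ../pwxml/pw.xml
```

Run against a stale copy, this would list entries such as the one above; run against a fresh copy, it gives a quick check that the earlier corrections have arrived.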