sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

Download specific dict.xml file #106

Closed drdhaval2785 closed 7 years ago

drdhaval2785 commented 7 years ago

It has become important for me to fetch the latest DICT.xml files of various dictionaries so that I can generate Stardict files from them. Currently I fetch the data from the Amazon server that Jim mentioned some time ago, but this also pulls in a whole lot of other material.

Is it possible to keep some script which can fetch only the latest .xml file and nothing else?

For example, sh fetchxml.sh ap or python fetchxml.py ap to fetch the latest ap.xml file? The folder could hold just two items: the script, and an output subfolder where all the fetched files go.

Doable @funderburkjim ?

funderburkjim commented 7 years ago

In the current S3 backup, there is no bucket devoted to just the xml forms of the dictionaries.

More specifically, you want the fetch to be of a zip-compressed version of a particular X.xml, right?

This should be doable. Will put on todo list.

drdhaval2785 commented 7 years ago

Yes, zip or tar.gz; anything compressed will do.

funderburkjim commented 7 years ago

Have added an xml file to the S3 backup regimen that is part of each dictionary update.

Since only acc has been updated since this regimen was added, only the acc dictionary has such a file.

Here is a script to download the xml file for a dictionary:

# shell script takes a single argument, a dictionary code
if [ -z "$1" ]; then
 echo "script requires a dictionary code as parameter"
 echo "Usage: sh xmldownload.sh <dictcode>"
 echo "<dictcode> must be one of the dictionary codes"
 echo "see http://www.sanskrit-lexicon.uni-koeln.de/"
 exit 1
fi
# convert shell script argument to lower case
DICT=$(echo "$1" | tr '[:upper:]' '[:lower:]')
echo "downloading ${DICT}_xml.zip ..."
curl -o "${DICT}_xml.zip" "http://s3.amazonaws.com/sanskrit-lexicon/blobs/${DICT}_xml.zip"

Assume this script is named xmldownload.sh.

Usage example: sh xmldownload.sh acc

This results in a download of 'acc_xml.zip' from S3.

acc_xml.zip, when unzipped, has the following structure:

gasyoun commented 7 years ago

> Maybe the script should do the renaming

Makes sense.

> Have added an xml file to the S3 backup regimen

Quick one.

drdhaval2785 commented 7 years ago

> Since only acc has been updated since this regimen was added, only the acc dictionary has such a file.

As most of our scripts are indempont, can you please run a (potentially empty) update on all dicts, so that xmls as of now become available for all dicts?

I am asking because I last updated my local copies of the Cologne dicts a year ago, so I need fresh copies.

funderburkjim commented 7 years ago

Have generated all the xxx_xml.zip files.

Note: total size of all is about 110MB.

'indempont': did you mean 'idempotent'?

drdhaval2785 commented 7 years ago

On 7 Apr 2017 01:30, "funderburkjim" notifications@github.com wrote:

> Have generated all the xxx_xml.zip files.
>
> Note: total size of all is about 110MB.

Hurray..

> 'indempont' did you mean 'idempotent' ?

Idempotent. I had a grossly wrong impression of the word in my mind. Thanks for the correction.

drdhaval2785 commented 7 years ago

This works well. Closing the issue.

drdhaval2785 commented 7 years ago

https://github.com/sanskrit-lexicon/cologne-stardict/blob/master/updatexml.sh

# bash script (uses arrays): downloads and unzips the xml zip of every dictionary
# assumes the directories input/zips and input/extracted already exist
dictList=(acc ae ap ap90 ben bhs bop bor bur cae ccs gra gst ieg inm krm mci md mw mw72 mwe pd pe pgn pui pw pwg sch shs skd snp stc vcp vei wil yat)
for DICT in "${dictList[@]}"
do
echo "downloading ${DICT}_xml.zip ..."
curl -o "input/zips/${DICT}_xml.zip" "http://s3.amazonaws.com/sanskrit-lexicon/blobs/${DICT}_xml.zip"
done

# stop here if the extraction directory is missing, rather than unzipping elsewhere
cd input/extracted || exit 1
for DICT in "${dictList[@]}"
do
echo "unzipping ${DICT}_xml.zip ..."
unzip -o "../zips/${DICT}_xml.zip"
done

This is the code which works for me as of now.
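One caveat with a plain curl -o: if a blob is missing, the server's error response may be saved under the .zip name and only show up later as an unzip failure. A minimal sketch of a stricter download step, assuming curl is installed; the helper names xml_zip_url and fetch_xml_zip are hypothetical and not part of the repository, while the base URL is the one used by the scripts in this thread.

```shell
# Base URL taken from the scripts in this thread.
BLOB_URL="http://s3.amazonaws.com/sanskrit-lexicon/blobs"

# xml_zip_url CODE (hypothetical helper)
# Print the download URL for a dictionary code, lower-casing the code first.
xml_zip_url() {
  code=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  printf '%s/%s_xml.zip\n' "$BLOB_URL" "$code"
}

# fetch_xml_zip CODE DESTDIR (hypothetical helper)
# Download one zip into DESTDIR. With --fail, curl exits non-zero on an
# HTTP error, so a failed download is reported and no bogus .zip is kept.
fetch_xml_zip() {
  code=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  dest="$2/${code}_xml.zip"
  if ! curl --fail --silent --show-error -o "$dest" "$(xml_zip_url "$code")"; then
    echo "download failed for $code" >&2
    rm -f "$dest"
    return 1
  fi
}
```

Usage example: fetch_xml_zip acc input/zips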

drdhaval2785 commented 7 years ago

The header files are missing for the following dictionaries:

AP BHS BOP BOR CAE CCS GRA GST IEG INM KRM MCI MW PD PE PGN PUI PW PWG SHS SKD SNP STC VCP VEI WIL YAT
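A list like this can be produced by scanning the downloaded archives. A minimal sketch, assuming the header is stored as <code>header.xml (the xxxheader.xml naming) somewhere inside each zip, that the zips are named <code>_xml.zip as in the download script, and that unzip is installed. The function names are hypothetical.

```shell
# has_header CODE (hypothetical helper)
# Reads archive member names on stdin, one per line. Succeeds if a member
# named CODEheader.xml is present, at the top level or inside a folder.
has_header() {
  grep -q -E "(^|/)${1}header\.xml\$"
}

# report_missing_headers ZIPDIR CODE... (hypothetical helper)
# Prints each dictionary code whose ZIPDIR/CODE_xml.zip has no header file.
# An unreadable or absent zip is also reported, since its listing is empty.
report_missing_headers() {
  zipdir=$1
  shift
  for code in "$@"; do
    # unzip -Z1 lists member names only, one per line
    if ! unzip -Z1 "$zipdir/${code}_xml.zip" 2>/dev/null | has_header "$code"; then
      echo "$code"
    fi
  done
}
```

Usage example: report_missing_headers input/zips acc ap mw

The anchored pattern matters here because several codes are prefixes of others: a plain substring match would let ap90header.xml satisfy the check for ap.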

funderburkjim commented 7 years ago

The missing xxxheader.xml is a bug.

The creation of the xxx_xml.zip file assumes that xxxheader.xml is in the pywork directory.

Originally, xxxheader.xml was kept in the downloads directory.

I have moved xxxheader.xml to pywork only in a haphazard way.

Need to write a script to do this systematically.

And then regenerate all the S3 xxx_xml.zip files.

On todo list, high priority.
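The systematic move could look roughly like the following. This is only a sketch: the actual directory layout on the server is not described in this thread, so both directories are passed in as arguments, and the function name move_header is hypothetical.

```shell
# move_header CODE DOWNLOADS_DIR PYWORK_DIR (hypothetical helper)
# Move CODEheader.xml from the downloads directory to the pywork directory.
# Safe to re-run: if the header is already in pywork, nothing is changed.
move_header() {
  header="${1}header.xml"
  if [ -f "$3/$header" ]; then
    echo "$1: already in pywork"
  elif [ -f "$2/$header" ]; then
    mv "$2/$header" "$3/$header" && echo "$1: moved"
  else
    echo "$1: header not found" >&2
    return 1
  fi
}
```

Usage example: move_header acc path/to/downloads path/to/pywork, where both paths are placeholders. Because a second run reports "already in pywork" and does nothing, the function can be looped over all dictionary codes repeatedly without harm.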

gasyoun commented 7 years ago

> I have moved xxxheader.xml to pywork only in a haphazard way. Need to write a script to do this systematically.

Ouch, that list is too scary to even look at.

funderburkjim commented 7 years ago

All the xxx_xml.zip S3 backups have been regenerated; all should contain xxxheader.xml files.

I think this issue can be closed.