Scanned images for dictionaries on S3

funderburkjim commented 9 years ago

This continues the discussion begun here.

The scanned images for each of the digitized dictionaries have now been backed up to S3.

Total # of bytes = 9441762881 (9+GB)
Total # of files = 47226 (including 36 directory files).

This took about an hour, due to fantastic bandwidth at both Cologne and AWS S3.
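As a rough check on the transfer rate, the figures above can be turned into a throughput estimate. This is only a sketch: it assumes "about an hour" means exactly 3600 seconds and uses decimal units, so the result (roughly 2.6 MB/s, about 21 Mbit/s) is approximate.

```python
# Figures quoted in the post above; the duration is an assumption.
total_bytes = 9_441_762_881
seconds = 3600  # assumption: "about an hour" taken as exactly one hour


def throughput(total_bytes: int, seconds: float) -> tuple[float, float]:
    """Return (megabytes per second, megabits per second) in decimal units."""
    bytes_per_second = total_bytes / seconds
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e6


mb_s, mbit_s = throughput(total_bytes, seconds)
print(f"{total_bytes / 1e9:.2f} GB in {seconds} s: "
      f"{mb_s:.2f} MB/s ({mbit_s:.1f} Mbit/s)")
```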

I'll make a guide to the file-name patterns tomorrow.

I may also back up the ancillary scans (Whitney Roots, Kale Grammar, Westergaard Roots) to S3.

gasyoun commented 9 years ago

Ancillary scans are of the greatest interest. It seems out of the three mentioned only for Kale we do not have an .xml file of the text itself. I'm working on Apte's Sanskrit Syntax. So I guess I'll not work on Kale's Grammar .xml, but I wonder whether Peter had it in his plans. As a former worker at a backup company, I can only salute this effort. 9 GB in one hour is an almost impossible speed.

funderburkjim commented 9 years ago

A list of the filenames of the scans is here.

A summary of the filenames is here.

funderburkjim commented 9 years ago

PDFs for the dictionaries are now backed up to the pdfs folder of the sanskrit-lexicon bucket on AWS S3.

The names of the files in this folder are here.

There are 75 files, 9.4GB.

drdhaval2785 commented 9 years ago

Great to document all that is backed up. @funderburkjim Can you do the same for the web and pywork folders as well? That would ensure that at least the names of the files are known to the open community.

gasyoun commented 9 years ago

Shouldn't we make all names uniform, including

BENFEY.pdf
apte.pdf
burnouf.pdf
pgn.pdf

right now? I guess there will be no other chance.

funderburkjim commented 9 years ago

re: make all names uniform right now?

Now IS a good time to identify irregularities, and propose changes that would promote uniformity.
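One simple way to spot irregular names is a mechanical check over the file list. The sketch below assumes that all-lowercase names are the intended convention, which has not been decided in this thread; `irregular_names` is a hypothetical helper, and it does not catch other irregularities such as inconsistent name length.

```python
def irregular_names(names):
    """Return the names that are not entirely lowercase."""
    return [name for name in names if name != name.lower()]


# The four file names mentioned above.
pdfs = ["BENFEY.pdf", "apte.pdf", "burnouf.pdf", "pgn.pdf"]
print(irregular_names(pdfs))  # ['BENFEY.pdf']
```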

gasyoun commented 9 years ago

I agree. I have identified those that seem fishy to me.

+skd1_bookmark.pdf
+skd2_bookmark.pdf
+skd3_bookmark.pdf
+skd4_bookmark.pdf
+skd5_bookmark.pdf
+skd_title.pdf

To sort properly, skd_title.pdf should become skd0_title.pdf, so the list will be

+skd0_title.pdf
+skd1_bookmark.pdf
+skd2_bookmark.pdf
+skd3_bookmark.pdf
+skd4_bookmark.pdf
+skd5_bookmark.pdf

Actually, I do not see any reason to keep the _bookmark part in the filenames. The publishing year would make more sense. Or nothing at all.
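The sorting problem can be checked with a short sketch. It assumes the listing is ordered by a plain character-code sort (Python's default for strings), where the underscore sorts after the digits.

```python
names = [
    "skd_title.pdf",
    "skd1_bookmark.pdf",
    "skd2_bookmark.pdf",
    "skd3_bookmark.pdf",
    "skd4_bookmark.pdf",
    "skd5_bookmark.pdf",
]

# With the current name, the title file sorts last: '_' comes after '1'..'5'.
current_order = sorted(names)
print(current_order[-1])  # skd_title.pdf

# With the proposed name, the title file sorts first.
renamed = ["skd0_title.pdf" if n == "skd_title.pdf" else n for n in names]
proposed_order = sorted(renamed)
print(proposed_order[0])  # skd0_title.pdf
```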

funderburkjim commented 9 years ago

The Xweb1.zip files have been backed up to S3. The list of filenames (and approximate sizes) is here.

Xweb1.zip contains the code and data for the displays of dictionary X, except that the scanned images are absent. When unzipped and put in the appropriate spot for a php web server, the files in Xweb1.zip should give a local copy of the displays.
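For anyone who wants to script the unzip step, here is a minimal sketch using Python's standard zipfile module. The function name `extract_web1` and the paths in the usage note are hypothetical, and the right target directory depends on how your php web server is set up.

```python
import zipfile
from pathlib import Path


def extract_web1(zip_path: str, web_root: str) -> list[str]:
    """Extract a web1 zip under web_root and return the archive member names."""
    target = Path(web_root)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target)
        return zf.namelist()
```

A call would look like `extract_web1("Xweb1.zip", "/path/to/webroot")`, where both arguments are placeholders for a downloaded zip and a directory served by php.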

Xweb1.zip is available for download on the downloads page for dictionary X at Cologne.

The total size of all the Xweb1.zip files is about 500MB.

I'm not sure whether the Xweb.zip files (which DO contain the scanned images) should be backed up to S3; their total size would be approximately 10GB. These files are also available on the Cologne server.

funderburkjim commented 9 years ago

In preparing for the pywork backups, I discovered a directory of PW scans in PDF form. These were backed up to the scans/PWpdfs directory; here are the file names.

funderburkjim commented 9 years ago

Zipped backups of the 'pywork' and 'orig' directories have been posted to S3.

Here are the file lists for pywork and for orig.

Altogether, a bit over 1GB for these zips.

The zips are only slightly 'optimized', so there are rough edges that a technical user would have to deal with.

funderburkjim commented 9 years ago

blobs/mw_aux.zip uploaded (about 4MB).

This contains material for several subsystems used in the MW2014 displays. These are organized into subfolders:

mwab           abbreviations
mwauthorities  literary sources
mwgreek        Greek text in MW
mwkeys         associates to each key1 a set of L-numbers
westmwtab      links to Westergaard *
whitmwtab      links to Whitney Roots *

* The scans and displays for these two references are not here directly. They have yet to be backed up.

funderburkjim commented 9 years ago

Backup of material for ancillary scans.

blobs: Kale.zip, Whitney.zip, Westergaard.zip. Each of these three is a standalone application providing indexed access to the scanned images of the respective works. They are closely based on Cologne applications. The zips contain both the php application, and the required scans. The sizes are shown here.

For Kale, the individual png image files were also backed up to the scans/KALE directory of the S3 bucket. The file names have been added to the list and the file name convention to the summary.

gasyoun commented 9 years ago

pywork might be the most interesting. 'Xweb.zip files (which DO contain the scanned images)' are not of much use; it's just one more variation. I would insist on having one version fully ready for web display on a local network, with everything else LEGO-like.

funderburkjim commented 9 years ago

Just a comment on the different blobs.

Currently there are two blobs for each dictionary X (in the blobs folder of the S3 bucket):

X_orig.zip: the digitization text file in all its forms, from the original (X_orig.txt) to the current final version (X.txt).

X_pywork.zip: programs of three types:

for updating: transforming from X_orig.txt to X.txt; usually, update.sh details the program sequence.

for generating headword lists (hw0, hw1, hw2).

for constructing X.xml from Xhw2.txt and X.txt.

Then there is the web1 folder of the S3 bucket. For each dictionary X, it contains Xweb1.zip, which holds the php-based displays for the dictionary X (without the scans). The dictionary data is in a sqlite3 database constructed from X.xml.

Thus, to have a working dictionary on a desktop (or other) php web server, all you need is Xweb1. This Xweb1.zip is identical to the one on the downloads page for dictionary X at Cologne.
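Since the display data lives in a sqlite3 database, a quick way to look inside one after unzipping is to list its tables. This is a generic sketch using Python's standard sqlite3 module; it makes no assumption about the actual table names or the database file name, both of which you would need to find in the unzipped files. `list_tables` is a hypothetical helper.

```python
import sqlite3


def list_tables(db_path: str) -> list[str]:
    """Return the names of the tables in a sqlite3 database file.

    Note: sqlite3.connect creates an empty file if db_path does not exist.
    """
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]
```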

If you want to do maintenance (corrections, alterations of the xml structure, etc.), you need both X_orig and X_pywork.
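To summarize the layout described above, the three backup files for a dictionary can be written as a small mapping. This is only a sketch: `backup_keys` is a hypothetical helper, and the exact spelling and letter case of the dictionary code in the real file names is an assumption to be checked against the posted file lists.

```python
def backup_keys(code: str) -> dict[str, str]:
    """Map each backup kind to its folder/file name for dictionary `code`.

    Folder names (blobs, web1) and file-name patterns follow the comment
    above; the form of `code` in real names is an assumption.
    """
    return {
        "orig": f"blobs/{code}_orig.zip",      # digitization text, all forms
        "pywork": f"blobs/{code}_pywork.zip",  # update, headword, xml programs
        "web1": f"web1/{code}web1.zip",        # php displays without scans
    }


print(backup_keys("X"))
```

For a local display only the "web1" entry is needed; for maintenance, both "orig" and "pywork".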

Since nobody much but me has experimented with this, there are probably a few rough edges.

funderburkjim commented 9 years ago

I think backups are finished.

drdhaval2785 commented 9 years ago

@funderburkjim I opine that, rather than the names of the zip files, a list of the files in them and what they do would be of more importance. It is time you wrote a short description (at least one line) of each of the code files you back up there.

funderburkjim commented 9 years ago

@gasyoun Re: 'It seems out of the three mentioned only for Kale we do not have an .xml file of the text itself. I'm working on Apte's Sanskrit Syntax'

There is no digitization for either Kale or Westergaard.

There is a digitization of Whitney, which Peter and Malcolm and assistants developed; it does not reside in any of the Cologne backups, and is not used in the Whitney display mentioned.

There IS a display of Whitney's Roots on Peter's site (sanskritlibrary.org) under Reference, which is based on the digitization.

funderburkjim commented 9 years ago

Re: 'I'm working on Apte's Sanskrit Syntax'

Is there a link to the scans? What is the text? Are you working on a digitization?

funderburkjim commented 9 years ago

Renamed BENFEY.pdf to ben.pdf.

Renamed burnouf.pdf to bur.pdf.

Removed apte.pdf, and uploaded ap1_bookmark.pdf and ap2, ap3.

sanskrit-lexicon / COLOGNE

Scanned images for dictionaries on S3 #64