Closed funderburkjim closed 9 years ago
Ancillary scans are of greatest interest. It seems out of the three mentioned only for Kale we do not have an .xml file of the text itself. I'm working on Apte's Sanskrit Syntax. So I guess I won't work on an .xml of Kale's Grammar, but I wonder whether Peter had it in his plans. As a former worker at a backup company, I can only salute this effort. 9 GB in one hour is an almost impossible speed.
pdfs for the dictionaries are now backed up to the pdfs folder of the sanskrit-lexicon bucket on AWS S3.
The names of the files in this folder are here.
There are 75 files, 9.4GB.
Great to document all that is backed up. @funderburkjim Can you do the same for the web and pywork folders as well? That would ensure that at least the file names are known to the open community.
Shouldn't we make all names uniform, including
BENFEY.pdf
apte.pdf
burnouf.pdf
pgn.pdf
right now? I guess there will be no other chance.
re: make all names uniform right now?
Now IS a good time to identify irregularities, and propose changes that would promote uniformity.
I agree. I have identified those that seem fishy to me.
+skd1_bookmark.pdf
+skd2_bookmark.pdf
+skd3_bookmark.pdf
+skd4_bookmark.pdf
+skd5_bookmark.pdf
+skd_title.pdf
To sort properly, skd_title.pdf should become skd0_title.pdf, so the list will be
+skd0_title.pdf
+skd1_bookmark.pdf
+skd2_bookmark.pdf
+skd3_bookmark.pdf
+skd4_bookmark.pdf
+skd5_bookmark.pdf
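The sorting argument can be checked with a short sketch. It assumes a plain character-code (ASCII) sort, which is what Python's `sorted` applies to these names; an S3 listing or a file browser may order names differently.

```python
# File names as listed in the thread, before the proposed rename.
before = [
    "skd1_bookmark.pdf",
    "skd2_bookmark.pdf",
    "skd3_bookmark.pdf",
    "skd4_bookmark.pdf",
    "skd5_bookmark.pdf",
    "skd_title.pdf",
]

# Proposed rename from the thread: skd_title.pdf -> skd0_title.pdf
after = ["skd0_title.pdf" if name == "skd_title.pdf" else name for name in before]

# "_" has a higher character code than any digit, so the title file
# sorts last under the old name and first under the new one.
sorted_before = sorted(before)
sorted_after = sorted(after)

print(sorted_before[-1])  # skd_title.pdf
print(sorted_after[0])    # skd0_title.pdf
```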
Actually, I see no reason to keep the _bookmark part in the filenames. The publishing year would make more sense. Or nothing at all.
The Xweb1.zip files have been backed up to S3. The list of filenames (and approximate sizes) is here.
Xweb1.zip contains the code and data for the dictionary X displays, except that the scanned images are absent. When unzipped and placed in the appropriate spot on a php web server, the files in Xweb1.zip should give a local copy of the displays.
Xweb1.zip is available for download on the downloads page for dictionary X at Cologne.
The total size of all the Xweb1.zip files is about 500MB.
I'm not sure whether the Xweb.zip files (which DO contain the scanned images) should be backed up to S3; their total size would be approximately 10GB. These files are also available on the Cologne server.
In preparing for the pywork backups, I discovered a directory of PW scans in PDF form. These were backed up to the scans/PWpdfs directory; here are the file names.
blobs/mw_aux.zip uploaded (about 4MB).
This contains material for several subsystems used in the MW2014 displays. These are organized into subfolders:
mwab abbreviations
mwauthorities literary sources
mwgreek Greek text in MW
mwkeys Associates a set of L-numbers with each key1
westmwtab Links to Westergaard *
whitmwtab Links to Whitney Roots*
* The scans and displays for these two references are not here directly. They have yet to be backed up.
Backup of material for ancillary scans.
blobs: Kale.zip, Whitney.zip, Westergaard.zip. Each of these three is a standalone application providing indexed access to the scanned images of the respective works. They are closely based on Cologne applications. The zips contain both the php application, and the required scans. The sizes are shown here.
For Kale, the individual png image files were also backed up to the scans/KALE directory of the S3 bucket. The file names have been added to the list and the file name convention to the summary.
pywork might be the most interesting. The 'Xweb.zip files (which DO contain the scanned images)' are not of much use; they are just one more variation. I would insist on having one version fully ready for web display on a local network, with everything else LEGO-like.
Just a comment on the different blobs.
Currently there are two blobs for each dictionary X (in the blobs folder of the S3 bucket):
X_orig.zip: the digitization text file in all its forms, from the original (X_orig.txt) to the current final version (X.txt).
X_pywork.zip: programs of three types:
for updating, i.e. transforming X_orig.txt into X.txt; usually, update.sh details the program sequence;
for generating headword lists (hw0, hw1, hw2);
for constructing X.xml from Xhw2.txt and X.txt.
Then there is the web1 folder of the S3 bucket. For each dictionary X, it contains Xweb1.zip, which holds the php-based displays for dictionary X (without the scans). The dictionary data is in a sqlite3 database constructed from X.xml.
Thus, to have a working dictionary on a desktop (or other) php web server, all you need is Xweb1.
This zip Xweb1.zip is identical to that on the downloads page for dictionary X at Cologne.
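As a rough illustration of the "all you need is Xweb1" step, here is a minimal sketch that unpacks a downloaded zip under a web server's document root. The function name `extract_web1`, the example paths, and the choice to unpack into a subfolder named after the zip are all assumptions for illustration; the thread does not say how the files inside Xweb1.zip are laid out or where the server expects them.

```python
import zipfile
from pathlib import Path


def extract_web1(zip_path, web_root):
    """Unpack a downloaded Xweb1.zip under web_root and return the target folder.

    Hypothetical helper: the target folder is named after the zip file
    (e.g. <web_root>/mwweb1), which is an assumption, not a documented layout.
    """
    zip_path = Path(zip_path)
    target = Path(web_root) / zip_path.stem
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target)
    return target


# Example call (paths are placeholders, not taken from the thread):
# extract_web1("mwweb1.zip", "/var/www/html")
```

The unpacked folder would then be served by whatever php web server is available; whether the displays run unchanged in a given server setup is not confirmed here.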
If you want to do maintenance (corrections, alterations of the xml structure, etc.), you need both X_orig and X_pywork.
Since nobody much but me has experimented with this, there are probably a few rough edges.
I think backups are finished.
@funderburkjim In my opinion, a list of the files inside the zip files, and what they do, would be more important than the names of the zip files. It is time you wrote a short note (at least one line) about each of the code files you back up there.
@gasyoun Re: 'It seems out of the three mentioned only for Kale we do not have an .xml file of the text itself. I'm working on Apte's Sanskrit Syntax'
There is no digitization for either Kale or Westergaard.
There is a digitization of Whitney, which Peter and Malcolm and assistants developed; it does not reside in any of the Cologne backups, and is not used in the Whitney display mentioned.
There IS a display of Whitney's Roots on Peter's site (sanskritlibrary.org) under Reference, which is based on the digitization.
Re: 'I'm working on Apte's Sanskrit Syntax'
Is there a link to the scans? What is the text? Are you working on a digitization?
renamed BENFEY.pdf to ben.pdf.
renamed burnouf.pdf to bur.pdf
removed apte.pdf, and uploaded ap1_bookmark.pdf and ap2, ap3.
This continues the discussion begun here.
The scanned images for each of the digitized dictionaries have now been backed up to S3.
Total # of bytes = 9441762881 (9+GB) Total # of files = 47226 (including 36 directory files).
This took about an hour, due to fantastic bandwidth at both Cologne and AWS S3.
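For scale, the reported figures can be converted into an approximate transfer rate. This is only a back-of-the-envelope sketch: "about an hour" is taken as exactly 3600 seconds, which is an assumption, so the rates are rough.

```python
total_bytes = 9_441_762_881   # byte count reported in the thread
elapsed_seconds = 3600        # assumption: "about an hour" treated as exactly one hour

size_gb = total_bytes / 10**9                        # decimal gigabytes
size_gib = total_bytes / 2**30                       # binary gibibytes
rate_mb_s = total_bytes / elapsed_seconds / 10**6    # megabytes per second
rate_mbit_s = rate_mb_s * 8                          # megabits per second

print(f"{size_gb:.2f} GB ({size_gib:.2f} GiB)")
print(f"{rate_mb_s:.2f} MB/s, about {rate_mbit_s:.0f} Mbit/s")
```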
I'll make a guide to the file-name patterns tomorrow.
I may also back up the ancillary scans (Whitney Roots, Kale Grammar, Westergaard Roots) to S3.