Closed funderburkjim closed 3 years ago
Since the scanned images take a lot of disk space, let's get some statistics on the current actual disk space at Cologne devoted to scanned images.
Here is a listing of the 34 publicly available dictionaries, along with the space taken up by the scanned images used in the displays, and the number of image files.
acc 110MB 1216
ae 146MB 518
ap90 334MB 1211
ben 66MB 1127
bhs 73MB 634
bop 29MB 421
bor 67MB 808
bur 94MB 394
cae 89MB 677
ccs 141MB 541
gra 110MB 893
gst 39MB 334
ieg 22MB 580
inm 161MB 852
krm 536MB 1489
mci 399MB 1024
md 141MB 395
mw 488MB 1370
mw72 184MB 1212
mwe 331MB 860
pe 52MB 929
pgn 19MB 420
pui 83MB 2232
pw 612MB 2141
pwg 1546MB 4737
sch 62MB 406
shs 602MB 842
skd 503MB 3164
snp 29MB 135
stc 90MB 904
vcp 543MB 5447
vei 73MB 1155
wil 336MB 988
yat 91MB 928
TOT 8217MB 40984
So, all in all there is about 8.2GB of space used and about 41,000 individual scanned images.
The command du -s <directory-name> was also used as a check. The two size estimates were the same or almost the same for each dictionary.
Github has this to say about repository size limits (reference):
File and repository size limitations We recommend repositories be kept under 1GB each. Repositories have a hard limit of 100GB. If you reach 75GB you'll receive a warning from Git in your terminal when you push. This limit is easy to stay within if large files are kept out of the repository. If your repository exceeds 1GB, you might receive a polite email from GitHub Support requesting that you reduce the size of the repository to bring it back down. In addition, we place a strict limit of files exceeding 100 MB in size. For more information, see "Working with large files."
None of the image files exceeds 100MB; average size is 0.2MB per image (8217MB / 40984 files). So we're ok per file.
Based on the 'hard limit', all of the images could be kept in one repository (8.2GB < 100GB).
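For anyone who wants to reproduce these figures on their own copy of the images, here is a minimal sketch. It is not the script used to produce the listing above; the function names are made up for illustration, and it sums file sizes in bytes, so its totals can differ slightly from what du -s reports in disk blocks.

```python
import os

MB = 1024 * 1024
PER_FILE_LIMIT_MB = 100  # per-file limit from the GitHub text quoted above


def scan_stats(directory):
    """Walk a directory tree and return (total_bytes, file_count, largest_bytes)."""
    total = count = largest = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            size = os.path.getsize(os.path.join(root, name))
            total += size
            count += 1
            largest = max(largest, size)
    return total, count, largest


def summarize(code, directory):
    """Format one line in the style of the listing: code, size in MB, file count."""
    total, count, largest = scan_stats(directory)
    # Flag the dictionary if any single file is above the per-file limit.
    flag = "" if largest <= PER_FILE_LIMIT_MB * MB else "  (has a file over the limit)"
    return f"{code} {round(total / MB)}MB {count}{flag}"
```

Usage would be something like print(summarize("acc", "/path/to/acc/images")), where the path is a placeholder for wherever the scans for that dictionary live.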
I think there should be one repository for the images for each dictionary. This would allow user flexibility in installation; each user could clone the repositories of just those dictionaries of interest. [Some suggested installation instructions will be provided in comments below.]
If the images for each dictionary were kept in a separate repository, then there would be 34 new repositories, and all but 1 (pwg) would take less space than the 1GB Github recommended repository maximum size.
The repository names could be 'scans-xxx', where 'xxx' is one of the 34 (lower-case) dictionary abbreviations; so 'scans-acc', 'scans-mw', 'scans-ap90', etc.
It might be simplest to add the 34 new repositories to the sanskrit-lexicon Github organization. Currently there are 36 repositories in sanskrit-lexicon, so there would be 70 repositories after adding the 34 new ones. Since Github imposes no limits on the number of repositories per organization (ref: About repositories), there would be no problem in having 70 or more repositories under sanskrit-lexicon.
The naming convention 'scans-xxx' would allow easy filtering of the image repositories from among all the sanskrit-lexicon repositories.
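To make the proposed convention concrete, here is a small sketch that builds the 34 'scans-xxx' names from the abbreviations in the listing and filters them out of a mixed list of repository names. The helper names are hypothetical, chosen only for this illustration.

```python
# The 34 dictionary abbreviations from the listing above.
DICTIONARIES = [
    "acc", "ae", "ap90", "ben", "bhs", "bop", "bor", "bur", "cae", "ccs",
    "gra", "gst", "ieg", "inm", "krm", "mci", "md", "mw", "mw72", "mwe",
    "pe", "pgn", "pui", "pw", "pwg", "sch", "shs", "skd", "snp", "stc",
    "vcp", "vei", "wil", "yat",
]

PREFIX = "scans-"


def scan_repo_name(code):
    """Return the proposed image repository name for a dictionary code."""
    return PREFIX + code.lower()


def image_repos(repo_names):
    """Keep only the names that follow the 'scans-xxx' convention."""
    return [name for name in repo_names if name.startswith(PREFIX)]


if __name__ == "__main__":
    mixed = ["csl-pywork", scan_repo_name("mw"), "csl-orig", scan_repo_name("ap90")]
    print(image_repos(mixed))
```

The prefix test is all that the filtering relies on, which is why a shared prefix is convenient.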
Please provide feedback regarding the above suggestion!
In the meantime, I'll set up procedures along the lines indicated above, using one or two dictionaries for the prototyping.
The images are not going to change frequently.
So instead of one repository per dictionary, I propose only one repository for all dictionaries. We can keep .zip / .tar / .tar.gz files such as acc.zip, ae.zip, etc. in the repository.
The installation instructions may give a prompt: "Do you want to download dictionary page images for local use? It will take roughly XXX MB of download and YYY MB of disk space."
If the user says no, we don't download the images. If the user says yes, we download them.
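The prompt could be sketched roughly as follows. This is only an illustration of the idea, not part of any existing installer; the function name and its parameters are assumptions, and the sizes passed in stand for the XXX and YYY figures.

```python
def confirm_image_download(dict_code, download_mb, disk_mb, ask=input):
    """Ask whether to download page images; return True only on an explicit yes.

    `ask` defaults to the built-in input() and can be replaced for testing.
    """
    question = (
        f"Do you want to download page images for '{dict_code}' for local use? "
        f"It will take roughly {download_mb} MB of download "
        f"and {disk_mb} MB of disk space. [y/N] "
    )
    answer = ask(question).strip().lower()
    return answer in ("y", "yes")


if __name__ == "__main__":
    # Example figures only: acc is listed above at about 110MB.
    if confirm_image_download("acc", 110, 110):
        print("Downloading images...")
    else:
        print("Skipping images.")
```

Defaulting to "no" on an empty answer keeps an unattended installation from starting a large download by accident.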
I agree images will almost never change.
My experience with zip is that images compress very little.
If there is one repository for all dictionaries, then cloning that repository will require a user to download 8GB. 8GB is a lot! It's roughly equivalent to 4 copies of Windows 10 or Mac OS X. Download would take several hours on lower bandwidth connections.
If the user only wants the images of, say, MW dictionary, then he would have to download an additional 7.5GB of unneeded stuff just to get the 500MB of images that he wants.
If the user actually wants the images for all dictionaries, it will still take a long time; about the same amount of time/space whether the images are in one repository or 34 repositories.
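The arithmetic behind this comparison can be written out as a short sketch using the figures from the listing at the top of this issue. Only a few dictionaries are included here, and the function name is made up for illustration.

```python
TOTAL_MB = 8217  # TOT line of the listing

# A few per-dictionary sizes copied from the listing (in MB).
SIZES_MB = {"acc": 110, "ae": 146, "mw": 488, "pwg": 1546}


def unneeded_mb(wanted):
    """MB downloaded but not wanted when everything lives in one repository."""
    return TOTAL_MB - sum(SIZES_MB[code] for code in wanted)


if __name__ == "__main__":
    extra = unneeded_mb(["mw"])
    print(f"mw only: {SIZES_MB['mw']}MB wanted, {extra}MB unneeded")
```

For MW alone this gives 7729MB of unneeded download, which is in line with the rough 7.5GB estimate.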
What are the downsides of separate repositories?
There is no downside to separate repositories, except that there would be too many repositories. If we are OK with that, we can keep the images in separate repos.
To deal with the 'too many repositories' issue, we could put all the image repositories in another Github organization. As an experiment to this end, I've made a 'sanskrit-lexicon-scans' organization. Currently it is owned by me (@funderburkjim). (How can ownership be transferred or shared?) @drdhaval2785 , @gasyoun , and @YevgenJohn have been invited to be on the 'team' of the new organization.
Am currently working to automate the process of initializing sanskrit-lexicon-scans/xxx repositories.
This repository now exists, and is populated with the images.
Also request feedback on the choice of the sanskrit-lexicon-scans organization.
Also populated.
Fantastic!! Please let me look through it, and I will see whether I can do some of the items listed as next steps. I might try csl-websanlexicon as well, to see if I understand it well enough to make working changes. This is really important for a local VM to be self-sufficient, in case it works offline or in case of a DR situation with the main server, so a user can still refer to the scanned pages to make sure the digitized version is in sync with them.
The change is very brief. Just in dictinfo.php.
Here's how to see the change in action.
I'm assuming you already have a local machine or a server set up with the acc or ae dictionary installed (these are currently the only ones with scans on Github). So you have 'cologne/acc', 'cologne/ae', 'cologne/csl-websanlexicon', 'cologne/csl-pywork', 'cologne/csl-orig'.
Install the new code at least for xxx=acc and ae.
Same steps as under 'Before updating to local images' above. But now, for example, src="http://localhost/cologne/scans/ae/pdfpages/ae-119.pdf" which proves you are using local scanned images.
If you were to use your local copy of mw, it would still show images from cologne, since there are no local images yet for mw. (i.e., scans/mw/pdfpages is not there).
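The behaviour described here (use local images when scans/xxx/pdfpages is present, otherwise fall back to Cologne) can be sketched as below. This is an illustration in Python, not the actual code in dictinfo.php; the function name and parameters are assumptions, and the Cologne URL is passed in because its form differs per dictionary.

```python
import os


def pdfpages_base_url(dict_code, scans_dir, local_url_root, remote_base_url):
    """Return the base URL for page images of one dictionary.

    scans_dir       -- local directory holding the per-dictionary scan folders
    local_url_root  -- URL under which scans_dir is served locally
    remote_base_url -- Cologne URL to use when no local images exist
    """
    local_dir = os.path.join(scans_dir, dict_code, "pdfpages")
    if os.path.isdir(local_dir):
        # Local images are installed for this dictionary.
        return f"{local_url_root}/{dict_code}/pdfpages"
    # No local images yet: keep pointing at the Cologne server.
    return remote_base_url
```

With local_url_root set to "http://localhost/cologne/scans", a dictionary with an installed pdfpages directory resolves to a localhost URL, while one without it keeps its Cologne URL.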
csl-apidev is another piece that can be run locally. We haven't discussed it yet.
In order for local installations to use local images, it needs a modification similar to that of csl-websanlexicon.
I'll open an issue related to this.
test that local images are being used for acc, ae
Same steps as under 'Before updating to local images' above. But now, for example, src="http://localhost/cologne/scans/ae/pdfpages/ae-119.pdf" which proves you are using local scanned images.
Great! It works on my local VM: <embed id="plugin" type="application/x-google-chrome-pdf" src="http://localhost/cologne/scans/ae/pdfpages/ae-119.pdf" stream-url="chrome-
If you were to use your local copy of mw, it would still show images from cologne, since there are no local images yet for mw. (i.e., scans/mw/pdfpages is not there).
I searched for 'karma' in MW, it points to: <embed id="plugin" type="application/x-google-chrome-pdf" src="http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/MWScanpdf/mw0258-kartRguptaka.pdf" stream-url="chrome-
Do we plan to have them all in Git so they can be pulled locally? Thank you!
Do we plan to have them all in Git ?
Yes. I wanted to get some feedback on the wording of the readme and the choice of license before installing github repositories in sanskrit-lexicon-scans for all the dictionaries.
@funderburkjim
Regarding licence, I prefer GPLv3.
Readme should give installation instructions for local images.
@funderburkjim
Regarding licence, I prefer GPLv3.
Makes sense to me.
There are several comparisons between these two licenses.
From this comparison,
GPL v3 and BY-SA 4.0 are similar licenses with similar aims. But because GPLv3 was written specifically for licensing software, it does have some differences from BY-SA ...
The main reasons I suggested the CC-BY-SA license for these scanned image repositories:
Given the above, I still have a slight preference for CC-BY-SA license for these repositories. Currently our software repositories (e.g. csl-pywork and a couple of others) do not have a license; if we add a license, GPLv3 might be a good choice. Another option would be MIT license.
@drdhaval2785 and @gasyoun : In light of these comments, do you have any further thoughts on the choice of license? Do you have a strong preference for the GPLv3 license for these scanned image repositories?
Based on your comments, I am OK with CC for images and GPLv3 for csl-pywork, apidev and websanlexicon
I asked Thomas Malten and Peter Scharf their opinion regarding license.
Thomas is fine with CC BY-SA.
Peter prefers CC BY-NC-SA. His reason:
...the scanning work was included under grants from the NEH and DFG, the license should include non-commercial as well. ... Otherwise, if someone collects money from the use of the images, these granting institutions may take offence if they don’t get a cut. The same is my feeling about the text and XML.
Here is a link to cc by-nc-sa
Here is an excerpt from https://wiki.creativecommons.org/wiki/NonCommercial_interpretation; the sentence marked off by double-asterisks is part of what I think Peter has in mind.
Like all CC licenses, the NC licenses are non-exclusive. This means that an NC licensor is free to offer the material under other terms, including on commercial terms. A frequently discussed use case for the NC licenses is a creator who wishes to allow NonCommercial use but also authorizes commercial uses in exchange for payment. (Additional permissions such as this may always be offered; licensors may also use our CC+ protocol to offer these in a standardized manner.) Also, licensees are always free to contact licensors to ask permission to use the work for commercial purposes.
My own opinion is that it doesn't matter much. I'm fine to go with cc by-nc-sa.
What do others think?
CC BY-NC-SA is fine to me too.
Thomas also concurs with NC.
I will revise the acc and ae licenses to BY-NC-SA, and then continue with the installation of the rest of the images.
The other thing that needs to be done (@drdhaval2785 requested above) is installation instructions (i.e. how to use the scanned images in a local installation).
I'll make a 'sanskrit-lexicon-scans/documentation' repository, and make a link in the README.md for each dictionary to the README.md in the documentation repository.
All repositories sanskrit-lexicon-scans/xxx have now been populated with the images.
sanskrit-lexicon-scans/documentation/README.md exists, but is currently incomplete.
Maybe someone else could work on this README.md. If needed, I'll provide some content next week.
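As a possible starting point for that README, here is a sketch that builds the git clone command for a chosen dictionary. It assumes the standard GitHub URL form for sanskrit-lexicon-scans/xxx and assumes that cloning into scans/xxx yields the scans/xxx/pdfpages layout mentioned earlier; neither the target layout nor the helper name is confirmed here. The sketch only builds the command and does not run it.

```python
ORG_URL = "https://github.com/sanskrit-lexicon-scans"


def clone_command(dict_code, scans_dir="scans"):
    """Build (but do not run) the git clone command for one dictionary's images."""
    url = f"{ORG_URL}/{dict_code}.git"
    target = f"{scans_dir}/{dict_code}"
    return ["git", "clone", url, target]


if __name__ == "__main__":
    for code in ["acc", "ae"]:
        # Print the command so it can be reviewed before running it by hand.
        print(" ".join(clone_command(code)))
```

A list of arguments is returned rather than a single string so that it could later be handed to subprocess.run without shell quoting issues.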
I thought it sounded friendly to be a Free Culture License, so dropped the NC.
Exactly.
Currently our software repositories (e.g. csl-pywork and a couple of others) do not have a license
Time to add.
Do you have a strong preference for the GPLv3 license for these scanned image repositories?
No, no strong preferences. MIT is good as well.
Based on your comments, I am OK with CC for images and GPLv3 for csl-pywork, apidev and websanlexicon
So am I.
sanskrit-lexicon-scans/xxx
It's owned only by you, Jim, right? I ask because I'm thinking about a case of emergency.
Maybe someone else could work on this README.md.
@YevgenJohn give it a try?
It's owned only by you, Jim, right?
I think I 'invited' @drdhaval2785 , @YevgenJohn , and you (@gasyoun ) to the 'team' for the 'sanskrit-lexicon-scans' organization. Did you receive invitation?
Although I created the organization, my intent was to have it jointly 'owned' by all 4.
Do I need to do something in settings regarding ownership, so that I am not the only 'owner'?
Scanned images for all dictionaries
All repositories sanskrit-lexicon-scans/xxx have now been populated with the images.
Thank you very much! I'm trying to make a standalone VM with images, disconnect its network interfaces and see if links to the pictures work (as it won't be able to reach out to Cologne server).
Apologies for not contributing to the licenses discussion, as I don't know that subject well enough. Thank you!
Did you receive invitation?
I only see it now, by accident. Others are here by now.
Do I need to do something in settings regarding ownership, so that I am not the only 'owner'?
Yes, for each person you set them to be a non-member, but owner.
How do I change @drdhaval2785 (and others) from Member to Owner?
make a standalone VM
@YevgenJohn Why don't you start an issue regarding this standalone VM. It would be interesting to better understand what is meant by a standalone VM, and how it would be used.
Absolutely, very good idea! I wonder how much space the VM image would take with all scanned pages uploaded. I just added another disk to the VM to accommodate it. My goal is to provide a ready product linguists can plug in and use (when offline, or if they want to run a heavy query which would otherwise slow the shared server down, so we can remove the upper limit on the number of results), as asking them to run Linux commands to set it up locally seems a bit impractical to me. Thank you!
Local scanned images have stabilized. Closing the issue.
This issue is to deal with an enhancement to the local dictionary installation process (as described in the readme.md at csl-pywork/v02). The feature regards installation of local copies of the scanned images for each dictionary; this feature was mentioned in #6 comments.