local scanned images - Githubissues

funderburkjim commented 4 years ago

This issue deals with an enhancement to the local dictionary installation process (as described in the readme.md at csl-pywork/v02). The feature concerns installation of local copies of the scanned images for each dictionary; this feature was mentioned in the #6 comments.

funderburkjim commented 4 years ago

size estimations

Since the scanned images take a lot of disk space, let's get some statistics on the current actual disk space at Cologne devoted to scanned images.

Here is a listing of the 34 publicly available dictionaries, along with the space taken up by the scanned images used in the displays, and the number of image files.

acc     110MB   1216 
ae      146MB    518 
ap90    334MB   1211 
ben      66MB   1127 
bhs      73MB    634 
bop      29MB    421 
bor      67MB    808 
bur      94MB    394 
cae      89MB    677 
ccs     141MB    541 
gra     110MB    893 
gst      39MB    334 
ieg      22MB    580 
inm     161MB    852 
krm     536MB   1489 
mci     399MB   1024 
md      141MB    395 
mw      488MB   1370 
mw72    184MB   1212 
mwe     331MB    860 
pe       52MB    929 
pgn      19MB    420 
pui      83MB   2232 
pw      612MB   2141 
pwg    1546MB   4737 
sch      62MB    406 
shs     602MB    842 
skd     503MB   3164 
snp      29MB    135 
stc      90MB    904 
vcp     543MB   5447 
vei      73MB   1155 
wil     336MB    988 
yat      91MB    928 
TOT    8217MB  40984

So, all in all there is about 8.2GB of space used and about 41,000 individual scanned images.

Notes

This listing was made by program size_pdfpages.py in scans/awork/misc/misc folder on Cologne server.
The Cologne directories were taken from the _cologne_pdfpages_url method in csl-websanlexicon/v02/makotemplates/web/webtc/dictinfo.php
For each dictionary's directory, only the files with an image suffix ('.pdf','jpg','png') were considered.
The size of each image was obtained via the Python standard library function os.path.getsize
Another method, using the bash du -s <directory-name>, was also used as a check. The two size estimations were the same or almost the same for each dictionary.
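
For readers who want to repeat the measurement on their own copy, here is a minimal sketch of the kind of tally described in these notes. It is not the actual size_pdfpages.py; the function names, the non-recursive directory listing, the case-insensitive suffix match, and the choice of 1MB = 1024*1024 bytes are all assumptions made for illustration.

```python
import os

# Suffixes treated as scanned-image files (pdf, jpg, png as in the notes above).
IMAGE_SUFFIXES = (".pdf", ".jpg", ".png")


def image_stats(directory):
    """Return (total_bytes, file_count) for image files directly inside directory."""
    total_bytes = 0
    file_count = 0
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # Assumption: only regular files at the top level are counted.
        if os.path.isfile(path) and name.lower().endswith(IMAGE_SUFFIXES):
            total_bytes += os.path.getsize(path)
            file_count += 1
    return total_bytes, file_count


def format_row(code, total_bytes, file_count):
    """Format one row like the listing above; whole MB, assuming 1MB = 1024*1024 bytes."""
    return "%-6s %5dMB %6d" % (code, total_bytes // (1024 * 1024), file_count)
```

Calling image_stats on one dictionary's pdfpages directory and passing the result to format_row gives one row of the listing.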

funderburkjim commented 4 years ago

Github is a viable location for keeping the images

Github has this to say about repository size limits (reference):

File and repository size limitations We recommend repositories be kept under 1GB each. Repositories have a hard limit of 100GB. If you reach 75GB you'll receive a warning from Git in your terminal when you push. This limit is easy to stay within if large files are kept out of the repository. If your repository exceeds 1GB, you might receive a polite email from GitHub Support requesting that you reduce the size of the repository to bring it back down. In addition, we place a strict limit of files exceeding 100 MB in size. For more information, see "Working with large files."

None of the image files exceeds 100MB; the average size is 0.2MB per image (8217MB / 40984 files). So we're ok per file.

Based on the 'hard limit', all of the images could be kept in one repository (8.2GB < 100GB).
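
The arithmetic behind these two checks can be written out as a small sketch. The totals are the ones from the listing above; the variable names are illustrative, and 1GB is taken as 1000MB, which is an assumption.

```python
TOTAL_MB = 8217              # total size of all scanned images, from the listing
TOTAL_FILES = 40984          # total number of image files, from the listing
HARD_LIMIT_MB = 100 * 1000   # 100GB hard limit per repository (assuming 1GB = 1000MB)

# Average image size in MB.
average_mb = TOTAL_MB / TOTAL_FILES

# True when every image together stays under the per-repository hard limit.
fits_in_one_repo = TOTAL_MB < HARD_LIMIT_MB

print("average image size: %.2f MB" % average_mb)
print("fits in one repository:", fits_in_one_repo)
```

Note that the average only summarizes the sizes; the statement that no single file exceeds 100MB comes from inspecting the files themselves.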

funderburkjim commented 4 years ago

Proposed 34 repository solution

I think there should be one repository for the images for each dictionary. This would allow user flexibility in installation; each user could clone the repositories of just those dictionaries of interest. [Some suggested installation instructions will be provided in comments below.]

If the images for each dictionary were kept in a separate repository, then there would be 34 new repositories, and all but 1 (pwg) would take less space than the 1GB Github recommended repository maximum size.
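
As a check of the "all but 1" claim, here is a short sketch that compares each dictionary's size from the listing above against the recommended limit. The sizes are copied from the listing; the names are illustrative, and the 1GB limit is taken as 1000MB, which is an assumption.

```python
# Size in MB of the scanned images for each dictionary, copied from the listing.
SIZES_MB = {
    "acc": 110, "ae": 146, "ap90": 334, "ben": 66, "bhs": 73, "bop": 29,
    "bor": 67, "bur": 94, "cae": 89, "ccs": 141, "gra": 110, "gst": 39,
    "ieg": 22, "inm": 161, "krm": 536, "mci": 399, "md": 141, "mw": 488,
    "mw72": 184, "mwe": 331, "pe": 52, "pgn": 19, "pui": 83, "pw": 612,
    "pwg": 1546, "sch": 62, "shs": 602, "skd": 503, "snp": 29, "stc": 90,
    "vcp": 543, "vei": 73, "wil": 336, "yat": 91,
}

RECOMMENDED_LIMIT_MB = 1000  # Github's recommended 1GB (assuming 1GB = 1000MB)

# Dictionaries whose images alone would exceed the recommended repository size.
over_limit = sorted(code for code, mb in SIZES_MB.items() if mb > RECOMMENDED_LIMIT_MB)

print("repositories needed:", len(SIZES_MB))
print("over the recommended size:", over_limit)
```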

Proposed repository naming convention

The repository names could be 'scans-xxx', where 'xxx' is one of the 34 (lower-case) dictionary abbreviations; so 'scans-acc', 'scans-mw', 'scans-ap90', etc.

Proposed github project to contain the scan repositories

It might be simplest to add the 34 new repositories to the sanskrit-lexicon Github organization. Currently there are 36 repositories in sanskrit-lexicon, so there would be 70 repositories after adding the 34 new ones. Since Github imposes no limits on the number of repositories per organization (ref: About repositories), there would be no problem in having 70 or more repositories under sanskrit-lexicon.

The naming convention 'scans-xxx' would allow easy filtering of the image repositories from among all the sanskrit-lexicon repositories.

funderburkjim commented 4 years ago

request feedback

Please provide feedback regarding the above suggestion!

In the meantime, I'll set up procedures along the lines indicated above, using one or two dictionaries for the prototyping.

drdhaval2785 commented 4 years ago

The images are not going to change frequently.

So instead of one repository per dictionary, I propose only one repository for all dictionaries. We can keep one .zip / .tar / .tar.gz file per dictionary (acc.zip, ae.zip, etc.) in the repository.

The installation instruction may give a prompt "Do you want to download dictionary page images for local use? It will take roughly XXX MB of download and YYY MB of disc space."

If user says no, we don't download images. If he says yes, we download images.
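
A minimal sketch of such a prompt follows, assuming a command-line installer written in Python. The function names are hypothetical, and the size figures would have to be supplied by the caller; nothing here is taken from an existing installation script.

```python
def build_prompt(download_mb, disk_mb):
    """Return the question shown to the user, with the size estimates filled in."""
    return ("Do you want to download dictionary page images for local use? "
            "It will take roughly %d MB of download and %d MB of disc space. [y/N] "
            % (download_mb, disk_mb))


def confirm_download(download_mb, disk_mb, ask=input):
    """Ask the user and return True only on an explicit yes.

    `ask` defaults to the built-in input(); it is a parameter so the
    function can be exercised without a terminal.
    """
    answer = ask(build_prompt(download_mb, disk_mb))
    return answer.strip().lower() in ("y", "yes")
```

An installer would call confirm_download with the estimates for the chosen dictionary and fetch the images only when it returns True.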

funderburkjim commented 4 years ago

I agree images will almost never change.

My experience with zip is that images compress very little.

If there is one repository for all dictionaries, then cloning that repository will require a user to download 8GB. 8GB is a lot! It's roughly equivalent to 4 copies of Windows 10 or MAC-OSX. Download would take several hours on lower bandwidth connections.

If the user only wants the images of, say, MW dictionary, then he would have to download an additional 7.5GB of unneeded stuff just to get the 500MB of images that he wants.

If user actually wants the images for all dictionaries, it will still take a long time -- about the same amount of time/space whether the images are in one repository or 34 repositories.

What are the downsides of separate repositories?

drdhaval2785 commented 4 years ago

There is no downside to separate repositories, except that there would be too many repositories. If we are OK with that, we can keep the images in separate repos.

funderburkjim commented 4 years ago

sanskrit-lexicon-scans organization

To deal with the 'too many repositories' issue, we could put all the image repositories in another Github organization. As an experiment to this end, I've made a 'sanskrit-lexicon-scans' organization. Currently it is owned by me (@funderburkjim). (How can ownership be transferred or shared?) @drdhaval2785 , @gasyoun , and @YevgenJohn have been invited to be on the 'team' of the new organization.

Am currently working to automate process of initializing sanskrit-lexicon-scans/xxx repositories.

funderburkjim commented 4 years ago

sanskrit-lexicon-scans/acc

This repository now exists, and is populated with the images.

There is a README.md file and a LICENSE file
- Request feedback on the wording of the readme, and the choice of license.
The images are in the 'pdfpages' directory.
It has only one branch, gh-pages
- This means that the images could be served from here in a web application. For example, page 1 of acc

Also request feedback on the choice of sanskrit-lexicon-scans organization

sanskrit-lexicon-scans/ae

Also populated.

Next steps

create scripts (in csl-pywork ?) for user downloads of scans
put the local images for dictionary xxx in cologne/scans/xxx/pdfpages
modify code in csl-websanlexicon so local web-app displays will know where images reside locally

YevgenJohn commented 4 years ago

Fantastic!! Please let me look through it, and I will see whether I can do some of the listed next steps. I might try csl-websanlexicon as well, to see if I understand it well enough to make working changes. This is really important for a local VM to be self-sufficient, in case it works offline or in case of a DR situation on the main server, so that a user can still refer to the scanned pages to make sure the digitized version is in sync with them.

funderburkjim commented 4 years ago

csl-websanlexicon modified

The change is very brief. Just in dictinfo.php.

Here's how to see the change in action.

I'm assuming you already have a local machine or a server set up and populated with the acc or ae dictionary installed (these are currently the only ones with scans on Github). So you have 'cologne/acc', 'cologne/ae', 'cologne/csl-websanlexicon', 'cologne/csl-pywork', 'cologne/csl-orig'.

Before updating to local images

Bring up in browser your local copy of one of the 'ae' displays (say the basic display)
Look up a word (say 'dog')
click on the page link (p=119 for 'dog')
You'll get a new tab with the image.
Inspect the image (In chrome, right-click Inspect)
You'll see src="https://www.sanskrit-lexicon.uni-koeln.de/scans/AEScan/2014/web/pdfpages/ae-119.pdf" , which proves the image is from Cologne server.

update local csl-websanlexicon

change to local version, (cd ... csl-websanlexicon)
git pull origin master

regenerate cologne/xxx/web

Install the new code at least for xxx=acc and ae.

change directory to csl-websanlexicon/v02
if you have local copies of all dictionaries:
- sh redo_xampp_all.sh

set up for local scanned images

change to cologne directory
mkdir scans

get local images for acc and ae

change to cologne/scans
git clone https://github.com/sanskrit-lexicon-scans/acc.git
git clone https://github.com/sanskrit-lexicon-scans/ae.git
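
The two clone commands above could be generalized along these lines once more dictionaries have scan repositories. This is only a sketch: the function names are hypothetical, and it assumes git is on the PATH and that every repository follows the sanskrit-lexicon-scans/xxx pattern shown above.

```python
import os
import subprocess

SCANS_ORG_URL = "https://github.com/sanskrit-lexicon-scans"


def clone_command(code, scans_dir):
    """Build the git clone command that puts dictionary `code` in scans_dir/code."""
    url = "%s/%s.git" % (SCANS_ORG_URL, code)
    return ["git", "clone", url, os.path.join(scans_dir, code)]


def clone_scans(codes, scans_dir):
    """Clone each requested scan repository, skipping those already present."""
    os.makedirs(scans_dir, exist_ok=True)
    for code in codes:
        if os.path.isdir(os.path.join(scans_dir, code)):
            continue  # already cloned; leave it alone
        subprocess.run(clone_command(code, scans_dir), check=True)
```

For example, clone_scans(["acc", "ae"], "scans"), run from the cologne directory, would be equivalent to the steps above.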

test that local images are being used for acc, ae

Same steps as under 'Before updating to local images' above. But now, for example, src="http://localhost/cologne/scans/ae/pdfpages/ae-119.pdf" which proves you are using local scanned images.

If you were to use your local copy of mw, it would still show images from cologne, since there are no local images yet for mw. (i.e., scans/mw/pdfpages is not there).
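
The fallback behaviour described here can be summarized in a short sketch. The real change is in dictinfo.php (PHP); the Python below is only a paraphrase of the logic as described in this comment, and the function name, parameters, and the default local base URL are assumptions.

```python
import os


def scan_url(code, filename, cologne_dir, cologne_url,
             local_base="http://localhost/cologne"):
    """Return the URL of a scanned page for dictionary `code`.

    If cologne_dir/scans/<code>/pdfpages exists, the local copy is used;
    otherwise the page is served from the Cologne server directory
    `cologne_url`.
    """
    local_pdfpages = os.path.join(cologne_dir, "scans", code, "pdfpages")
    if os.path.isdir(local_pdfpages):
        return "%s/scans/%s/pdfpages/%s" % (local_base, code, filename)
    return "%s/%s" % (cologne_url.rstrip("/"), filename)
```

With local images for ae installed, scan_url("ae", "ae-119.pdf", ...) gives the localhost address shown above; for a dictionary without a local scans directory it gives the Cologne address.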

funderburkjim commented 4 years ago

csl-apidev needs similar modification

csl-apidev is another piece that can be run locally. We haven't discussed it yet.
In order for local installations to use local images, it needs a modification similar to that of csl-websanlexicon. I'll open an issue related to this.

YevgenJohn commented 4 years ago

test that local images are being used for acc, ae

Same steps as under 'Before updating to local images' above. But now, for example, src="http://localhost/cologne/scans/ae/pdfpages/ae-119.pdf" which proves you are using local scanned images.

Great! It works on my local VM: <embed id="plugin" type="application/x-google-chrome-pdf" src="http://localhost/cologne/scans/ae/pdfpages/ae-119.pdf" stream-url="chrome-

If you were to use your local copy of mw, it would still show images from cologne, since there are no local images yet for mw. (i.e., scans/mw/pdfpages is not there).

I searched for 'karma' in MW, it points to: <embed id="plugin" type="application/x-google-chrome-pdf" src="http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/MWScanpdf/mw0258-kartRguptaka.pdf" stream-url="chrome-

Do we plan to have them all in Git so they can be pulled locally? Thank you!

funderburkjim commented 4 years ago

Do we plan to have them all in Git ?

Yes. I wanted to get some feedback on the wording of the readme and the choice of license before installing github repositories in sanskrit-lexicon-scans for all the dictionaries.

drdhaval2785 commented 4 years ago

@funderburkjim

Regarding licence, I prefer GPLv3.

drdhaval2785 commented 4 years ago

Readme should give installation instructions for local images.

gasyoun commented 4 years ago

@funderburkjim

Regarding licence, I prefer GPLv3.

Makes sense to me.

funderburkjim commented 4 years ago

comparison between gplv3 and cc-by-sa.

There are several comparisons between these two licenses.

From this comparison,

GPL v3 and BY-SA 4.0 are similar licenses with similar aims. But because GPLv3 was written specifically for licensing software, it does have some differences from BY-SA ...

The main reasons I suggested the CC-BY-SA license for these scanned image repositories:

The content is not software, but data. CC-BY-SA seems more commonly used for data. For example, CC-BY-SA is said to be commonly used for Wikisource (ref).
The license for the Cologne digitizations is CC-NC-BY-SA 3.0. See the license for MW as an example.
- Note. NC=non-commercial, is used in the digitization licenses
- I dropped the 'NC' clause for these scan repositories on purpose, after experimenting with the Creative Commons License Picker. Answering 'no' to 'Allow commercial uses of your work?' changes 'This is a Free Culture License' to 'This is not a Free Culture License'. I thought it sounded friendly to be a Free Culture License, so dropped the NC.

Given the above, I still have a slight preference for CC-BY-SA license for these repositories. Currently our software repositories (e.g. csl-pywork and a couple of others) do not have a license; if we add a license, GPLv3 might be a good choice. Another option would be MIT license.

@drdhaval2785 and @gasyoun : In light of these comments, do you have any further thoughts on the choice of license? Do you have a strong preference for the GPLv3 license for these scanned image repositories?

drdhaval2785 commented 4 years ago

Based on your comments, I am OK with CC for images and GPLv3 for csl-pywork, apidev and websanlexicon.

funderburkjim commented 4 years ago

Peter's suggestion

I asked Thomas Malten and Peter Scharf their opinion regarding license.

Thomas is fine with CC BY-SA.

Peter prefers CC BY-NC-SA. His reason:

...the scanning work was included under grants from the NEH and DFG, the license should include non-commercial as well. ... Otherwise, if someone collects money from the use of the images, these granting institutions may take offence if they don’t get a cut. The same is my feeling about the text and XML.

Here is a link to cc by-nc-sa

Here is an excerpt from https://wiki.creativecommons.org/wiki/NonCommercial_interpretation; the sentence marked off by double-asterisks is part of what I think Peter has in mind.

Like all CC licenses, the NC licenses are non-exclusive. This means that an NC licensor is free to offer the material under other terms, including on commercial terms. A frequently discussed use case for the NC licenses is a creator who wishes to allow NonCommercial use but also authorizes commercial uses in exchange for payment. (Additional permissions such as this may always be offered; licensors may also use our CC+ protocol to offer these in a standardized manner.) Also, licensees are always free to contact licensors to ask permission to use the work for commercial purposes.

My own opinion is that it doesn't matter much. I'm fine to go with cc by-nc-sa.

What do others think?

drdhaval2785 commented 4 years ago

CC BY-NC-SA is fine to me too.

funderburkjim commented 4 years ago

Will proceed with the scanned image installations under CC BY-NC-SA

Thomas also concurs with NC.
I will revise the acc and ae licenses to BY-NC-SA, and then continue with the installation of the rest of the images.

The other thing that needs to be done (@drdhaval2785 requested above) is installation instructions (i.e. how to use the scanned images in a local installation).

I'll make a 'sanskrit-lexicon-scans/documentation' repository, and make a link in the README.MD for each dictionary to the README.md in the documentation repository.

funderburkjim commented 4 years ago

Scanned images for all dictionaries

All repositories sanskrit-lexicon-scans/xxx have now been populated with the images.

sanskrit-lexicon-scans/documentation/README.md exists, but is currently incomplete.

Maybe someone else could work on this README.md. If needed, I'll provide some content next week.

gasyoun commented 4 years ago

I thought it sounded friendly to be a Free Culture License, so dropped the NC.

Exactly.

Currently our software repositories (e.g. csl-pywork and a couple of others) do not have a license

Time to add.

Do you have a strong preference for the GPLv3 license for these scanned image repositories?

No, no strong preferences. MIT is good as well.

Based on your comments, I am OK with CC for images and GPLv3 for csl-pywork, apidev and websanlexicon

So am I.

sanskrit-lexicon-scans/xxx

It's owned only by you, Jim, right? I ask because I am thinking about a case of emergency.

Maybe someone else could work on this README.md.

@YevgenJohn give it a try?

funderburkjim commented 4 years ago

It's owned only by you, Jim, right?

I think I 'invited' @drdhaval2785 , @YevgenJohn , and you (@gasyoun ) to the 'team' for the 'sanskrit-lexicon-scans' organization. Did you receive invitation?

Although I created the organization, my intent was to have it jointly 'owned' by all 4.

Do I need to do something in settings regarding ownership, so that I am not the only 'owner'?

YevgenJohn commented 4 years ago

Scanned images for all dictionaries

All repositories sanskrit-lexicon-scans/xxx have now been populated with the images.

Thank you very much! I'm trying to make a standalone VM with images, disconnect its network interfaces and see if links to the pictures work (as it won't be able to reach out to Cologne server).

Apologies for not contributing to the licenses discussion, as I don't know that subject well enough. Thank you!

gasyoun commented 4 years ago

Did you receive invitation?

I only see it now, by accident. Others are here by now.

owner

Do I need to do something in settings regarding ownership, so that I am not the only 'owner'?

Yes, for each person you set them to be a non-member, but owner.

https://github.com/orgs/sanskrit-lexicon-scans/people here?

funderburkjim commented 4 years ago

How do I change @drdhaval2785 (and others) from Member to Owner?

gasyoun commented 4 years ago

from Member to Owner?

https://help.github.com/en/github/setting-up-and-managing-organizations-and-teams/changing-a-persons-role-to-owner

funderburkjim commented 4 years ago

make a standalone VM

@YevgenJohn Why don't you start an issue regarding this standalone VM? It would be interesting to better understand what is meant by a standalone VM, and how it would be used.

YevgenJohn commented 4 years ago

Absolutely, very good idea! I wonder how much space the VM image would take with all scanned pages uploaded. I just added another disk to the VM to accommodate it. My goal is to provide a ready product that linguists can plug in and use (when offline, or if they want to run a heavy query which would otherwise slow the shared server down, so we can remove the upper limit on the number of results), as asking them to run Linux commands to set it up locally seems a bit impractical to me. Thank you!

drdhaval2785 commented 2 years ago

Local scanned images have stabilized. Closing the issue.

sanskrit-lexicon / csl-pywork

local scanned images #10
