sanskrit-lexicon / csl-websanlexicon

Naming Conventions of the Scanned Images (Files) #11

Open funderburkjim opened 4 years ago

funderburkjim commented 4 years ago

This is in response to an inquiry from @vocabulista, who needs to rename all the scanned image files in a consistent, flat way.

First, the scanned images for each dictionary can be separated into two piles:

The following comments pertain only to the first pile.

funderburkjim commented 4 years ago

pdffiles.txt

When a display needs to present the scanned image of a particular page in a particular dictionary (such as page 346 of MW), it consults the 'pdffiles.txt' file for that dictionary. All of these files can be found in this csl-websanlexicon repository. For the mw example, the file is csl-websanlexicon/v02/distinctfiles/mw/web/webtc/pdffiles.txt.

This file has 3 colon-delimited fields.

servepdf.php

The getfiles function in servepdfClass.php, in the csl-apidev repository, shows how the identifier field in each line of pdffiles.txt is used to get the filename for the body pages.
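
As a rough illustration of that lookup (not a port of getfiles), here is a minimal Python sketch that splits each line of a pdffiles.txt into its three colon-delimited fields and builds an identifier-to-filename table. The field order (identifier first, file name second), the sample lines, and the names parse_pdffiles_line and build_lookup are all assumptions made for illustration; check an actual pdffiles.txt before relying on them.

```python
from typing import Dict, Iterable, Tuple


def parse_pdffiles_line(line: str) -> Tuple[str, str, str]:
    """Split one pdffiles.txt line into its three colon-delimited fields."""
    fields = line.rstrip("\r\n").split(":", 2)
    if len(fields) != 3:
        raise ValueError(f"expected 3 colon-delimited fields: {line!r}")
    return fields[0], fields[1], fields[2]


def build_lookup(lines: Iterable[str]) -> Dict[str, str]:
    """Map identifier -> file name.

    ASSUMPTION: field 1 is the identifier and field 2 is the file name.
    """
    lookup: Dict[str, str] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        identifier, filename, _third = parse_pdffiles_line(line)
        lookup[identifier] = filename
    return lookup


# Hypothetical sample lines; real entries may look different.
sample = ["0345:pg_0345.pdf:345\n", "0346:pg_0346.pdf:346\n"]
lookup = build_lookup(sample)
print(lookup["0346"])
```

In practice the lines would come from the dictionary's own pdffiles.txt, for example via open(path, encoding="utf-8").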

funderburkjim commented 4 years ago

Summary table

For each dictionary, this table shows:

dictionary #vp #p #other img-type
acc 1191 0 25 pdf
ae 0 501 0 pdf
ap90 0 1196 15 pdf
ben 0 1127 0 pdf
bhs 0 676 11 pdf
bop 0 407 0 pdf
bor 0 783 25 pdf
bur 0 414 16 pdf
cae 0 672 5 pdf
ccs 0 541 0 png
gra 0 888 5 pdf
gst 0 320 0 pdf
ieg 0 560 20 pdf
inm 0 787 65 pdf
krm 0 1489 0 pdf
mci 0 981 43 pdf
md 0 384 11 jpg
mw 0 1333 36 pdf
mw72 0 1186 26 pdf
mwe 0 860 0 pdf
pe 0 900 29 pdf
pgn 0 378 42 pdf
pui 2192 0 40 pdf
pw 0 1922 219 png
pwg 4737 0 0 pdf
sch 0 395 0 pdf
shs 0 839 3 pdf
skd 3164 0 0 pdf
snp 0 133 2 pdf
stc 0 893 0 pdf
vcp 0 5407 0 pdf
vei 1054 0 101 pdf
wil 0 982 6 jpg
yat 0 924 4 pdf
ALL 12338 27878 749

funderburkjim commented 4 years ago

page identifier analysis

The page part of the identifier (of body pages) is a digit sequence representing an integer.

In the case of mw, the first 36 lines of pdffiles.txt refer to non-body pages; ignoring those, there remain 1333 body pages, with the page numbering starting at 1 and continuing through 1333. So the body page identifiers are simple.

simple body page identifiers

Here are 20 dictionaries whose body page identifiers are similarly simple:

dictionary first last number
ae 1 501 501
ap90 1 1196 1196
ben 1 1127 1127
bop 1 407 407
bor 1 783 783
cae 1 672 672
ccs 1 541 541
gst 1 320 320
ieg 1 560 560
inm 1 787 787
krm 1 1489 1489
mci 1 981 981
md 1 384 384
mw 1 1333 1333
mw72 1 1186 1186
mwe 1 860 860
pe 1 900 900
pgn 1 378 378
shs 1 839 839
wil 1 982 982
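
The notion of 'simple' used in this table can be checked mechanically. The sketch below assumes the body-page identifiers have already been extracted from pdffiles.txt, in file order, as digit strings; the function name is_simple is hypothetical.

```python
def is_simple(page_ids):
    """Return True if the identifiers are exactly 1, 2, ..., N in order."""
    pages = [int(p) for p in page_ids]
    return pages == list(range(1, len(pages) + 1))


# Stand-in for the 1333 mw body-page identifiers (non-body lines removed).
mw_body = [str(n) for n in range(1, 1334)]
print(is_simple(mw_body))          # True
print(is_simple(["1", "2", "4"]))  # False: page 3 is absent
```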

funderburkjim commented 4 years ago

bhs is also 'simple'

After the changes to pdffiles.txt for bhs and bur mentioned in #12, bhs is now also simple and can be added to the table above:

bhs 1 623 623

funderburkjim commented 4 years ago

The other 'page' identifier dictionaries

6 other dictionaries are in one volume; the page sequencing can be summarized:

dictionary #scans first last increment missing
bur 378 4 758 2
gra 888 1 1775 2
sch 395 1 396 1 380
stc 893 1 894 1 580
vcp 5407 35 5441 1
yat 924 1 928 1 924,925,926,927
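
The columns of this table can be derived from a list of integer page identifiers. The following is a minimal sketch, not the program actually used for the analysis; it assumes the increment is the smallest gap between consecutive pages, and the name summarize is hypothetical. The inputs are reconstructed from the rows above rather than read from pdffiles.txt.

```python
def summarize(pages):
    """Summarize page identifiers: scans, first, last, increment, missing."""
    pages = sorted(set(pages))
    if len(pages) < 2:
        raise ValueError("need at least two pages to infer an increment")
    # ASSUMPTION: the regular step is the smallest gap between neighbours.
    increment = min(b - a for a, b in zip(pages, pages[1:]))
    present = set(pages)
    expected = range(pages[0], pages[-1] + 1, increment)
    missing = [p for p in expected if p not in present]
    return {
        "scans": len(pages),
        "first": pages[0],
        "last": pages[-1],
        "increment": increment,
        "missing": missing,
    }


# bur: every second page from 4 through 758.
bur = list(range(4, 759, 2))
# yat: pages 1 through 928 with 924-927 absent.
yat = [p for p in range(1, 929) if p not in (924, 925, 926, 927)]

print(summarize(bur))
print(summarize(yat))
```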

funderburkjim commented 4 years ago

The 'volume-page' identifier dictionaries

6 of the dictionaries have scanned images in multiple volumes, and the file identifiers are of the form v-p (volume and page separated by a '-' character). Within each volume, the pagination is simple (in the sense above). Note that the page increment is 2 for pwg (i.e., there are 2 pages per image). Each line below gives the number of scans, the first page, the last page, and the page increment.

acc  regular vol-page pattern in 3 volumes
   acc   part1: 795 1 795 1
   acc   part2: 237 1 237 1
   acc   part3: 159 1 159 1
pui  regular vol-page pattern in 3 volumes
   pui   part1: 660 1 660 1
   pui   part2: 746 1 746 1
   pui   part3: 786 1 786 1
pw page pattern in 7 parts
   pw   part1: 282 1001 1282 1
   pw   part2: 284 2001 2284 1
   pw   part3: 246 3001 3246 1
   pw   part4: 290 4001 4290 1
   pw   part5: 240 5001 5240 1
   pw   part6: 292 6001 6292 1
   pw   part7: 288 7001 7288 1
pwg  regular vol-page pattern in 7 volumes
   pwg   part1: 571 1 1141 2
   pwg   part2: 550 1 1099 2
   pwg   part3: 506 1 1011 2
   pwg   part4: 607 1 1213 2
   pwg   part5: 839 1 1677 2
   pwg   part6: 753 1 1505 2
   pwg   part7: 911 1 1821 2
skd  regular vol-page pattern in 5 volumes
   skd   part1: 315 1 315 1
   skd   part2: 937 1 937 1
   skd   part3: 792 1 792 1
   skd   part4: 565 1 565 1
   skd   part5: 555 1 555 1
snp page pattern in two parts
   snp   part1: 92 520 611 1
   snp   part2: 41 425 465 1
vei  regular vol-page pattern in 2 volumes
   vei   part1: 544 1 544 1
   vei   part2: 510 1 510 1
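
To make the v-p form and the page increment concrete, here is a small Python sketch. It splits a volume-page identifier into integers and recomputes the number of scans in each pwg volume from the first page, last page, and increment listed above. The function names and the zero-padded sample identifier are assumptions; actual identifiers may be written differently.

```python
def parse_volume_page(identifier):
    """Split a 'v-p' identifier into (volume, page) integers."""
    volume, page = identifier.split("-", 1)
    return int(volume), int(page)


def expected_scans(first, last, increment):
    """Number of images when pages run first, first+increment, ..., last."""
    span = last - first
    if span % increment != 0:
        raise ValueError("last page is not reachable from first page")
    return span // increment + 1


# Hypothetical identifier: page 456 of volume 3.
print(parse_volume_page("3-0456"))  # (3, 456)

# Last page of each pwg volume, taken from the list above (first page is 1).
pwg_last_pages = [1141, 1099, 1011, 1213, 1677, 1505, 1821]
counts = [expected_scans(1, last, 2) for last in pwg_last_pages]
print(counts)       # [571, 550, 506, 607, 839, 753, 911]
print(sum(counts))  # 4737
```

The total of 4737 agrees with the #vp count for pwg in the summary table.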

funderburkjim commented 4 years ago

pw is 'pseudo' volume-page

The page identifiers for pw are of the form 'vppp' (e.g., 2035), which is to be interpreted as 'page ppp in volume v'. The pagination is simple within each volume.

pw page pattern in 7 parts
   pw   part1: 282 1001 1282 1
   pw   part2: 284 2001 2284 1
   pw   part3: 246 3001 3246 1
   pw   part4: 290 4001 4290 1
   pw   part5: 240 5001 5240 1
   pw   part6: 292 6001 6292 1
   pw   part7: 288 7001 7288 1
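
Decoding a 'vppp' identifier is a single division by 1000. The sketch below assumes the identifier is a four-digit number whose leading digit is a volume from 1 to 7, as in the list above; the function name split_pw_identifier is hypothetical.

```python
def split_pw_identifier(identifier):
    """Split a pw identifier 'vppp' into (volume, page)."""
    volume, page = divmod(int(identifier), 1000)
    # The listing above shows volumes 1-7 and pages starting at 1.
    if not 1 <= volume <= 7 or page < 1:
        raise ValueError(f"not a pw page identifier: {identifier!r}")
    return volume, page


print(split_pw_identifier(2035))    # (2, 35)
print(split_pw_identifier("7288"))  # (7, 288)
```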

funderburkjim commented 4 years ago

This completes what I can think of now regarding the naming conventions of the scanned images.

Programs used in the analysis are here