Open funderburkjim opened 4 years ago
When a display needs to present the scanned image of a particular page of a particular dictionary (such as page 346 of MW), it consults the 'pdffiles.txt' file for that dictionary. All of these files are in the csl-websanlexicon repository. For the mw example, the file is csl-websanlexicon/v02/distinctfiles/mw/web/webtc/pdffiles.txt.
This file has 3 colon-delimited fields.
The getfiles function within servepdfClass.php in the csl-apidev repository shows how the identifier field in each line of pdffiles.txt is used to get the filename for the body pages.
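As a rough illustration of reading such a file, here is a minimal Python sketch that splits each line into its three colon-delimited fields. The field names (`identifier`, `filename`, `title`), their order, and the sample lines are assumptions made for illustration only; the getfiles function in servepdfClass.php remains the authoritative logic.

```python
from typing import NamedTuple


class PdfEntry(NamedTuple):
    identifier: str  # assumed first field: page identifier such as '346' or '1-0012'
    filename: str    # assumed second field: name of the scanned-image file
    title: str       # assumed third field: free-text label


def parse_pdffiles(lines):
    """Split each non-empty line into three colon-delimited fields."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # maxsplit=2 keeps any further ':' inside the last field.
        parts = line.split(":", 2)
        if len(parts) != 3:
            raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
        entries.append(PdfEntry(*parts))
    return entries


# Hypothetical sample lines, not copied from a real pdffiles.txt.
sample = [
    "346:pg_0346.pdf:page 346",
    "1-0012:pg1-0012.pdf:volume 1, page 12",
]
entries = parse_pdffiles(sample)
print(entries[0].identifier, entries[0].filename)
```

In practice the lines would come from something like `open(path, encoding="utf-8")` rather than an in-memory list.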
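For each dictionary, this table shows the number of body-page identifiers of the form v-p (#vp), the number of body-page identifiers that are plain page numbers (#p), the number of other (non-body) entries (#other), and the image type where one is noted (img-type):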
dictionary | #vp | #p | #other | img-type |
---|---|---|---|---|
acc | 1191 | 0 | 25 | |
ae | 0 | 501 | 0 | |
ap90 | 0 | 1196 | 15 | |
ben | 0 | 1127 | 0 | |
bhs | 0 | 676 | 11 | |
bop | 0 | 407 | 0 | |
bor | 0 | 783 | 25 | |
bur | 0 | 414 | 16 | |
cae | 0 | 672 | 5 | |
ccs | 0 | 541 | 0 | png |
gra | 0 | 888 | 5 | |
gst | 0 | 320 | 0 | |
ieg | 0 | 560 | 20 | |
inm | 0 | 787 | 65 | |
krm | 0 | 1489 | 0 | |
mci | 0 | 981 | 43 | |
md | 0 | 384 | 11 | jpg |
mw | 0 | 1333 | 36 | |
mw72 | 0 | 1186 | 26 | |
mwe | 0 | 860 | 0 | |
pe | 0 | 900 | 29 | |
pgn | 0 | 378 | 42 | |
pui | 2192 | 0 | 40 | |
pw | 0 | 1922 | 219 | png |
pwg | 4737 | 0 | 0 | |
sch | 0 | 395 | 0 | |
shs | 0 | 839 | 3 | |
skd | 3164 | 0 | 0 | |
snp | 0 | 133 | 2 | |
stc | 0 | 893 | 0 | |
vcp | 0 | 5407 | 0 | |
vei | 1054 | 0 | 101 | |
wil | 0 | 982 | 6 | jpg |
yat | 0 | 924 | 4 | |
ALL | 12338 | 27878 | 749 | |
The page part of the identifier (of body pages) is a digit sequence representing an integer.
In the case of mw, the first 36 lines of pdffiles.txt refer to non-body pages; ignoring those, there remain 1333 body pages, with the page numbering starting at 1 and continuing through 1333. So the body page identifiers are simple.
Here are 20 dictionaries whose body page identifiers are similarly simple:
dictionary | first | last | number |
---|---|---|---|
ae | 1 | 501 | 501 |
ap90 | 1 | 1196 | 1196 |
ben | 1 | 1127 | 1127 |
bop | 1 | 407 | 407 |
bor | 1 | 783 | 783 |
cae | 1 | 672 | 672 |
ccs | 1 | 541 | 541 |
gst | 1 | 320 | 320 |
ieg | 1 | 560 | 560 |
inm | 1 | 787 | 787 |
krm | 1 | 1489 | 1489 |
mci | 1 | 981 | 981 |
md | 1 | 384 | 384 |
mw | 1 | 1333 | 1333 |
mw72 | 1 | 1186 | 1186 |
mwe | 1 | 860 | 860 |
pe | 1 | 900 | 900 |
pgn | 1 | 378 | 378 |
shs | 1 | 839 | 839 |
wil | 1 | 982 | 982 |
After the changes to pdffiles.txt for bhs and bur mentioned in #12, bhs is now also simple and can be added to the table above: first 1, last 623, number 623.
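The notion of "simple" used here (body pages numbered 1 through N with nothing skipped) can be checked mechanically. The following is only a sketch; the helper name `is_simple` is made up for this comment, and the test data is generated from the counts in the table rather than read from the real pdffiles.txt.

```python
def is_simple(page_ids):
    """Return True when the page identifiers are exactly 1, 2, ..., N in order."""
    pages = [int(p) for p in page_ids]  # '0346' and '346' both become 346
    return pages == list(range(1, len(pages) + 1))


# Generated stand-ins: mw has pages 1..1333; bur has every second page from 4 to 758.
mw_like = [str(n) for n in range(1, 1334)]
bur_like = [str(n) for n in range(4, 759, 2)]
print(is_simple(mw_like), is_simple(bur_like))
```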
Six other dictionaries are in one volume; their page sequencing can be summarized as follows:
dictionary | #scans | first | last | increment | missing |
---|---|---|---|---|---|
bur | 378 | 4 | 758 | 2 | |
gra | 888 | 1 | 1775 | 2 | |
sch | 395 | 1 | 396 | 1 | 380 |
stc | 893 | 1 | 894 | 1 | 580 |
vcp | 5407 | 35 | 5441 | 1 | |
yat | 924 | 1 | 928 | 1 | 924,925,926,927 |
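A summary of this kind (#scans, first, last, increment, missing) can be derived from the list of page numbers alone. The sketch below is one way to do it, not necessarily how the analysis programs do it; it assumes the increment is the smallest gap between neighbouring pages, and the sample inputs are generated from the figures in the table.

```python
def summarize(page_ids):
    """Return (#scans, first, last, increment, missing) for a list of page numbers."""
    nums = sorted({int(p) for p in page_ids})
    first, last = nums[0], nums[-1]
    # Assumption: the smallest gap between neighbouring pages is the increment.
    increment = min((b - a for a, b in zip(nums, nums[1:])), default=1)
    present = set(nums)
    missing = [n for n in range(first, last + 1, increment) if n not in present]
    return len(nums), first, last, increment, missing


# Stand-ins built from the table: sch lacks page 380, bur steps by 2 from 4 to 758.
sch_like = [n for n in range(1, 397) if n != 380]
bur_like = list(range(4, 759, 2))
print(summarize(sch_like))
print(summarize(bur_like))
```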
6 of the dictionaries have scanned images in multiple volumes, and the file identifiers are of the form v-p (volume and page separated by '-' character). Within each volume, the pagination is simple (in the sense of above). Note that the 'page-increment' is '2' for pwg (i.e., there are 2 pages per image).
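dictionary | part | #scans | first | last | increment | pattern |
---|---|---|---|---|---|---|
acc | 1 | 795 | 1 | 795 | 1 | regular vol-page pattern in 3 volumes |
acc | 2 | 237 | 1 | 237 | 1 | regular vol-page pattern in 3 volumes |
acc | 3 | 159 | 1 | 159 | 1 | regular vol-page pattern in 3 volumes |
pui | 1 | 660 | 1 | 660 | 1 | regular vol-page pattern in 3 volumes |
pui | 2 | 746 | 1 | 746 | 1 | regular vol-page pattern in 3 volumes |
pui | 3 | 786 | 1 | 786 | 1 | regular vol-page pattern in 3 volumes |
pw | 1 | 282 | 1001 | 1282 | 1 | page pattern in 7 parts |
pw | 2 | 284 | 2001 | 2284 | 1 | page pattern in 7 parts |
pw | 3 | 246 | 3001 | 3246 | 1 | page pattern in 7 parts |
pw | 4 | 290 | 4001 | 4290 | 1 | page pattern in 7 parts |
pw | 5 | 240 | 5001 | 5240 | 1 | page pattern in 7 parts |
pw | 6 | 292 | 6001 | 6292 | 1 | page pattern in 7 parts |
pw | 7 | 288 | 7001 | 7288 | 1 | page pattern in 7 parts |
pwg | 1 | 571 | 1 | 1141 | 2 | regular vol-page pattern in 7 volumes |
pwg | 2 | 550 | 1 | 1099 | 2 | regular vol-page pattern in 7 volumes |
pwg | 3 | 506 | 1 | 1011 | 2 | regular vol-page pattern in 7 volumes |
pwg | 4 | 607 | 1 | 1213 | 2 | regular vol-page pattern in 7 volumes |
pwg | 5 | 839 | 1 | 1677 | 2 | regular vol-page pattern in 7 volumes |
pwg | 6 | 753 | 1 | 1505 | 2 | regular vol-page pattern in 7 volumes |
pwg | 7 | 911 | 1 | 1821 | 2 | regular vol-page pattern in 7 volumes |
skd | 1 | 315 | 1 | 315 | 1 | regular vol-page pattern in 5 volumes |
skd | 2 | 937 | 1 | 937 | 1 | regular vol-page pattern in 5 volumes |
skd | 3 | 792 | 1 | 792 | 1 | regular vol-page pattern in 5 volumes |
skd | 4 | 565 | 1 | 565 | 1 | regular vol-page pattern in 5 volumes |
skd | 5 | 555 | 1 | 555 | 1 | regular vol-page pattern in 5 volumes |
snp | 1 | 92 | 520 | 611 | 1 | page pattern in two parts |
snp | 2 | 41 | 425 | 465 | 1 | page pattern in two parts |
vei | 1 | 544 | 1 | 544 | 1 | regular vol-page pattern in 2 volumes |
vei | 2 | 510 | 1 | 510 | 1 | regular vol-page pattern in 2 volumes |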
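For the dictionaries whose identifiers have the form v-p, the per-volume figures can be obtained by splitting each identifier on '-' and grouping by volume. This is a sketch under the assumption that both parts are plain digit strings (possibly zero-padded); the function name `group_by_volume` and the generated vei-like identifiers are illustrative only.

```python
from collections import defaultdict


def group_by_volume(identifiers):
    """Map volume -> (#scans, first page, last page) for 'v-p' identifiers."""
    volumes = defaultdict(list)
    for ident in identifiers:
        vol, page = ident.split("-", 1)
        volumes[int(vol)].append(int(page))
    return {v: (len(p), min(p), max(p)) for v, p in sorted(volumes.items())}


# Stand-in identifiers built from the vei figures (544 and 510 pages).
vei_like = [f"1-{p:04d}" for p in range(1, 545)] + [f"2-{p:04d}" for p in range(1, 511)]
print(group_by_volume(vei_like))
```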
The page identifiers for pw are of the form 'vppp' (e.g., 2035), which is to be interpreted as 'page ppp in volume v'. The pagination is simple within each volume.
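Decoding a pw identifier then amounts to integer division by 1000. A minimal sketch (the function name is made up here, and it assumes a single-digit volume followed by a three-digit page, as described above):

```python
def split_pw_identifier(ident):
    """Split a pw identifier of the form 'vppp' into (volume, page)."""
    volume, page = divmod(int(ident), 1000)
    return volume, page


print(split_pw_identifier("2035"))
print(split_pw_identifier("7288"))
```

So 2035 reads as page 35 of volume 2, and 7288 as page 288 of volume 7.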
This completes what I can think of now regarding the naming conventions of the scanned images.
Programs used in the analysis are here
This is in response to an inquiry from @vocabulista, who needs to rename all of the scanned image files in a consistent, flat way.
First, the scanned images for each dictionary can be separated into two piles:
The following comments pertain only to the first pile.