miller-center / cpc-issues

Connecting Presidential Collections

Research and establish standards for scanning #4

Open waldoj opened 10 years ago

waldoj commented 10 years ago

Find somebody who has dealt with these problems at scale before to determine things like resolution, number of passes, grayscale versus color, compression, and file format.

waldoj commented 10 years ago

It turns out that the Internet Archive and the Library of Congress both provide some useful information about their own microfilm scanning processes. Also, there's the Federal Agencies Digitization Guidelines Initiative, a whole website dedicated to this topic.

The short version is that lossless compression is the only way to fly, and TIFF and JPEG 2000 are the acceptable formats. I grabbed a sample scan of a single microfilm image (a page from a newspaper), downsampled it to 300 dpi, and saved it in a variety of formats:

| Format | Size |
| --- | --- |
| TIFF | 7.3 MB |
| TIFF LZW | 2 MB |
| TIFF ZIP | 1.9 MB |
| JPEG 2000 | 4.7 MB |

These were decidedly not the results that I expected. I anticipated that lossless JPEG 2000 would be smaller than TIFF LZW, but it's not even close: the JPEG 2000 file is more than twice the size. ZIP compression is slow, both when reading and writing, and it's not universally supported, so I suspect it's not worth the 5% decrease in file size relative to LZW.
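For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the ratios from the sizes listed above. The numbers are copied from this single sample; the helper names (`percent_of`, `percent_saved`) are just illustrative, not part of any tool used for the scans.

```python
# Sizes in MB for the one sample scan, copied from the comparison above.
sizes_mb = {
    "TIFF": 7.3,
    "TIFF LZW": 2.0,
    "TIFF ZIP": 1.9,
    "JPEG 2000": 4.7,
}


def percent_of(size, baseline):
    """Return size expressed as a percentage of baseline."""
    return 100.0 * size / baseline


def percent_saved(smaller, larger):
    """Return how much smaller `smaller` is than `larger`, in percent."""
    return 100.0 * (larger - smaller) / larger


baseline = sizes_mb["TIFF"]  # uncompressed TIFF is the reference point
relative = {
    name: round(percent_of(size, baseline), 1)
    for name, size in sizes_mb.items()
}
zip_vs_lzw = round(percent_saved(sizes_mb["TIFF ZIP"], sizes_mb["TIFF LZW"]), 1)

for name, pct in relative.items():
    print(f"{name:10s} {sizes_mb[name]:4.1f} MB  {pct:5.1f}% of uncompressed TIFF")
print(f"ZIP is {zip_vs_lzw}% smaller than LZW")
```

On this sample, LZW and ZIP both land at roughly a quarter of the uncompressed size, while JPEG 2000 is closer to two thirds, which is why the extra cost of ZIP looks hard to justify.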

Of course, a large-scale test on the actual microfilm in question is in order. I expect that file sizes will vary widely, probably trending downward from this sample, since the image of a newspaper is more data-rich than a handwritten page or, certainly, the nearly blank divider images that are between each document on the microfilm for some presidential papers.
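A large-scale test mostly comes down to converting a batch of frames into each candidate format and then comparing the size distributions. As a rough sketch of the comparison half only, assuming the converted files sit under one directory and can be told apart by file extension (that layout, and the function name `summarize_sizes`, are assumptions for illustration):

```python
from collections import defaultdict
from pathlib import Path
from statistics import mean, median


def summarize_sizes(root):
    """Summarize file sizes (in bytes) under root, grouped by extension.

    Extensions are lowercased, so ".TIF" and ".tif" are counted together.
    """
    sizes_by_ext = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            sizes_by_ext[path.suffix.lower()].append(path.stat().st_size)

    return {
        ext: {
            "count": len(sizes),
            "min": min(sizes),
            "median": median(sizes),
            "mean": mean(sizes),
            "max": max(sizes),
            "total": sum(sizes),
        }
        for ext, sizes in sizes_by_ext.items()
    }


def print_summary(summary):
    """Print one line per extension, sizes converted to MB."""
    mb = 1024 * 1024
    for ext, stats in sorted(summary.items()):
        print(
            f"{ext or '(none)':8s} n={stats['count']:<6d} "
            f"median={stats['median'] / mb:6.2f} MB  "
            f"max={stats['max'] / mb:6.2f} MB  "
            f"total={stats['total'] / mb:8.2f} MB"
        )
```

Calling `print_summary(summarize_sizes("scans/"))` on a hypothetical `scans/` directory would show whether the median really does trend below the newspaper sample, and how far the near-blank divider frames pull it down.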