Open waldoj opened 11 years ago
It turns out that the Internet Archive and the Library of Congress both provide some useful information about their own microfilm scanning processes. Also, there's the Federal Agencies Digitization Guidelines Initiative, a whole website dedicated to this topic.
The short version is that lossless compression is the only way to fly, and TIFF and JPEG 2000 are the acceptable formats. I grabbed a sample scan of a single microfilm image (a page from a newspaper), downsampled it to 300 dpi, and saved it in a variety of formats:
Format | Size |
---|---|
TIFF | 7.3 MB |
TIFF LZW | 2 MB |
TIFF ZIP | 1.9 MB |
JPEG 2000 | 4.7 MB |
These were decidedly not the results that I expected. I anticipated that lossless JPEG 2000 would be smaller than TIFF LZW, but it's not even close: at 4.7 MB, the JPEG 2000 file is more than twice the size of the 2 MB LZW file. ZIP compression is slow, both when reading and writing, and it's not universally supported, so I suspect it's not worth the 5% decrease in file size relative to LZW.
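For anybody who wants to check the arithmetic, here is a quick sketch that works out the savings from the sizes in the table above. The numbers are copied from that single sample, and the function names are just made up for illustration.

```python
# Sizes in MB, copied from the table above (one sample image only).
sizes_mb = {
    "TIFF": 7.3,
    "TIFF LZW": 2.0,
    "TIFF ZIP": 1.9,
    "JPEG 2000": 4.7,
}


def percent_smaller(size, baseline):
    """Return how much smaller `size` is than `baseline`, as a percentage."""
    return (baseline - size) / baseline * 100


def savings_table(sizes, baseline_name):
    """Map each format (except the baseline) to its percent savings."""
    baseline = sizes[baseline_name]
    return {
        name: round(percent_smaller(size, baseline), 1)
        for name, size in sizes.items()
        if name != baseline_name
    }


if __name__ == "__main__":
    # Savings of each format relative to the plain TIFF file.
    for name, pct in savings_table(sizes_mb, "TIFF").items():
        print(f"{name}: {pct}% smaller than TIFF")
    # The ZIP-versus-LZW gap mentioned in the paragraph above.
    zip_vs_lzw = round(percent_smaller(sizes_mb["TIFF ZIP"], sizes_mb["TIFF LZW"]), 1)
    print(f"TIFF ZIP: {zip_vs_lzw}% smaller than TIFF LZW")
```

With sizes rounded to a tenth of a megabyte, these percentages are only rough, which is one more reason to measure a bigger batch.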
Of course, a large-scale test on the actual microfilm in question is in order. I expect that file sizes will vary widely, probably trending downward from this sample, since the image of a newspaper page is more data-rich than a handwritten page or, certainly, the nearly blank divider images that separate the documents on the microfilm for some presidential papers.
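A larger test could be tallied with something like the sketch below. It assumes the test scans have already been saved in each candidate format under one directory, and it groups file sizes by extension. The function name and the directory layout are hypothetical, not part of any existing tooling.

```python
import os
import statistics
from collections import defaultdict


def summarize_sizes(root):
    """Walk `root` and summarize file sizes (in bytes) per file extension."""
    sizes = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            # Group by lowercased extension, e.g. ".tif" or ".jp2".
            ext = os.path.splitext(filename)[1].lower()
            sizes[ext].append(os.path.getsize(os.path.join(dirpath, filename)))
    return {
        ext: {
            "count": len(values),
            "min": min(values),
            "median": statistics.median(values),
            "max": max(values),
            "total": sum(values),
        }
        for ext, values in sizes.items()
    }
```

Calling `summarize_sizes("scans")` on a hypothetical `scans` directory returns one entry per extension. The median and the min/max spread should show whether newspaper pages really are the worst case, and how much the near-blank divider frames pull the totals down.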
Find somebody who has dealt with these problems at scale before to determine things like resolution, number of passes, grayscale versus color, compression, and file format.