miller-center / cpc-issues

Connecting Presidential Collections

List possible types of metadata to be extracted from scans #5

Open waldoj opened 10 years ago

waldoj commented 10 years ago

To get started:

waldoj commented 10 years ago

Image density. That is, what percentage of the pixels are white, and what percentage are non-white? I expect we'll find that the pages in a given letter tend to have similar density. But I also worry that the range won't be great enough to use that information to tell where one document stops and another one starts.
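As a rough sketch of the density idea, assuming each scan has already been loaded as rows of 8-bit grayscale values (0 is black, 255 is white). The function names, the `white_threshold` of 250, and the `min_jump` of 0.05 are all hypothetical placeholders, not values derived from our scans:

```python
def non_white_density(pixels, white_threshold=250):
    """Return the fraction of pixels darker than white_threshold.

    `pixels` is a list of rows, each row a list of grayscale values (0-255).
    The threshold is an assumption; scanned paper is rarely pure white.
    """
    total = 0
    non_white = 0
    for row in pixels:
        for value in row:
            total += 1
            if value < white_threshold:
                non_white += 1
    return non_white / total if total else 0.0


def likely_boundaries(densities, min_jump=0.05):
    """Return indexes of pages whose density differs sharply from the prior page.

    This is only a guess at where a new document might start; min_jump
    would need tuning against real scans.
    """
    return [
        i
        for i in range(1, len(densities))
        if abs(densities[i] - densities[i - 1]) >= min_jump
    ]


# A tiny 2x4 "page": three dark pixels out of eight.
page = [[255, 255, 0, 0], [255, 255, 255, 0]]
print(non_white_density(page))

# Densities for four consecutive pages; the jump is before page index 2.
print(likely_boundaries([0.10, 0.11, 0.30, 0.31]))
```

If the densities across a whole collection turn out to sit in a narrow band, no value of `min_jump` will separate documents reliably, which is the worry above.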

It's worth trying this with a histogram, too. Obviously, that's more complicated than a simple black/white calculation.
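A minimal sketch of the histogram variant, again assuming pages are available as rows of 8-bit grayscale values. The bin count and the choice of an L1 distance between histograms are assumptions for illustration, not something we've settled on:

```python
def gray_histogram(pixels, bins=8):
    """Return a normalized histogram of grayscale values (0-255).

    Each entry is the fraction of pixels falling in that brightness bin,
    so pages of different sizes can be compared directly.
    """
    counts = [0] * bins
    total = 0
    for row in pixels:
        for value in row:
            # Map 0-255 onto bin indexes 0..bins-1.
            counts[min(value * bins // 256, bins - 1)] += 1
            total += 1
    if total == 0:
        return [0.0] * bins
    return [count / total for count in counts]


def histogram_distance(a, b):
    """Sum of absolute differences between two normalized histograms.

    0 means identical distributions; 2 means no overlap at all.
    """
    return sum(abs(x - y) for x, y in zip(a, b))


page_a = [[0, 255], [128, 255]]
page_b = [[255, 255], [255, 255]]
hist_a = gray_histogram(page_a, bins=4)
hist_b = gray_histogram(page_b, bins=4)
print(hist_a, hist_b, histogram_distance(hist_a, hist_b))
```

Comparing each page's histogram with the previous page's gives one number per page transition, which can be inspected the same way as the simple density differences.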

waldoj commented 10 years ago

Overlaid metadata. Some page scans include things like page numbers in the bottom corner, or origin labels at the top. This is important to capture, not least because it allows us to match up scans with finding aids.