sparkfish / shabby-pages

ShabbyPages is a state-of-the-art corpus of born-digital document images with both ground-truth and distorted versions, suitable for training models to reverse distortions and recover the original, clean documents.
MIT License

Grayscale & resize PNGs #2

Closed: proofconstruction closed this issue 2 years ago

proofconstruction commented 2 years ago

We need to figure out which resolution we'd like to standardize on, but once we do, all the PNGs generated from #1 need to get grayscaled and sized to that resolution.

proofconstruction commented 2 years ago

Well, the pdftoppm documentation says the default resolution is 150 DPI. I didn't specify a resolution when running pdftoppm for #1, so what we have is the 150 DPI version of this dataset, which is 1.9 GB uncompressed.
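If we want the resolution pinned explicitly instead of relying on the default, the export could be wrapped roughly like this. This is only a sketch: `build_pdftoppm_cmd`, `export_pdf`, and the paths are hypothetical names, not code from the repo. The `-r`, `-png`, and `-gray` options are standard pdftoppm flags.

```python
import subprocess
from pathlib import Path


def build_pdftoppm_cmd(pdf_path, out_root, dpi=150, gray=False):
    """Build a pdftoppm command that writes PNGs at an explicit DPI."""
    cmd = ["pdftoppm", "-r", str(dpi), "-png"]
    if gray:
        cmd.append("-gray")  # ask pdftoppm for grayscale output directly
    cmd += [str(pdf_path), str(out_root)]
    return cmd


def export_pdf(pdf_path, out_dir, dpi=150, gray=False):
    """Run pdftoppm on one PDF; pdftoppm appends the page number to out_root."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_root = out_dir / Path(pdf_path).stem
    subprocess.run(build_pdftoppm_cmd(pdf_path, out_root, dpi, gray), check=True)
```

Calling `export_pdf("doc.pdf", "pngs", dpi=150)` would reproduce the current run with the resolution stated explicitly, assuming pdftoppm is on the PATH.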

proofconstruction commented 2 years ago

The smallest image is 502x502 pixels while the largest is 12157x3336.
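For anyone wanting to re-check these numbers, here is a rough sketch of one way to survey the sizes using only the standard library. It reads the width and height from each PNG's IHDR chunk rather than decoding the image. The function names are hypothetical, and ranking "smallest" and "largest" by pixel area is my assumption, not necessarily how the figures above were obtained.

```python
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(path):
    """Return (width, height) from the PNG IHDR chunk without decoding pixels."""
    with open(path, "rb") as f:
        header = f.read(24)
    # Layout: 8-byte signature, 4-byte chunk length, b"IHDR", then width and height.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"not a PNG file: {path}")
    return struct.unpack(">II", header[16:24])


def size_extremes(directory):
    """Return the (width, height) pairs with the smallest and largest pixel area."""
    sizes = [png_size(p) for p in sorted(Path(directory).glob("*.png"))]
    if not sizes:
        raise ValueError("no PNG files found")

    def area(wh):
        return wh[0] * wh[1]

    return min(sizes, key=area), max(sizes, key=area)
```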

The documents don't have to be visually appealing for analysis, so I think we're fine to resize them to 500x500 or lower. Not sure what the best practices are here.

jboarman commented 2 years ago

If we were doing image classification only, then 500x500 would be great. But for this exercise, I think we should maintain close to a realistic resolution and aspect ratio.

We could kick out documents that don't adhere to the common portrait orientation and 8.5"x11" Letter size. On the other hand, non-English documents in the corpus mean we will have non-Letter-sized documents. We could try rotating/resizing to a common page size instead of rejecting landscape or A4 sizes, etc.

We may have to settle on 200-300 DPI for this first release of the dataset, just to keep the storage/bandwidth requirements reasonable. We can consider releasing additional higher-res datasets later if we run out of time to get it done in this round.

proofconstruction commented 2 years ago

With the current 150dpi images, we're looking at 1275x1650 pixels for an 8.5"x11" Letter sheet. There are 2048 images smaller than this in at least one dimension, leaving a little over 4000 that could be scaled down to that size.
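The arithmetic behind those numbers, as a small sketch (helper names are hypothetical): pixel size is just inches times DPI, and an image counts as undersized if either dimension falls below the target.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given size in inches at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def count_undersized(sizes, target):
    """Count (width, height) pairs smaller than target in at least one dimension."""
    target_w, target_h = target
    return sum(1 for w, h in sizes if w < target_w or h < target_h)


letter_150 = page_pixels(8.5, 11, 150)  # (1275, 1650)
letter_200 = page_pixels(8.5, 11, 200)  # (1700, 2200)
letter_300 = page_pixels(8.5, 11, 300)  # (2550, 3300)
```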

We could export 200-300dpi images from the PDFs and then scale them down to that size.

jboarman commented 2 years ago

If a manual review of those smaller pages indicates that we should keep many or most of them, then we could lower our threshold and pad those images along whichever dimensions fall short.

For images that are slightly larger, we could probably shrink them within some reasonable tolerance (maybe up to 20% shrinkage is OK).
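To make the pad/shrink idea concrete, here is a rough sketch of the decision per image. Everything in it is an assumption for illustration: the `plan_fit` name, the action labels, the 1275x1650 target, and treating the 20% figure as a hard cutoff. It does not handle rotating landscape pages, and it is not the code in the repo.

```python
def plan_fit(width, height, target=(1275, 1650), max_shrink=0.20):
    """Decide how to bring one page image to the target size.

    Returns (action, scale), where action is one of
    "keep", "pad", "shrink", "shrink+pad", or "review".
    """
    target_w, target_h = target
    # Scale that makes both dimensions fit while preserving aspect ratio.
    # Capped at 1.0 so small images are padded rather than enlarged.
    scale = min(target_w / width, target_h / height, 1.0)
    if scale < 1.0 - max_shrink:
        return "review", scale  # shrinking would exceed the tolerance
    new_w, new_h = round(width * scale), round(height * scale)
    needs_pad = new_w < target_w or new_h < target_h
    if scale < 1.0:
        return ("shrink+pad" if needs_pad else "shrink"), scale
    return ("pad" if needs_pad else "keep"), scale
```

With these assumptions, the 502x502 image would be padded, a 1530x1980 image would be shrunk by about 17%, and the 12157x3336 image would be flagged for manual review.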

proofconstruction commented 2 years ago

We're staying on 150dpi for this run. Grayscaling and resizing/fitting code is in the build directory of the repo.