proofconstruction closed 2 years ago
Well, the documentation here says the default setting is 150dpi. I didn't specify a resolution when running pdftoppm for #1, so the 150dpi version of this dataset is 1.9G uncompressed.
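To avoid depending on the default, the resolution can be passed explicitly. Below is a minimal sketch, assuming pdftoppm (from poppler-utils) is on the PATH. The helper names and file paths are hypothetical, and this is not necessarily how the images for #1 were produced.

```python
import subprocess


def pdftoppm_command(pdf_path, out_prefix, dpi=150):
    """Build a pdftoppm command line that renders PNG pages at an explicit DPI."""
    # -r sets both the X and Y resolution; -png selects PNG output.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def render_pdf(pdf_path, out_prefix, dpi=150):
    """Run pdftoppm; raises CalledProcessError if the conversion fails."""
    subprocess.run(pdftoppm_command(pdf_path, out_prefix, dpi), check=True)
```

Calling `render_pdf("doc.pdf", "out/doc", dpi=200)` (hypothetical paths) would write one PNG per page using the `out/doc` prefix.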
The smallest image is 502x502 pixels while the largest is 12157x3336.
The documents don't have to be visually appealing for analysis, so I think we're fine to resize them to 500x500 or lower. Not sure what the best practices are here.
If we were doing image classification only, then 500x500 would be great. But for this exercise, I think we should maintain close to a realistic resolution and aspect ratio.
We could kick out documents that don't adhere to the common portrait orientation and 8.5"x11" size. On the other hand, non-English documents in the corpus mean we will have non-Letter sized documents. We could try rotating/resizing to a common page size instead of rejecting landscape or A4 sizes, etc.
We may have to settle on 200-300 DPI for this first release of the dataset, just to keep the storage/bandwidth requirements reasonable. We can consider releasing additional higher-res datasets later if we run out of time to get it done in this round.
With the current 150dpi images, we're looking at 1275x1650 resolution for an 8.5"x11" Letter sheet. There are 2048 images smaller than this in at least one dimension, leaving a little over 4000 that could be scaled down to that.
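The 1275x1650 figure is just the page size in inches multiplied by the DPI. A small sketch of that arithmetic for the resolutions discussed in this thread (the function name is hypothetical):

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given size (inches) rendered at dpi."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter (8.5" x 11") at the resolutions mentioned in this thread.
for dpi in (150, 200, 300):
    print(dpi, page_pixels(8.5, 11, dpi))
```

At 300dpi a Letter page has four times as many pixels as at 150dpi, which is the storage/bandwidth trade-off mentioned above.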
We could export 200-300dpi images from the PDFs and then scale them down to that size.
If a manual review of those smaller pages indicates that we should keep many or most of them, then we could lower our threshold and pad those images on the dimensions that fall short.
For images that are slightly larger, we could probably shrink those within some reasonable tolerance (maybe up to 20% shrinkage is OK).
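The pad-or-shrink idea could be sketched roughly as below. This is an illustration only, not the code in the repo: the function name, the "review" bucket for images that would need more than the tolerated shrinkage, and the exact 20% cutoff are all assumptions.

```python
TARGET = (1275, 1650)  # Letter at 150dpi, from the thread
MAX_SHRINK = 0.20      # assumed tolerance ("maybe up to 20% shrinkage")


def plan_fit(width, height, target=TARGET, max_shrink=MAX_SHRINK):
    """Decide how to fit an image into the target size.

    Returns (action, scaled_size, padding), where padding is the number of
    pixels to add in (width, height) after scaling.
    """
    tw, th = target
    # Scale so both dimensions fit inside the target; never upscale.
    scale = min(tw / width, th / height, 1.0)
    if 1.0 - scale > max_shrink:
        # Too much shrinkage would be needed; leave for manual review.
        return ("review", None, None)
    new_w = min(tw, round(width * scale))
    new_h = min(th, round(height * scale))
    padding = (tw - new_w, th - new_h)
    if scale < 1.0:
        action = "shrink"  # may still need padding on the other dimension
    elif padding != (0, 0):
        action = "pad"
    else:
        action = "keep"
    return (action, (new_w, new_h), padding)
```

With these assumptions, the 502x502 image would be padded, while the 12157x3336 image would be flagged for review rather than shrunk.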
We're staying on 150dpi for this run. Grayscaling and resizing/fitting code is in the build directory of the repo.
We need to figure out which resolution we'd like to standardize on, but once we do, all the PNGs generated from #1 need to get grayscaled and sized to that resolution.