sparkfish / shabby-pages

ShabbyPages is a state-of-the-art corpus of born-digital document images, with both ground-truth and distorted versions, suitable for training models to reverse distortions and recover the original clean documents.
MIT License

Generate PNGs from PDFs #1

Closed proofconstruction closed 2 years ago

proofconstruction commented 2 years ago

We now have 600 "born-digital" source PDFs, many with multiple pages. We need to split these out into PDFs of each individual page, then convert all pages to PNG.
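
If single-page PDFs are wanted as an intermediate step, one option is pdfseparate, which ships in poppler-utils alongside pdftoppm. The following is only a sketch under assumptions: the function name split_pdfs, the default output directory pages, and the PDFSEPARATE override variable are all illustrative and not part of the repo.

```shell
# Sketch: split each PDF in the current directory into single-page PDFs.
# PDFSEPARATE is a hypothetical override hook; it defaults to poppler's pdfseparate.
PDFSEPARATE="${PDFSEPARATE:-pdfseparate}"

split_pdfs() {
    local out_dir="${1:-pages}" f
    mkdir -p "$out_dir" || return 1
    for f in *.pdf; do
        # Skip the literal pattern when the glob matched nothing.
        [ -e "$f" ] || continue
        # pdfseparate replaces %d in the output pattern with the page number.
        "$PDFSEPARATE" "$f" "$out_dir/${f%.pdf}-%d.pdf"
    done
}
```

Calling split_pdfs single would write files such as single/report-1.pdf, single/report-2.pdf for an input named report.pdf. Quoting "$f" keeps filenames with spaces intact.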

proofconstruction commented 2 years ago

poppler-utils came in handy: for i in $(ls); do pdftoppm $i $i -png; done;

proofconstruction commented 2 years ago

On a whim I ran ls | grep pdf | time parallel pdftoppm {} 300dpi/{} -png -r 300 and got the following:

parallel pdftoppm {} 300dpi/{} -png -r 300 7053.63s user 38.89s system 2121% cpu 5:34.29 total

I love GNU Parallel. 3.6GB of images generated in just a few minutes.
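
For reading the timing line: the cpu percentage is CPU time (user plus system) divided by wall-clock time, so it gives a rough measure of how many cores were kept busy. A quick check of that arithmetic, using only the numbers reported above (awk is used here purely as a calculator, and the helper name cpu_percent is made up for illustration):

```shell
# CPU utilisation = (user + system) / wall-clock time, as a whole percentage.
# "5:34.29 total" is 5*60 + 34.29 = 334.29 seconds of wall-clock time.
cpu_percent() {
    awk -v user="$1" -v sys="$2" -v wall="$3" \
        'BEGIN { printf "%d\n", int((user + sys) / wall * 100) }'
}

cpu_percent 7053.63 38.89 334.29   # prints 2121
```

In other words, roughly 21 cores' worth of work ran concurrently for the whole 5.5 minutes; run serially, the same CPU time would have taken close to two hours.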

jboarman commented 1 year ago

@kwcckw Can you add a bash script to the repo that performs this action given an input directory? We can note that pdftoppm can be installed in various ways, but apt-get update && apt-get install -y poppler-utils is how one would do it on Ubuntu. It won't work for everyone, but it would provide a starting point, and someone could commit a Windows way of doing this later if they wanted to.
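
A minimal sketch of what such a script could look like, assuming poppler-utils is already installed. The function name convert_dir, its argument order, the 300 DPI default, and the PDFTOPPM override variable are assumptions for illustration; this is not the script that was eventually committed.

```shell
# Sketch: convert every PDF directly inside an input directory to PNG,
# one image per page. PDFTOPPM is a hypothetical override hook that
# defaults to poppler's pdftoppm.
PDFTOPPM="${PDFTOPPM:-pdftoppm}"

convert_dir() {
    local in_dir="$1" out_dir="$2" dpi="${3:-300}" pdf name
    if [ ! -d "$in_dir" ]; then
        echo "not a directory: $in_dir" >&2
        return 1
    fi
    mkdir -p "$out_dir" || return 1
    for pdf in "$in_dir"/*.pdf; do
        # Skip the literal pattern when the glob matched nothing.
        [ -e "$pdf" ] || continue
        name="$(basename "$pdf" .pdf)"
        # -r sets the resolution in DPI; pdftoppm appends a page number
        # and .png to the output prefix.
        "$PDFTOPPM" -png -r "$dpi" "$pdf" "$out_dir/$name"
    done
}
```

For example, convert_dir ./pdfs ./300dpi 300 would read ./pdfs/*.pdf and write the page images into ./300dpi. Globbing instead of parsing ls output, and quoting every path, keeps filenames with spaces working.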

kwcckw commented 1 year ago

Alright, I will add a script to convert PDF to PNG on both Linux and Windows later. I can test on both Windows and Linux in Paperspace, but not on macOS.

kwcckw commented 1 year ago

I added Python code in this pull request, https://github.com/sparkfish/shabby-pages/pull/73, to convert PDFs into images. I tested it, and it should work on both Windows and Linux; it can also be run from the terminal.