sparkfish / shabby-pages

ShabbyPages is a state-of-the-art corpus of born-digital document images with both ground-truth and distorted versions, suitable for training models to reverse the distortions and recover the original, clean documents.
MIT License

Add missing script that generates the clean / dirty split #72

Closed by jboarman 1 year ago

jboarman commented 1 year ago

I may have just missed it, but I don't see the script that generates the pages from the PDFs. We chose 150 DPI output, for example. If we wanted to regenerate the dataset at a different resolution or with new PDF sources, we would need this script.

proofconstruction commented 1 year ago

The way I did it requires poppler-utils: pdftoppm document.pdf some_name -r preferred_resolution -png
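For regenerating pages at a different resolution or from new PDF sources, the command above could be wrapped in a small batch script. The following is only a sketch, not the script used to build the dataset: the function names `build_pdftoppm_command` and `render_pdfs` and the directory layout are hypothetical, and it assumes `pdftoppm` from poppler-utils is on the PATH. The 150 DPI default mirrors the resolution mentioned in this thread.

```python
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, output_prefix, dpi=150):
    """Return the argument list that renders one PDF to PNG pages."""
    # -r sets the resolution in DPI, -png selects PNG output.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(output_prefix)]


def render_pdfs(pdf_dir, output_dir, dpi=150):
    """Render every PDF found in pdf_dir to PNG pages under output_dir."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # pdftoppm appends the page number and ".png" to this prefix.
        prefix = output_dir / pdf_path.stem
        subprocess.run(build_pdftoppm_command(pdf_path, prefix, dpi), check=True)
```

Calling `render_pdfs("pdfs", "pages", dpi=300)` would then re-render every source PDF at 300 DPI, with the directory names here being placeholders.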

Generating the clean/dirty split is another matter. I believe @kwcckw has the code for this, maybe in a notebook.

jboarman commented 1 year ago

That's awesome that you tracked that issue right here in GH! 👍

kwcckw commented 1 year ago

To generate the clean/dirty split, we need a dataset with clean and dirty images. Do we have that dataset here, or should this be a general script?

jboarman commented 1 year ago

This should be a general script since shabby is more about creating a repeatable recipe than a specific dataset.
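As a rough illustration of what a dataset-independent script might look like, the sketch below pairs clean and dirty images by filename and partitions the pairs reproducibly. It is one possible reading of "clean/dirty split", not the code from the pull request: the function names, the assumption that paired images share a filename, and the 80/10/10 fractions are all hypothetical.

```python
import random
from pathlib import Path


def pair_images(clean_dir, dirty_dir, pattern="*.png"):
    """Match clean and dirty images that share a filename (an assumption)."""
    dirty_by_name = {p.name: p for p in Path(dirty_dir).glob(pattern)}
    pairs = []
    for clean_path in sorted(Path(clean_dir).glob(pattern)):
        dirty_path = dirty_by_name.get(clean_path.name)
        if dirty_path is not None:
            pairs.append((clean_path, dirty_path))
    return pairs


def split_pairs(pairs, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle pairs with a fixed seed and cut them into three subsets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    return {
        "train": shuffled[:n_train],
        "validate": shuffled[n_train:n_train + n_val],
        # Whatever remains after the first two cuts goes to the test set.
        "test": shuffled[n_train + n_val:],
    }
```

Fixing the seed keeps the partition repeatable across runs, which fits the goal of a recipe rather than a one-off dataset.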

kwcckw commented 1 year ago

I added the code in this pull request: https://github.com/sparkfish/shabby-pages/pull/73, so this should be resolved now.