jboarman closed this issue 1 year ago
The way I did it requires poppler-utils:
pdftoppm document.pdf some_name -r preferred_resolution -png
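For anyone who wants to rerun this over a whole folder of PDFs, here is a minimal sketch that wraps the same pdftoppm call in Python. It is not the script used to build the dataset; the function names, the directory layout, and the choice to name each output after the PDF's file stem are assumptions made for illustration. The 150 DPI default matches the resolution mentioned in this thread.

```python
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, output_dir, dpi=150):
    """Build the pdftoppm argument list for one PDF.

    Hypothetical helper: the output root is <output_dir>/<pdf stem>,
    and pdftoppm appends its own page suffix and the .png extension.
    """
    pdf_path = Path(pdf_path)
    output_root = Path(output_dir) / pdf_path.stem
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(output_root)]


def convert_pdfs(pdf_dir, output_dir, dpi=150):
    """Render every *.pdf in pdf_dir to PNG pages (requires poppler-utils)."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # check=True raises CalledProcessError if pdftoppm exits with an error
        subprocess.run(build_pdftoppm_command(pdf_path, output_dir, dpi), check=True)
```

Something like convert_pdfs("pdfs", "pages", dpi=300) would then regenerate the pages at a different resolution, assuming the source PDFs live in a folder named pdfs.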
Generating the clean/dirty split is another matter. I believe @kwcckw has the code for this, maybe in a notebook.
That's awesome that you tracked that issue right here in GH! 👍
To generate the clean/dirty split, we need a dataset with both clean and dirty images. Do we have that dataset here, or should this be a general script?
This should be a general script since shabby is more about creating a repeatable recipe than a specific dataset.
I added the code in this pull request: https://github.com/sparkfish/shabby-pages/pull/73 and this should be resolved now.
I may have just missed it, but I don't see the script that generates the pages from the PDFs. We chose 150 DPI output, for example. If we wanted to regenerate the dataset at a different resolution or with new PDF sources, we would need this script.