sparkfish / shabby-pages

ShabbyPages is a state-of-the-art corpus of born-digital document images, with both ground-truth and distorted versions, suitable for training models that reverse distortions and recover the original, denoised documents.
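For readers evaluating the corpus for denoising work, here is a minimal sketch of how paired ground-truth/distorted pages could be consumed for training; the `clean/` and `shabby/` directory names and the filename matching are illustrative assumptions, not the dataset's documented layout.

```python
# Minimal sketch (not the official loader): pairs a ground-truth page with its
# distorted counterpart so a denoising model can learn shabby -> clean.
# The "clean/" and "shabby/" directories and matching filenames are assumptions.
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class PairedPagesDataset(Dataset):
    """Yields (distorted, ground_truth) image pairs matched by filename."""

    def __init__(self, root, transform=None):
        self.clean_dir = Path(root) / "clean"    # assumed layout
        self.shabby_dir = Path(root) / "shabby"  # assumed layout
        self.names = sorted(p.name for p in self.shabby_dir.glob("*.png"))
        self.transform = transform

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        shabby = Image.open(self.shabby_dir / name).convert("L")
        clean = Image.open(self.clean_dir / name).convert("L")
        if self.transform is not None:
            shabby, clean = self.transform(shabby), self.transform(clean)
        return shabby, clean
```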

Where to download the dataset? #70

Closed: saifullah3396 closed this issue 1 year ago

saifullah3396 commented 1 year ago

Hi,

Great work! Where can I download the train/test sets that were already generated in the paper?

jboarman commented 1 year ago

Hi @saifullah3396, we're still in pre-print status on this paper and the associated dataset, but we are hoping to release both this summer.

Can you share more about your interests or how you plan to use the project?

saifullah3396 commented 1 year ago

Hey @jboarman thanks for your response!

I mainly plan to use it for large-scale pretraining of generative denoising models for binarization or corruption removal.

saifullah3396 commented 1 year ago

I am actually curious about the particular reason you cannot share the dataset while it is in preprint status. Would sharing it privately under some sort of agreement be acceptable for research purposes? I have been looking for large-scale datasets with fixed train/test splits for tasks such as binarization and learning corruption styles, so that we don't have to resort to very small datasets like DIBCO, LDM, etc. The earlier, the better!

jboarman commented 1 year ago

Thanks @saifullah3396. You make a compelling argument. We'll work to get this available as soon as possible.