sparkfish / shabby-pages

ShabbyPages is a state-of-the-art corpus of born-digital document images with both ground-truth and distorted versions, appropriate for training models to reverse the distortions and recover the original clean documents.
MIT License

Will the model work on document images in other languages like Chinese? #82

Closed Yikai-Liao closed 10 months ago

Yikai-Liao commented 10 months ago

I just wonder about the generalization ability of the model.

Also, if there are charts and iconographs in the input document images, would the model work well?

Yikai-Liao commented 10 months ago

My mistake, it's just a dataset.

gxlarson commented 10 months ago

The dataset was created using Augraphy. If you wanted to build on the ShabbyPages dataset (to add more samples with charts or iconographs) or create a new one specifically for charts and iconographs, you would just have to find images or documents that contain them and then use Augraphy to create a noisy version of each.
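
As a rough illustration of that workflow, here is a minimal sketch, not the script used to build ShabbyPages. The helper names (`list_clean_images`, `make_noisy_pairs`, `run_with_augraphy`) and the directory layout are hypothetical. The wiring function assumes a recent Augraphy release in which `default_augraphy_pipeline()` can be imported from the top-level package and the pipeline object can be called directly on an image array; older releases may need `pipeline.augment(image)["output"]` instead. It also assumes OpenCV is installed for reading and writing images.

```python
from pathlib import Path

# Hypothetical choice of file types to treat as document images.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def list_clean_images(clean_dir):
    """Return the image files in clean_dir, sorted for a reproducible order."""
    return sorted(
        p for p in Path(clean_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def make_noisy_pairs(clean_dir, noisy_dir, degrade, load, save):
    """Write a degraded copy of each clean image; return (clean, noisy) path pairs.

    degrade: callable, image -> image (for example an Augraphy pipeline)
    load:    callable, path string -> image (for example cv2.imread)
    save:    callable, (path string, image) -> anything (for example cv2.imwrite)
    """
    noisy_dir = Path(noisy_dir)
    noisy_dir.mkdir(parents=True, exist_ok=True)
    pairs = []
    for clean_path in list_clean_images(clean_dir):
        image = load(str(clean_path))
        if image is None:  # cv2.imread returns None for unreadable files
            continue
        noisy_path = noisy_dir / clean_path.name  # same file name, other folder
        save(str(noisy_path), degrade(image))
        pairs.append((clean_path, noisy_path))
    return pairs


def run_with_augraphy(clean_dir, noisy_dir):
    """Wire the helper to OpenCV and Augraphy's default pipeline (assumed API)."""
    import cv2  # third-party: opencv-python
    from augraphy import default_augraphy_pipeline  # third-party: augraphy

    pipeline = default_augraphy_pipeline()
    # Assumption: calling the pipeline on an array returns the augmented array.
    return make_noisy_pairs(clean_dir, noisy_dir, pipeline, cv2.imread, cv2.imwrite)
```

Calling `run_with_augraphy("charts/clean", "charts/noisy")` (both paths are placeholders) would leave each clean image next to a noisy counterpart with the same file name, which is the ground-truth/distorted pairing a denoising model trains on. Passing `degrade`, `load`, and `save` as arguments keeps the pairing logic independent of any one Augraphy version or image library.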