huggingface / pixparse

Pixel Parsing. A reproduction of OCR-free end-to-end document understanding models with open data

Support multi-page or valid page selection #2

Closed molbap closed 1 year ago

molbap commented 1 year ago

The current preprocessing selects one page at random from the document's existing pages.

https://github.com/huggingface/pixparse/blob/64257d721fb88600def794382f1736f36d98878c/src/pixparse/data/preprocess.py#L72C2-L80C63

However, some pages are empty (they contain no text) and are therefore invalid samples. As a step towards flexible multi-page handling, the selection should only consider valid pages.
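
A minimal sketch of what valid-page selection could look like. It is not the actual pixparse code. It assumes the per-page text is available as a list of strings (or `None`), one entry per page. The function name `select_valid_page` and the `page_texts` argument are hypothetical and used only for illustration.

```python
import random
from typing import Optional, Sequence, Tuple


def select_valid_page(
    page_texts: Sequence[Optional[str]],
    rng: Optional[random.Random] = None,
) -> Optional[Tuple[int, str]]:
    """Pick a random page among those that contain text.

    Hypothetical helper: `page_texts` is assumed to hold one text entry
    per page. Returns (page_index, text), or None if every page is empty.
    """
    if rng is None:
        rng = random.Random()

    # A page is treated as valid when its text is non-empty after
    # stripping whitespace; None entries count as empty pages.
    valid_indices = [
        i for i, text in enumerate(page_texts) if text and text.strip()
    ]
    if not valid_indices:
        # No usable page: the caller can skip this sample.
        return None

    idx = rng.choice(valid_indices)
    return idx, page_texts[idx]
```

Returning `None` when no page has text lets the caller drop the whole document instead of emitting an invalid sample. Returning the index alongside the text keeps the choice traceable, which may help when moving to real multi-page handling.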