Closed molbap closed 1 year ago
Currently, preprocessing selects a page at random from the existing ones.
https://github.com/huggingface/pixparse/blob/64257d721fb88600def794382f1736f36d98878c/src/pixparse/data/preprocess.py#L72C2-L80C63
However, some pages are empty (they contain no text) and are therefore invalid samples. As a step towards flexible multipage handling, the selection should only pick from valid pages.
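A minimal sketch of what this could look like, not the actual pixparse implementation. It assumes each page's text is available as a string (or `None`) in a list; the function name `select_valid_page_index` and the `page_texts` argument are hypothetical and do not come from `preprocess.py`.

```python
import random
from typing import Optional, Sequence


def select_valid_page_index(
    page_texts: Sequence[Optional[str]],
    rng: Optional[random.Random] = None,
) -> Optional[int]:
    """Pick a random index among pages that contain text.

    Hypothetical helper: `page_texts` is assumed to hold one text string
    (or None) per page. Returns None when no page has any text, so the
    caller can skip the sample.
    """
    # Fall back to the module-level generator when no RNG is supplied.
    rng = rng or random
    # A page counts as valid if its text is non-empty after stripping whitespace.
    valid = [i for i, text in enumerate(page_texts) if text and text.strip()]
    if not valid:
        return None
    return rng.choice(valid)
```

Returning `None` when every page is empty leaves it to the caller to drop the sample rather than train on an invalid page.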