Feat/issue 2/find nonempty pages

huggingface / pixparse

Pixel Parsing. A reproduction of OCR-free end-to-end document understanding models with open data


Feat/issue 2/find nonempty pages #3

Closed molbap closed 1 year ago

molbap commented 1 year ago

What does this PR do?

Resolves #2
Add an index generator to iterate through valid pages for multi-page input
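
As a rough illustration of the idea (not the code from this PR), an index generator for multi-page input might look like the sketch below. All names here (`page_indices`, `find_nonempty_page`, `max_attempts`) and the representation of a page as a plain text string are assumptions made for the example. The error message mirrors the one that appears in the training log in this thread.

```python
import random
from typing import Iterator, Optional, Sequence


def page_indices(num_pages: int, rng: Optional[random.Random] = None) -> Iterator[int]:
    """Yield each page index exactly once, in shuffled order (hypothetical helper)."""
    rng = rng or random.Random()
    order = list(range(num_pages))
    rng.shuffle(order)
    yield from order


def find_nonempty_page(
    page_texts: Sequence[str],
    max_attempts: int = 10,
    rng: Optional[random.Random] = None,
) -> int:
    """Return the index of a page that has text, trying at most max_attempts pages.

    Assumption: each page is represented by its extracted text, and a page
    counts as empty when that text is blank or whitespace only.
    """
    for attempt, idx in enumerate(page_indices(len(page_texts), rng)):
        if attempt >= max_attempts:
            break
        if page_texts[idx].strip():
            return idx
    # Raised both when the attempt budget is used up and when the document
    # has fewer pages than max_attempts and all of them are empty.
    raise RuntimeError(f"No non-empty page found after {max_attempts} attempts")
```

Raising from the page-selection step lets the data loader's error handler skip the sample, which is consistent with the "Ignoring" warning reported in the log.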

molbap commented 1 year ago

I tested this in a training run, and it seems to work. For samples without text, we now get:

2023-07-26,15:18:55 | ERROR | Issue processing annotation for pipe:aws s3 cp s3://<dataset_adress>/shard-xxx.tar -, <shard_key>.
2023-07-26,15:18:55 | WARNING | Handling webdataset error (RuntimeError('No non-empty page found after 10 attempts')). Ignoring.

However, due to an initial bug (now solved), I noticed that the current webdataset loader only handles single-page samples, as referenced here: https://github.com/huggingface/chug/issues/1

rwightman commented 1 year ago

@molbap so, I'm aware of the chug issue, but not sure how that's an issue here? Regardless of whether the doc itself is multi-page, we're still only returning one page, no? That FIXME only needs to be sorted out when we return multiple pages from multi-page docs, and there's a fair bit more that needs to be figured out besides the dataloading before that will work well...

related to the PR itself, this looks good

molbap commented 1 year ago

Yes, the chug tensor -> list is not an issue here, just something to keep in mind. I'll merge this one then.