huggingface / chug

Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets.
Apache License 2.0

multi-page images are returned as list instead of torch.Tensor #1

Closed (molbap closed 4 months ago)

molbap commented 1 year ago

In doc_anno_pipe: https://github.com/huggingface/chug/blob/cfb16882e1058b37871b61fe8f76830cef3d8750/src/chug/webdataset/doc_anno_pipe.py#L172C1-L178C23

If there is just one page, it is squeezed into a tensor; otherwise the pages are returned as a list, which breaks the interface downstream. Proposed solution: always return a tensor.
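A minimal sketch of the "always return a tensor" idea, not the actual code in doc_anno_pipe. It assumes every page has already been resized to the same shape, and the helper name `pages_to_tensor` is hypothetical:

```python
import torch


def pages_to_tensor(pages):
    """Stack per-page tensors into one (num_pages, C, H, W) tensor.

    Hypothetical helper: the page dimension is kept even for a single page,
    so callers see the same type and rank in every case.
    """
    pages = list(pages)
    if not pages:
        raise ValueError("expected at least one page")
    # torch.stack requires all pages to share the same shape (assumed here).
    return torch.stack(pages, dim=0)


single = pages_to_tensor([torch.zeros(3, 32, 32)])
multi = pages_to_tensor([torch.zeros(3, 32, 32) for _ in range(4)])
print(single.shape, multi.shape)
```

With this shape convention a one-page document comes back as a tensor with a leading dimension of 1 instead of being squeezed, so downstream code does not need to branch on the page count.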

rwightman commented 4 months ago

I've allowed multi-page docs to be returned as lists when batching is disabled (batch_size=None). This is handy for previewing and analyzing data via the loaders. Set page_sampling='all' / 'all_valid' as well.

For batched tensor output, we currently need to select a single page, e.g. by setting page_sampling to a mode that outputs one page. I have a partially working mode where you can enable 'expansion' together with multi-page sampling, and it will expand multiple pages into multiple samples (one page plus its annotation per sample). To support multiple pages in ONE sample, we need to add support for more advanced tokenization schemes, interleaving of image and text tokens, additional marker tokens, etc.
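The expansion idea described above can be sketched roughly as follows. This is an illustration only, not chug's implementation: the sample layout (`key`, `pages`, `annotations`) and the function name `expand_pages` are assumptions made for the example.

```python
def expand_pages(sample):
    """Yield one single-page sample per (page, annotation) pair.

    Assumes a hypothetical sample dict with a 'key', a list of 'pages',
    and a parallel list of per-page 'annotations'.
    """
    pages = sample["pages"]
    annotations = sample["annotations"]
    if len(pages) != len(annotations):
        raise ValueError("each page needs exactly one annotation")
    for index, (page, annotation) in enumerate(zip(pages, annotations)):
        # Each expanded sample carries a single page, so it can be batched
        # into a tensor like any other single-page sample.
        yield {
            "key": f"{sample['key']}-p{index}",
            "image": page,
            "text": annotation,
        }


doc = {
    "key": "doc0",
    "pages": ["page-0-pixels", "page-1-pixels"],
    "annotations": ["text of page 0", "text of page 1"],
}
expanded = list(expand_pages(doc))
print(len(expanded), expanded[1]["key"])
```

Because every expanded sample holds exactly one page, the batched path never has to handle a list of pages.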