Closed molbap closed 1 year ago
I tested this in a training run, and it seems to work. For samples without text, we now get:
2023-07-26,15:18:55 | ERROR | Issue processing annotation for pipe:aws s3 cp s3://<dataset_adress>/shard-xxx.tar -, <shard_key>.
2023-07-26,15:18:55 | WARNING | Handling webdataset error (RuntimeError('No non-empty page found after 10 attempts')). Ignoring.
However, due to an initial bug (now solved), I noticed that the current webdataset loader was only handling single-page samples, as referenced here: https://github.com/huggingface/chug/issues/1
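For illustration, the retry-then-skip behavior visible in the log lines above could look roughly like the sketch below. This is not the code from this PR: `pick_non_empty_page`, `iter_samples`, and the `{"pages": [{"text": ...}]}` sample layout are hypothetical names chosen for the example. Only the error message and the limit of 10 attempts are taken from the log.

```python
import logging
import random

logger = logging.getLogger(__name__)

MAX_ATTEMPTS = 10  # matches the count reported in the logged error message


def pick_non_empty_page(pages, max_attempts=MAX_ATTEMPTS, rng=random):
    """Randomly sample pages until one with non-empty text is found.

    Hypothetical helper: assumes each page is a dict with a "text" field.
    """
    for _ in range(max_attempts):
        page = rng.choice(pages)
        # Treat missing, None, or whitespace-only text as empty.
        if (page.get("text") or "").strip():
            return page
    raise RuntimeError(f"No non-empty page found after {max_attempts} attempts")


def iter_samples(documents):
    """Yield one page per document; log and skip documents that fail."""
    for doc in documents:
        try:
            yield pick_non_empty_page(doc["pages"])
        except RuntimeError as exc:
            # Mirrors the "warn and ignore" handling seen in the log.
            logger.warning("Handling webdataset error (%r). Ignoring.", exc)
            continue
```

With this shape, a document whose pages all lack text is dropped with a warning rather than stopping the training run, and each document still contributes at most one page.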
@molbap so, aware of the chug issue, but not sure how that's an issue here? regardless of whether the doc itself is multi-page, we're still only returning one page no? that FIXME only needs to be sorted out when we return multiple pages from multi-page docs and there's a fair bit more that needs to be figured out besides the dataloading before that will work well...
related to the PR itself, this looks good
Yes, the chug tensor -> list is not an issue here, just something to keep in mind. I'll merge that one then.
What does this PR do?