huggingface / chug

Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets.
Apache License 2.0

[feature] Add dataloader support for non-webdataset dataset #2

Closed molbap closed 4 months ago

molbap commented 11 months ago

Currently, pixparse defines its own dataloaders for formats other than webdataset, e.g.:


    elif cfg.format == "hf_dataset":
        # In the case of hf datasets, we use the collator defined at task level
        dataset = load_dataset(cfg.source)[cfg.split]
        training_sampler = DistributedSampler(
            dataset, rank=global_rank, shuffle=True, seed=seed, num_replicas=world_size, drop_last=True
        )
        if is_train:
            # create a shared epoch store to sync epoch to dataloader worker proc
            shared_interval_count = SharedCount(count=start_interval)
        else:
            shared_interval_count = None
        num_batches = len(dataset) // cfg.batch_size
        base_loader = DataLoader(
            dataset=dataset,
            collate_fn=collate_fn,
            sampler=training_sampler,
            batch_size=cfg.batch_size,
            num_workers=cfg.num_workers,
        )
        loader = LoaderBundle(
            loader=base_loader,
            num_batches=num_batches,
            num_samples=cfg.num_samples,
            shared_interval=shared_interval_count,
        )
    return loader

Instead of keeping this util in pixparse, we could write it here so that batch creation is handled at a lower level, and then use chug as usual from the pixparse lib.
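
One detail a lower-level helper would probably need to settle is the batch count. The snippet above computes `num_batches` from the full dataset length, while `DistributedSampler` gives each rank only its own slice. Below is a minimal sketch of the per-rank arithmetic, assuming `cfg.batch_size` is a per-rank batch size (the snippet does not say either way). `batches_per_rank` is a hypothetical name used for illustration, not something that exists in chug or pixparse.

```python
import math


def batches_per_rank(dataset_len: int, world_size: int, batch_size: int,
                     drop_last_batch: bool = False) -> int:
    """Batches a single rank sees per epoch (hypothetical helper).

    Assumes a torch DistributedSampler created with drop_last=True, which
    trims the tail so every rank gets dataset_len // world_size samples,
    and assumes batch_size is the per-rank batch size.
    """
    if world_size <= 0 or batch_size <= 0:
        raise ValueError("world_size and batch_size must be positive")
    samples_per_rank = dataset_len // world_size
    if drop_last_batch:
        # DataLoader(drop_last=True) discards the final partial batch.
        return samples_per_rank // batch_size
    # DataLoader(drop_last=False), as in the snippet, keeps the partial batch.
    return math.ceil(samples_per_rank / batch_size)


if __name__ == "__main__":
    # 1000 samples over 4 ranks, 32 per batch: 250 samples per rank.
    print(batches_per_rank(1000, 4, 32))
    # For comparison, the global count used in the snippet.
    print(1000 // 32)
```

With 1000 samples, 4 ranks, and a batch size of 32, each rank iterates 8 batches per epoch, whereas `len(dataset) // cfg.batch_size` gives 31. If `cfg.batch_size` is meant as a global batch size instead, the arithmetic changes accordingly.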

rwightman commented 4 months ago

HF datasets support is all in chug now. It's a big open question whether we add support for other forms such as csv / file-folder directly, or focus on webdataset, HF datasets, and possibly other sharded formats that we create to address specific needs / performance...