tomekkorbak / pretraining-with-human-feedback

Code accompanying the paper Pretraining Language Models with Human Preferences
https://arxiv.org/abs/2302.08582
MIT License

Code doesn't run due to shuffle=True exception #7

Open · RylanSchaeffer opened this issue 1 year ago

RylanSchaeffer commented 1 year ago

I"m getting some weird error when using the default for the dataloader of shuffle=True. Can you please help me debug why this is occurring?

Traceback (most recent call last):
  File "/lfs/hyperturing2/0/rschaef/KoyejoLab-Pretrain-Human-Feedback/train.py", line 163, in <module>
    train(args.checkpoint_path, config=config)
  File "/lfs/hyperturing2/0/rschaef/KoyejoLab-Pretrain-Human-Feedback/train.py", line 139, in train
    trainer.train(resume_from_checkpoint=checkpoint_path)
  File "/lfs/hyperturing2/0/rschaef/miniconda3/envs/pretrain_hf/lib/python3.9/site-packages/transformers/trainer.py", line 1196, in train
    train_dataloader = self.get_train_dataloader()
  File "/lfs/hyperturing2/0/rschaef/KoyejoLab-Pretrain-Human-Feedback/apo/trainer.py", line 118, in get_train_dataloader
    return DataLoader(
  File "/lfs/hyperturing2/0/rschaef/miniconda3/envs/pretrain_hf/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 228, in __init__
    raise ValueError(
ValueError: DataLoader with IterableDataset: expected unspecified shuffle option, but got shuffle=True
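
For reference, here is a minimal standalone repro (a sketch using only stock PyTorch, independent of this repo) showing that a DataLoader rejects shuffle=True for any IterableDataset at construction time:

from torch.utils.data import DataLoader, IterableDataset


class DummyIterableDataset(IterableDataset):
    """Trivial iterable-style dataset yielding ten integers."""

    def __iter__(self):
        return iter(range(10))


# Raises: ValueError: DataLoader with IterableDataset: expected
# unspecified shuffle option, but got shuffle=True
loader = DataLoader(DummyIterableDataset(), batch_size=2, shuffle=True)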
RylanSchaeffer commented 1 year ago

Evidence that shuffle=True is in the repo:

https://github.com/tomekkorbak/pretraining-with-human-feedback/blob/master/apo/trainer.py#L124

Evidence that setting shuffle=True throws an error:

        if isinstance(dataset, IterableDataset):
            self._dataset_kind = _DatasetKind.Iterable
            # NOTE [ Custom Samplers and IterableDataset ]
            #
            # `IterableDataset` does not support custom `batch_sampler` or
            # `sampler` since the key is irrelevant (unless we support
            # generator-style dataset one day...).
            #
            # For `sampler`, we always create a dummy sampler. This is an
            # infinite sampler even when the dataset may have an implemented
            # finite `__len__` because in multi-process data loading, naive
            # settings will return duplicated data (which may be desired), and
            # thus using a sampler with length matching that of dataset will
            # cause data loss (you may have duplicates of the first couple
            # batches, but never see anything afterwards). Therefore,
            # `IterableDataset` always uses an infinite sampler, an instance of
            # `_InfiniteConstantSampler` defined above.
            #
            # A custom `batch_sampler` essentially only controls the batch size.
            # However, it is unclear how useful it would be since an iterable-style
            # dataset can handle that within itself. Moreover, it is pointless
            # in multi-process data loading as the assignment order of batches
            # to workers is an implementation detail so users can not control
            # how to batchify each worker's iterable. Thus, we disable this
            # option. If this turns out to be useful in future, we can re-enable
            # this, and support custom samplers that specify the assignments to
            # specific workers.
            if shuffle is not False:
                raise ValueError(
                    "DataLoader with IterableDataset: expected unspecified "
                    "shuffle option, but got shuffle={}".format(shuffle))