huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Stream dataset does not iterate if the batch size is larger than the dataset size (related to drop_last_batch) #7113

Closed: memray closed this issue 4 weeks ago

memray commented 1 month ago

Describe the bug

Hi there,

I use streaming and interleaving to combine multiple datasets saved as jsonl files. The dataset sizes vary widely (from ~100 to ~100k examples). I call dataset.map() with a large batch size to reduce the I/O cost. This worked fine with datasets-2.16.1, but the problem appeared after I upgraded to datasets-2.19.2, and it persists with 2.21.0.

Please see the code below to reproduce the problem.

The dataset iterates correctly if we set either streaming=False or drop_last_batch=False.

I have to keep drop_last_batch=True because this is for distributed training, where every rank needs to see the same number of batches.
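
For context, here is a minimal sketch of that setup (the jsonl paths and the 4096 batch size below are placeholders, not my exact pipeline):

import datasets

# Each jsonl file becomes a streaming IterableDataset; file sizes
# range from ~100 to ~100k examples.
streams = [
    datasets.load_dataset("json", data_files=path, split="train", streaming=True)
    for path in ["corpus_a.jsonl", "corpus_b.jsonl"]  # placeholder paths
]

# Interleave the streams into a single dataset, then map with a large
# batch size to amortize the I/O cost; drop_last_batch=True keeps the
# batch count uniform across distributed ranks.
mixed = datasets.interleave_datasets(streams)
mixed = mixed.map(lambda batch: batch, batched=True, batch_size=4096,
                  drop_last_batch=True)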

Steps to reproduce the bug

# datasets==2.21.0
import datasets

def data_prepare(examples):
    # Batched map function: `examples` is a dict mapping column names to lists.
    print(examples["sentence1"][0])
    return examples

batch_size = 101
# The dataset has 100 rows, so batch_size is larger than the dataset.
# The dataset iterates correctly if we set either streaming=False or drop_last_batch=False.
dataset = datasets.load_dataset("mteb/biosses-sts", split="test", streaming=True)
dataset = dataset.map(data_prepare,
                      drop_last_batch=True,
                      batched=True, batch_size=batch_size)

# With streaming=True and drop_last_batch=True, this loop yields nothing.
for ex in dataset:
    print(ex)

Expected behavior

The dataset should iterate regardless of the batch size.
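
For reference, here is a sketch of the two variants that (as noted above) do iterate; only the flags differ from the repro:

import datasets

def data_prepare(examples):
    # Same batched identity function as in the repro above.
    return examples

# Variant 1: keep streaming, but keep the final partial batch.
ds1 = datasets.load_dataset("mteb/biosses-sts", split="test", streaming=True)
ds1 = ds1.map(data_prepare, batched=True, batch_size=101, drop_last_batch=False)

# Variant 2: disable streaming (the issue only shows up with streaming=True).
ds2 = datasets.load_dataset("mteb/biosses-sts", split="test", streaming=False)
ds2 = ds2.map(data_prepare, batched=True, batch_size=101, drop_last_batch=True)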

Environment info

datasets 2.21.0 (problem also present in 2.19.2; 2.16.1 works as expected)

lhoestq commented 1 month ago

That's expected behavior: with drop_last_batch=True, a batch_size larger than the dataset means the only batch is incomplete, so it gets dropped and nothing is yielded. It's the same in torch:

>>> from torch.utils.data import DataLoader
>>> list(DataLoader(list(range(5)), batch_size=10, drop_last=True))
[]
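
For contrast, the final partial batch is yielded when drop_last=False, which mirrors drop_last_batch=False in datasets:

>>> list(DataLoader(list(range(5)), batch_size=10, drop_last=False))
[tensor([0, 1, 2, 3, 4])]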