I use streaming and interleaving to combine multiple datasets saved in jsonl files. The size of dataset can vary (from 100ish to 100k-ish). I use dataset.map() and a big batch size to reduce the IO cost. It was working fine with datasets-2.16.1 but this problem shows up after I upgraded to datasets-2.19.2. With 2.21.0 the problem remains.
Please see the code below to reproduce the problem.
The dataset can iterate correctly if we set either streaming=False or drop_last_batch=False.
I have to use drop_last_batch=True since it's for distributed training.
Steps to reproduce the bug
# datasets==2.21.0
import datasets
def data_prepare(examples):
print(examples["sentence1"][0])
return examples
batch_size = 101
# the size of the dataset is 100
# the dataset iterates correctly if we set either streaming=False or drop_last_batch=False
dataset = datasets.load_dataset("mteb/biosses-sts", split="test", streaming=True)
dataset = dataset.map(lambda x: data_prepare(x),
drop_last_batch=True,
batched=True, batch_size=batch_size)
for ex in dataset:
print(ex)
pass
Expected behavior
The dataset iterates regardless of the batch size.
Describe the bug
Hi there,
I use streaming and interleaving to combine multiple datasets saved in jsonl files. The size of dataset can vary (from 100ish to 100k-ish). I use dataset.map() and a big batch size to reduce the IO cost. It was working fine with datasets-2.16.1 but this problem shows up after I upgraded to datasets-2.19.2. With 2.21.0 the problem remains.
Please see the code below to reproduce the problem.
The dataset can iterate correctly if we set either streaming=False or drop_last_batch=False.
I have to use drop_last_batch=True since it's for distributed training.
Steps to reproduce the bug
Expected behavior
The dataset iterates regardless of the batch size.
Environment info
datasets
version: 2.21.0huggingface_hub
version: 0.24.5fsspec
version: 2024.2.0