huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Stream dataset does not iterate if the batch size is larger than the dataset size (related to drop_last_batch) #7113

Closed: memray closed this issue 4 weeks ago

memray commented 1 month ago

Describe the bug

Hi there,

I use streaming and interleaving to combine multiple datasets saved as jsonl files. The dataset sizes vary widely (from ~100 to ~100k examples). I call dataset.map() with a large batch size to reduce the I/O cost. This worked fine with datasets-2.16.1, but the problem appeared after I upgraded to datasets-2.19.2, and it persists with 2.21.0.

Please see the code below to reproduce the problem.

The dataset iterates correctly if we set either streaming=False or drop_last_batch=False.

I have to keep drop_last_batch=True because this is for distributed training, where every rank needs to see the same number of batches.
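
For context, here is a minimal sketch of that setup (the jsonl paths and the 4096 batch size below are placeholders, not my exact pipeline):

import datasets

# Each jsonl file becomes a streaming IterableDataset; file sizes
# range from ~100 to ~100k examples.
streams = [
    datasets.load_dataset("json", data_files=path, split="train", streaming=True)
    for path in ["corpus_a.jsonl", "corpus_b.jsonl"]  # placeholder paths
]

# Interleave the streams into a single dataset, then map with a large
# batch size to amortize the I/O cost; drop_last_batch=True keeps the
# batch count uniform across distributed ranks.
mixed = datasets.interleave_datasets(streams)
mixed = mixed.map(lambda batch: batch, batched=True, batch_size=4096,
                  drop_last_batch=True)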

Steps to reproduce the bug

# datasets==2.21.0
import datasets

def data_prepare(examples):
    # Batched map function: `examples` is a dict mapping column names to lists.
    print(examples["sentence1"][0])
    return examples

batch_size = 101
# The dataset has 100 rows, so batch_size is larger than the dataset.
# The dataset iterates correctly if we set either streaming=False or drop_last_batch=False.
dataset = datasets.load_dataset("mteb/biosses-sts", split="test", streaming=True)
dataset = dataset.map(data_prepare,
                      drop_last_batch=True,
                      batched=True, batch_size=batch_size)

# With streaming=True and drop_last_batch=True, this loop yields nothing.
for ex in dataset:
    print(ex)

Expected behavior

The dataset should iterate regardless of the batch size.
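
For reference, here is a sketch of the two variants that (as noted above) do iterate; only the flags differ from the repro:

import datasets

def data_prepare(examples):
    # Same batched identity function as in the repro above.
    return examples

# Variant 1: keep streaming, but keep the final partial batch.
ds1 = datasets.load_dataset("mteb/biosses-sts", split="test", streaming=True)
ds1 = ds1.map(data_prepare, batched=True, batch_size=101, drop_last_batch=False)

# Variant 2: disable streaming (the issue only shows up with streaming=True).
ds2 = datasets.load_dataset("mteb/biosses-sts", split="test", streaming=False)
ds2 = ds2.map(data_prepare, batched=True, batch_size=101, drop_last_batch=True)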

Environment info

datasets 2.21.0 (problem also present in 2.19.2; 2.16.1 works as expected)

lhoestq commented 1 month ago

That's expected behavior: with drop_last_batch=True, a batch_size larger than the dataset means the only batch is incomplete, so it gets dropped and nothing is yielded. It's the same in torch:

>>> from torch.utils.data import DataLoader
>>> list(DataLoader(list(range(5)), batch_size=10, drop_last=True))
[]
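
For contrast, the final partial batch is yielded when drop_last=False, which mirrors drop_last_batch=False in datasets:

>>> list(DataLoader(list(range(5)), batch_size=10, drop_last=False))
[tensor([0, 1, 2, 3, 4])]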