kaixiao closed this issue 2 years ago
@kaixiao it seems that in your example you have `drop_last=True`. We aim to maximize compatibility with the PyTorch DataLoader, and it seems that we have exactly the same behavior:
```python
In [3]: ch.utils.data.TensorDataset(ch.Tensor([(0, 1), (1, 2)]))
Out[3]: <torch.utils.data.dataset.TensorDataset at 0x7f824f9a47c0>

In [4]: dataset = ch.utils.data.TensorDataset(ch.Tensor([(0, 1), (1, 2)]))

In [5]: loader = ch.utils.data.DataLoader(dataset, batch_size=3, drop_last=True)

In [6]: len(loader)
Out[6]: 0
```
Feel free to reopen if you see a discrepancy with pytorch's DataLoader!
Thanks for the clarification, @GuillaumeLeclerc! In that case, I'm wondering if the default behavior for ffcv loaders should be `drop_last=False` instead, since that is the PyTorch default. But it's helpful to know that this is an easy fix.
Default `drop_last` behavior in PyTorch:

```python
In [2]: dataset = ch.utils.data.TensorDataset(ch.Tensor([(0, 1), (1, 2)]))

In [3]: loader = ch.utils.data.DataLoader(dataset, batch_size=3)

In [4]: len(loader)
Out[4]: 1
```
That's definitely an oversight on our part. I'm not sure changing it now is the right move, though, since people may already be relying on the default value.
Initializing the loader as follows:
results in a loader of length 0 when `batch_size > len(indices)`.
I think a warning, an error, or a length-1 loader containing exactly `len(indices)` elements would all be reasonable behaviors, and all more intuitive than silently returning a length-0 loader.
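For reference, the two possible behaviors can be contrasted with the stock PyTorch `DataLoader` (a minimal sketch; this does not reproduce the ffcv `Loader` initialization above, just the analogous `drop_last` semantics):

```python
import torch

# Dataset with only 2 samples, loaded with batch_size=3.
dataset = torch.utils.data.TensorDataset(torch.arange(2.0))

# With drop_last=True, the single undersized batch is dropped entirely,
# so the loader has length 0 and iterating over it yields nothing.
empty = torch.utils.data.DataLoader(dataset, batch_size=3, drop_last=True)
print(len(empty))    # 0

# With drop_last=False (the PyTorch default), the partial batch is kept,
# giving a length-1 loader containing all len(dataset) elements.
partial = torch.utils.data.DataLoader(dataset, batch_size=3, drop_last=False)
print(len(partial))  # 1
```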