poptorch.DataLoader only loads an integer number of batches

owainkenwayucl commented 3 months ago

I've found a bug in Poptorch 3.3.0 (is this fixed in 3.4.0? I don't see it mentioned in the bugfixes https://docs.graphcore.ai/projects/release-notes/en/3.4.0/release_notes/poptorch.html).

Basically, it appears that the poptorch.DataLoader only loads an Integer number of batches, so if your batch size does not exactly divide the number of items in your dataset, it leaves out some of the data.

As an example, the training set for FashionMNIST is 60,000 images. If we load this in a poptorch.DataLoader with a batch size of 128, we lose 96 records:

import torch
import poptorch
import torchvision

train_dataset = torchvision.datasets.FashionMNIST("data/", transform=torchvision.transforms.ToTensor(), download=True, train=True)

opts = poptorch.Options()

train_dataloader = poptorch.DataLoader(
    opts, train_dataset, batch_size=128, shuffle=True, num_workers=20
)

num = 0

for data, labels in train_dataloader:
    num += labels.shape[0]

print(f"Loaded {num} records from the FashionMNIST training set of 60,000 records.")

(poptorch_env) [uccaoke@mandelbrot Graphcore]$ python3 popdlbug.py 
Loaded 59904 records from the FashionMNIST training set of 60,000 records.

owainkenwayucl commented 3 months ago

Ah! I have discovered drop_last which defaults to True but if you set to False resolves this behaviour. Closing.

owainkenwayucl commented 3 months ago

Er.

But then of course the compile results in code that doesn't work, because the size of the last batch is wrong:

[17:22:21.466] [poptorch::python] [warning] The number of elements in the dataset (89996) is not divisible by the number of elements processed per step (256) and drop_last=False. The last tensor will have a bat
ch size of 140. To avoid having to handle this special case switch to drop_last=True

Followed by

[17:24:11.400] [poptorch::python] [critical] poptorch.poptorch_core.Error: 'poptorch_py_error': Shape mismatch for input_: expected torch.Size([256, 3, 28, 28]) but got torch.Size([140, 3, 28, 28]).
This error occurred because the inputs passed at runtime don't match the inputs used to compile the model.
To recompile the model for the new inputs create a new inferenceModel / trainingModel wrapper or call destroy() on the curent one and try again.

Traceback (most recent call last):
  File "/home/uccaoke/Source/ML_Playground/MedMNIST/Graphcore/test.py", line 128, in <module>
    _, loss = poptorch_model(inputs, targets)
  File "/home/uccaoke/Applications/PopTorch/poptorch_env/lib64/python3.9/site-packages/poptorch/_poplar_executor.py", line 1261, in __call__
    self._executable_inputs.validateInputs(in_tensors)
  File "/home/uccaoke/Applications/PopTorch/poptorch_env/lib64/python3.9/site-packages/poptorch/_args_parser.py", line 138, in validateInputs
    validate(self.arg_names[i], arg, inputs.args[i])
  File "/home/uccaoke/Applications/PopTorch/poptorch_env/lib64/python3.9/site-packages/poptorch/_args_parser.py", line 117, in validate
    raise _impl.createPoptorchError(
poptorch.poptorch_core.Error: 'poptorch_py_error': Shape mismatch for input_: expected torch.Size([256, 3, 28, 28]) but got torch.Size([140, 3, 28, 28]).
This error occurred because the inputs passed at runtime don't match the inputs used to compile the model.
To recompile the model for the new inputs create a new inferenceModel / trainingModel wrapper or call destroy() on the curent one and try again.

AnthonyBarbier commented 3 months ago

There is a note in the documentation about this issue: https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/batching.html#rebatching-iterable-datasets

Basically, you can't run an iteration on a partial batch because that would essentially require a recompilation of the model, so you either need to drop the last batch, or you need to re-batch in order to make the last batch a full batch.

owainkenwayucl commented 3 months ago

That makes sense. It does make things awkwardly complicated though! I guess it'd be neat to be able to trigger two compilations - one for the full batch size and one for the remainder.

graphcore / poptorch

poptorch.DataLoader only loads an integer number of batches #21