Closed owainkenwayucl closed 3 months ago
Ah! I have discovered drop_last which defaults to True but if you set to False resolves this behaviour. Closing.
Er.
But then of course the compile results in code that doesn't work, because the size of the last batch is wrong:
[17:22:21.466] [poptorch::python] [warning] The number of elements in the dataset (89996) is not divisible by the number of elements processed per step (256) and drop_last=False. The last tensor will have a bat
ch size of 140. To avoid having to handle this special case switch to drop_last=True
Followed by
[17:24:11.400] [poptorch::python] [critical] poptorch.poptorch_core.Error: 'poptorch_py_error': Shape mismatch for input_: expected torch.Size([256, 3, 28, 28]) but got torch.Size([140, 3, 28, 28]).
This error occurred because the inputs passed at runtime don't match the inputs used to compile the model.
To recompile the model for the new inputs create a new inferenceModel / trainingModel wrapper or call destroy() on the curent one and try again.
Traceback (most recent call last):
File "/home/uccaoke/Source/ML_Playground/MedMNIST/Graphcore/test.py", line 128, in <module>
_, loss = poptorch_model(inputs, targets)
File "/home/uccaoke/Applications/PopTorch/poptorch_env/lib64/python3.9/site-packages/poptorch/_poplar_executor.py", line 1261, in __call__
self._executable_inputs.validateInputs(in_tensors)
File "/home/uccaoke/Applications/PopTorch/poptorch_env/lib64/python3.9/site-packages/poptorch/_args_parser.py", line 138, in validateInputs
validate(self.arg_names[i], arg, inputs.args[i])
File "/home/uccaoke/Applications/PopTorch/poptorch_env/lib64/python3.9/site-packages/poptorch/_args_parser.py", line 117, in validate
raise _impl.createPoptorchError(
poptorch.poptorch_core.Error: 'poptorch_py_error': Shape mismatch for input_: expected torch.Size([256, 3, 28, 28]) but got torch.Size([140, 3, 28, 28]).
This error occurred because the inputs passed at runtime don't match the inputs used to compile the model.
To recompile the model for the new inputs create a new inferenceModel / trainingModel wrapper or call destroy() on the curent one and try again.
There is a note in the documentation about this issue: https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/batching.html#rebatching-iterable-datasets
Basically, you can't run an iteration on a partial batch because that would essentially require a recompilation of the model, so you either need to drop the last batch, or you need to re-batch in order to make the last batch a full batch.
That makes sense. It does make things awkwardly complicated though! I guess it'd be neat to be able to trigger two compilations - one for the full batch size and one for the remainder.
I've found a bug in Poptorch 3.3.0 (is this fixed in 3.4.0? I don't see it mentioned in the bugfixes https://docs.graphcore.ai/projects/release-notes/en/3.4.0/release_notes/poptorch.html).
Basically, it appears that the poptorch.DataLoader only loads an Integer number of batches, so if your batch size does not exactly divide the number of items in your dataset, it leaves out some of the data.
As an example, the training set for FashionMNIST is 60,000 images. If we load this in a poptorch.DataLoader with a batch size of 128, we lose 96 records: