Dataset.createTransformers incorrectly iterated Dataset.data rather than using the dataset iterator. DatasetView overrides the iterator but leaves the data array empty, so on a view the transformations were fit incorrectly and the feature values were corrupted.
The fix makes Dataset.createTransformers use Dataset.size() and Dataset.iterator(), both of which can be overridden. The PR also includes two additional fixes for DatasetView behaviour: shuffling was incorrect because it could return data points that weren't in the view, and the provenance recorded the wrong indices (it was tracking the shuffle indices, not the indices selected for the view). I think this covers all direct uses of Dataset.data, so they are now routed through the proper methods.
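To make the failure mode concrete, here is a minimal sketch of the pattern, not the actual Tribuo code. SimpleDataset, SimpleView, meanFromField and meanFromIterator are hypothetical names invented for illustration; the mean stands in for whatever statistic a transformer would fit. The view overrides size() and iterator() but never fills its inherited backing list, so anything that reads the field directly sees no data.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a dataset that stores its examples in a backing list.
class SimpleDataset implements Iterable<Double> {
    protected final List<Double> data = new ArrayList<>();

    void add(double value) { data.add(value); }

    int size() { return data.size(); }

    @Override
    public Iterator<Double> iterator() { return data.iterator(); }

    // Buggy pattern: reads the backing list directly, bypassing any overrides.
    double meanFromField() {
        double sum = 0.0;
        for (double v : data) { sum += v; }
        return data.isEmpty() ? Double.NaN : sum / data.size();
    }

    // Fixed pattern: goes through size() and iterator(), which subclasses can override.
    double meanFromIterator() {
        double sum = 0.0;
        for (double v : this) { sum += v; }
        int n = size();
        return n == 0 ? Double.NaN : sum / n;
    }
}

// Hypothetical view: exposes a subset of the parent by index and leaves its own
// inherited backing list empty.
class SimpleView extends SimpleDataset {
    private final SimpleDataset parent;
    private final int[] indices;

    SimpleView(SimpleDataset parent, int[] indices) {
        this.parent = parent;
        this.indices = indices.clone();
    }

    @Override
    int size() { return indices.length; }

    @Override
    public Iterator<Double> iterator() {
        return new Iterator<Double>() {
            private int pos = 0;

            @Override
            public boolean hasNext() { return pos < indices.length; }

            @Override
            public Double next() {
                if (!hasNext()) { throw new NoSuchElementException(); }
                return parent.data.get(indices[pos++]);
            }
        };
    }
}

class ViewFitSketch {
    public static void main(String[] args) {
        SimpleDataset parent = new SimpleDataset();
        for (double v : new double[] {1.0, 2.0, 3.0, 4.0}) { parent.add(v); }
        SimpleView view = new SimpleView(parent, new int[] {1, 3}); // selects 2.0 and 4.0

        System.out.println(view.meanFromField());    // NaN: the view's own list is empty
        System.out.println(view.meanFromIterator()); // 3.0: mean of the selected values
    }
}
```

In this sketch the statistic fitted through the field is computed over zero examples, which is the same kind of silent mis-fit described above; routing through the overridable methods gives the value for the examples that are actually in the view.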
Motivation
This interaction causes poor performance when using TransformTrainer and CrossValidation; on the MNIST test I ran, it led to random performance. We originally found it in a different internal use case.