oracle / tribuo

Tribuo - A Java machine learning library
https://tribuo.org
Apache License 2.0

Dataset.createTransformers fix for DatasetView/TransformTrainer #364

Closed Craigacp closed 2 months ago

Craigacp commented 3 months ago

Description

Dataset.createTransformers incorrectly iterated over Dataset.data directly rather than using the dataset's iterator. DatasetView overrides the iterator but leaves the data array empty, so on a view the transformations were fit incorrectly and the feature values were corrupted.

The fix makes Dataset.createTransformers use Dataset.size() and Dataset.iterator(), both of which can be overridden. The PR also includes two additional fixes for DatasetView behaviour: shuffling was incorrect because it could return data points that weren't in the view, and the provenance recorded the wrong indices (it tracked the shuffle indices rather than the indices selected for the view). I think this covers all direct uses of Dataset.data, so they are now routed through the proper methods.
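
To illustrate the failure mode, here is a minimal sketch of the pattern, not Tribuo's actual code. ToyDataset, ToyView, meanFromField and meanFromIterator are hypothetical stand-ins invented for this example. It assumes a view subclass that overrides size() and iterator() but leaves its inherited data list empty, as described above, and uses a mean as a stand-in for a fitted transformation statistic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-ins; these are NOT Tribuo's real classes.
public class ViewIterationSketch {

    /** A toy dataset which stores its values in a protected list. */
    static class ToyDataset implements Iterable<Double> {
        protected final List<Double> data = new ArrayList<>();

        ToyDataset(double... values) {
            for (double v : values) {
                data.add(v);
            }
        }

        public int size() {
            return data.size();
        }

        @Override
        public Iterator<Double> iterator() {
            return data.iterator();
        }

        /** Buggy pattern: reads the backing field, bypassing any overrides. */
        public double meanFromField() {
            double sum = 0.0;
            for (double v : data) {
                sum += v;
            }
            return data.isEmpty() ? Double.NaN : sum / data.size();
        }

        /** Fixed pattern: uses size() and iterator(), which subclasses can override. */
        public double meanFromIterator() {
            double sum = 0.0;
            for (double v : this) {
                sum += v;
            }
            return size() == 0 ? Double.NaN : sum / size();
        }
    }

    /** A toy view over selected indices of a parent; its own data list stays empty. */
    static class ToyView extends ToyDataset {
        private final ToyDataset parent;
        private final int[] indices;

        ToyView(ToyDataset parent, int... indices) {
            super();
            this.parent = parent;
            this.indices = indices.clone();
        }

        @Override
        public int size() {
            return indices.length;
        }

        @Override
        public Iterator<Double> iterator() {
            // Walk only the selected indices of the parent dataset.
            return Arrays.stream(indices).mapToObj(i -> parent.data.get(i)).iterator();
        }
    }

    public static void main(String[] args) {
        ToyDataset full = new ToyDataset(1.0, 2.0, 3.0, 4.0);
        ToyView view = new ToyView(full, 1, 3);
        System.out.println(view.meanFromField());    // NaN: the view's own list is empty
        System.out.println(view.meanFromIterator()); // 3.0: mean of the viewed values 2.0 and 4.0
    }
}
```

In this sketch the field-based statistic silently sees no data on a view, while the iterator-based one sees exactly the selected examples, which is the behaviour the fix relies on.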

Motivation

This interaction causes poor performance when TransformTrainer is used with CrossValidation. We first found it in a different internal use case, and in the MNIST test I ran afterwards it led to performance no better than random.