oracle / tribuo

Tribuo - A Java machine learning library
https://tribuo.org
Apache License 2.0

CART Regression throws ArrayIndexOutOfBoundsException when using TrainTestSplit with proportion 1.0 #374

Open Artraxon opened 3 months ago

Artraxon commented 3 months ago

Describe the bug

When using the TrainTestSplitter with a split proportion of 1.0 and seed 1L on a SQLDataSource that retrieves 1107 tuples, with the trainer configuration below, the following exception is thrown.

java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 1116 out of bounds for int[1107]
    at java.base/java.lang.System.arraycopy(Native Method)
    at org.tribuo.regression.rtree.impl.InvertedFeature.split(InvertedFeature.java:173)
    at org.tribuo.regression.rtree.impl.TreeFeature.split(TreeFeature.java:155)
    at org.tribuo.regression.rtree.impl.RegressorTrainingNode.splitAtBest(RegressorTrainingNode.java:322)
    at org.tribuo.regression.rtree.impl.RegressorTrainingNode.buildGreedyTree(RegressorTrainingNode.java:204)
    at org.tribuo.regression.rtree.impl.RegressorTrainingNode.buildTree(RegressorTrainingNode.java:152)
    at org.tribuo.regression.rtree.CARTRegressionTrainer.train(CARTRegressionTrainer.java:210)
    at org.tribuo.regression.rtree.CARTRegressionTrainer.train(CARTRegressionTrainer.java:60)
    at org.tribuo.ensemble.BaggingTrainer.trainSingleModel(BaggingTrainer.java:186)
    at org.tribuo.ensemble.BaggingTrainer.train(BaggingTrainer.java:168)
    at org.tribuo.ensemble.BaggingTrainer.train(BaggingTrainer.java:145)
    at org.tribuo.ensemble.BaggingTrainer.train(BaggingTrainer.java:140)
    at org.tribuo.ensemble.BaggingTrainer.train(BaggingTrainer.java:54)
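
For readers unfamiliar with this message: System.arraycopy raises it when the destination offset plus the number of elements to copy exceeds the destination array's length. The snippet below is a standalone illustration that reuses the two sizes from the trace (1116 and 1107). It is not Tribuo code, the class and method names are invented for the example, and the trace alone does not show how InvertedFeature.split arrives at those numbers.

```java
public class ArrayCopyOverflowDemo {

    /**
     * Tries to copy `count` ints into a fresh array of length `destLength`
     * and reports whether the copy ran past the end of the destination.
     */
    static boolean copyOverflows(int count, int destLength) {
        int[] src = new int[count];
        int[] dest = new int[destLength];
        try {
            // Fails when 0 + count > dest.length.
            System.arraycopy(src, 0, dest, 0, count);
            return false;
        } catch (IndexOutOfBoundsException e) {
            // HotSpot throws the ArrayIndexOutOfBoundsException subclass here.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(copyOverflows(1116, 1107)); // sizes from the trace
        System.out.println(copyOverflows(1107, 1107)); // exact fit
    }
}
```

In other words, the split step tried to write 9 more indices than the destination array (sized to the 1107 examples) could hold.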

To Reproduce

I use the following trainer configuration:

CARTRegressionTrainer cartTrainer = new CARTRegressionTrainer(10,
                                                              AbstractCARTTrainer.MIN_EXAMPLES,
                                                              0.0F,
                                                              0.5F,
                                                              false,
                                                              new MeanAbsoluteError(),
                                                              Trainer.DEFAULT_SEED);
Trainer<Regressor> rfTrainer = new RandomForestTrainer<>(cartTrainer,
                                                         new AveragingCombiner(),
                                                         100,
                                                         5);

The error does not occur when using XGBoost, or when using the SQLDataSource directly without passing it through the splitter, even though the number of tuples is the same.

Expected behaviour

I expect that using the TrainTestSplitter with a proportion of 1.0 behaves the same way as not using it at all (or at least does not produce an error).

System information:

Craigacp commented 3 months ago

Can you provide the code where you construct the dataset with and without the train test split? And also ask the training dataset how big it is?

I agree that using a train test split of 1.0 shouldn't crash the training run (though the test dataset will be malformed and we should put proper validation on the trainProportion argument), but I can't quite see where it's triggering the issue, especially if XGBoost is fine. While the trees & XGBoost have different methods to iterate the dataset, they both rely on the underlying list inside the dataset for their size information, so if that's an odd size I'd expect both of them to break.
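
A minimal sketch of the kind of argument validation mentioned above. This is not the TrainTestSplitter implementation: the helper name is hypothetical, and rejecting both 0.0 and 1.0 is an assumption made for illustration (the thread does not settle which bounds should be allowed).

```java
public final class SplitProportionCheck {

    private SplitProportionCheck() {}

    /**
     * Hypothetical guard: returns the proportion unchanged if it lies strictly
     * between 0 and 1, otherwise throws. The strict bounds are an assumption.
     */
    static double requireValidProportion(double trainProportion) {
        if (Double.isNaN(trainProportion) || trainProportion <= 0.0 || trainProportion >= 1.0) {
            throw new IllegalArgumentException(
                "trainProportion must be in (0, 1), found " + trainProportion);
        }
        return trainProportion;
    }

    public static void main(String[] args) {
        System.out.println(requireValidProportion(0.7));
        try {
            requireValidProportion(1.0);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A caller would run this check before constructing the splitter, so that a proportion of 1.0 fails fast with a clear message instead of producing an empty test set.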

Artraxon commented 3 months ago

I can provide the code, but I don't think it will help much because it is very generic, as it is part of a bigger system. The training set contains 1107 tuples. As far as I can tell, the dataset is the same whether I use the TrainTestSplitter or take it directly from the SQLDataSource, although of course I don't know which properties to look for.

        var sourceQuery = config.getDatasourceQuery();

        SQLDataSource<O> sqlSource = null;
        try {
            sqlSource = new SQLDataSource<>(
                    sourceQuery,
                    new SQLDBConfig("jdbc:duckdb:" + dbPath, Map.of()),
                    outputFactorySupplier.get(),
                    rowProcessor,
                    true);
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
        var totalSize = sizeFunction.apply(config);
        double splitProportion;
        if (config.getSampleAmount() < totalSize) {
            splitProportion = ((double) config.getSampleAmount()) / totalSize;
        } else {
            splitProportion = 1.0;
        }

        MutableDataset<O> trainingData;
        MutableDataset<O> testData = null;
        // Avoid passing it through the train test splitter with a split proportion of 1 because of a bug in Tribuo
        // that causes an exception in the CART regression trainer: https://github.com/oracle/tribuo/issues/374
        // Might also occur when using a proportion that's slightly less than 1.0 (like 0.99)
        if (splitProportion < 1) {
            var splitter = trainTestSplitter.apply(sqlSource, splitProportion);
            trainingData = new MutableDataset<>(splitter.getTrain());
            testData = new MutableDataset<>(splitter.getTest());
        } else {
            trainingData = new MutableDataset<>(sqlSource);
        }

Craigacp commented 3 months ago

Can you check whether the featureIDMap of the MutableDataset built with the splitter is equal to the one built without it? And does the problem still occur if you use CARTJointRegressionTrainer instead of CARTRegressionTrainer?
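
One way to run that comparison is to dump each dataset's feature names and IDs into a plain map and diff the two. The sketch below uses only the JDK; the class and method names are made up for illustration. Filling the maps from Tribuo (for example by iterating the result of getFeatureIDMap() and looking up each name's ID) is assumed here, not shown, and should be checked against the Tribuo javadoc.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

public class FeatureIdMapDiff {

    /**
     * Lists every feature whose ID differs between the two maps,
     * including features present in only one of them (shown as null).
     */
    static List<String> diff(Map<String, Integer> withSplitter,
                             Map<String, Integer> withoutSplitter) {
        // Union of feature names, sorted for stable output.
        TreeSet<String> names = new TreeSet<>(withSplitter.keySet());
        names.addAll(withoutSplitter.keySet());

        List<String> differences = new ArrayList<>();
        for (String name : names) {
            Integer a = withSplitter.get(name);
            Integer b = withoutSplitter.get(name);
            if (!Objects.equals(a, b)) {
                differences.add(name + ": " + a + " vs " + b);
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        // Placeholder data; real maps would come from the two datasets.
        Map<String, Integer> split = Map.of("age", 0, "height", 1);
        Map<String, Integer> direct = Map.of("age", 0, "height", 2, "weight", 1);
        diff(split, direct).forEach(System.out::println);
    }
}
```

An empty result means both datasets assign the same ID to every feature name, which would rule out the feature map as the source of the difference.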

Artraxon commented 2 months ago

Sorry for my late response. I'm quite busy with my master's thesis at the moment; I'll have a bit more time in October. Right now I don't have the time to reproduce the exact error again, as I'm also changing the training data and setup a lot, but I'd be able to reproduce it later and help find the error.

In the meantime, I got the same exception when training a random forest without splitting, this time using MeanSquaredError on a dataset that contains doubles about as close to zero as doubles allow, but only when the feature set contains both categorical and real features.

I think it is a bit difficult to debug this by just describing the characteristics of the datasets; I'd need to share them with you. At the moment I can't do that, as we might also use the datasets in a publication, but afterwards I can provide the exact datasets and code to replicate the errors.

Craigacp commented 2 months ago

Ok, so that sounds a lot more like a bug in the tree implementation itself rather than an issue with the train test splitter. Which is good, because the splitter is very simple and I really couldn't see what could go wrong there, but the tree implementation code is tricky and may well still have bugs. CARTJointRegressionTrainer and CARTRegressionTrainer have different underlying tree implementations, so if you could compare those two on identical datasets (as they should perform identically with only a single output dimension) that will help me narrow down where the issue is.