oracle / tribuo

Tribuo - A Java machine learning library
https://tribuo.org
Apache License 2.0
1.28k stars 178 forks source link

IntArrayContainer merge method creates duplicates int values #383

Open gilleshuron opened 6 days ago

gilleshuron commented 6 days ago

Describe the bug

This method:

org.tribuo.common.tree.impl.IntArrayContainer#merge(java.util.List<int[]>, org.tribuo.common.tree.impl.IntArrayContainer, org.tribuo.common.tree.impl.IntArrayContainer)

... incorrectly merges list of int[], creates duplicated int values during the merge.

At the least, it impacts performance. Variance computing is completely wrong in the CARTRegressionTrainer.

To Reproduce

Create a unit test with this code:

public void bug() {
    IntArrayContainer first = new IntArrayContainer(10);
    IntArrayContainer second = new IntArrayContainer(10);
    List<int[]> list = List.of(
        new int[]{1, 2, 4},
        new int[]{3, 5, 6, 7, 8, 9, 10}
    );
    IntArrayContainer.merge(list, first, second); //<-- buggy method
    Assertions.assertEquals(1, first.array[0]);
    Assertions.assertEquals(2, first.array[1]); //fails !!! array is filed with [1,1,2,2,3,4,4,5,6,7,8,9,10,0,0]
    //expected is [1,2,3,4,5,6,7,8,9,10]

  }

Expected behavior

//expected is [1,2,3,4,5,6,7,8,9,10]

System information:

Additional context

The mistake may lie in how the firstBuffer is initialized and updated in each subsequent merge:

public static int[] merge(List<int[]> input, IntArrayContainer firstBuffer, IntArrayContainer secondBuffer) {
    if (input.size() > 0) {
        firstBuffer.fill(input.get(0));
        for (int i = 1; i < input.size(); i++) { // Start from the second array
            merge(firstBuffer, input.get(i), secondBuffer);
            IntArrayContainer tmp = secondBuffer;
            secondBuffer = firstBuffer;
            firstBuffer = tmp;
        }
        return firstBuffer.copy();
    } else {
        return new int[0];
    }
}

Modification made: Changed the loop start from i = 0 to i = 1. The first array is already used to initialize firstBuffer, so the merging should start from the second array (i = 1).

Craigacp commented 6 days ago

Thanks, that's definitely a bug. Maybe it's part of the cause of #374 as well. I'll work up some more tests for IntArrayContainer and put in your fix.