haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io
Other
5.97k stars 1.13k forks source link

Hard Coded Max Limit In FP Tree #776

Closed CharlieAprog closed 2 weeks ago

CharlieAprog commented 3 weeks ago

When running the FP Growth algorithm and defining the FP Tree thre is a maximum limit of unique items currently set. This means that no frequent pattern mining can be completed if the number of unique items is high enough I found this when mining a very sparse dataset with the following error

Exception has occurred: java.lang.ArrayIndexOutOfBoundsException "java.lang.ArrayIndexOutOfBoundsException: Index 230037 out of bounds for length 65536" in line 302, FPTree.class

Expected behavior I would expect there to be NO upper limit to the number of unique items possible

Actual behavior Currently the upper limit of unique items is 65536

Location FPTree.class

    /**
     * Returns the frequency of single items.
     * @param itemsets the transaction database.
     * @return the frequency of single items
     */
    private int[] freq(Stream<int[]> itemsets) {
        int n = Integer.parseInt(System.getProperty("smile.arm.items", "65536"));
        int[] f = new int[n];
        itemsets.forEach(itemset -> {
            numTransactions++;
            for (int i : itemset) f[i]++;
        });
        while (f[--n] == 0);
        return Arrays.copyOf(f, n+1);
    }

on line 298 there is the line
int n = Integer.parseInt(System.getProperty("smile.arm.items", "65536"));

this should be the total number of unique items in the dataset instead of a fixed value of 65536

Additional context Currently Version: Maven, smile-core: 3.1.1

haifengl commented 3 weeks ago

It is a default value, not hard-coded value. You can use java option -Dsmile.arm.items=XYZ to pass your number of items. Note that the input is a stream so that we can scan the data twice. Otherwise, we can scan the data in the first pass to find out the number of unique items. On the other hand, the user often has a good idea of their itemset. So we rely this property to run the algorithm in single pass, which is also faster.