mhahsler / arules

Mining Association Rules and Frequent Itemsets with R
http://mhahsler.github.io/arules
GNU General Public License v3.0

Parameter to define minimum support (inclusion) #72

Closed GauthierMagnin closed 2 years ago

GauthierMagnin commented 2 years ago

Hello,

When mining frequent itemsets, I read that the parameter support defines the minimum support for an itemset to be considered frequent. In other words, an itemset must have a support greater than or equal to (>=) the given threshold to be considered frequent.

In the following example, there are two transactions, each containing a different single-item itemset, so each itemset has a support of 0.5. However, with a support threshold of 0.5, no itemset is considered frequent, whereas with a threshold of 0.4 both itemsets are.

library(arules) # 1.7-3

data = list(t1 = "A",
            t2 = "B")
labels = c("A", "B")
transact = as(encode(data, labels), "transactions")

eclat(transact, parameter = list(support = 0.5)) # First call
eclat(transact, parameter = list(support = 0.4)) # Second call

Here is the output of the first call:

Eclat

parameter specification:
 tidLists support minlen maxlen            target  ext
    FALSE     0.5      1     10 frequent itemsets TRUE

algorithmic control:
 sparse sort verbose
      7   -2    TRUE

Absolute minimum support count: 1 

eclat - zero frequent items
set of 0 itemsets

Here is the output of the second call:

Eclat

parameter specification:
 tidLists support minlen maxlen            target  ext
    FALSE     0.4      1     10 frequent itemsets TRUE

algorithmic control:
 sparse sort verbose
      7   -2    TRUE

Absolute minimum support count: 0 

create itemset ... 
set transactions ...[2 item(s), 2 transaction(s)] done [0.00s].
sorting and recoding items ... [2 item(s)] done [0.00s].
creating bit matrix ... [2 row(s), 2 column(s)] done [0.00s].
writing  ... [2 set(s)] done [0.00s].
Creating S4 object  ... done [0.00s].
set of 2 itemsets 

Is this lower bound supposed to be included or excluded? Is the threshold being applied as > instead of >=, or am I misinterpreting the documentation? If I am wrong, could a parameter be added in the near future to choose whether the threshold is inclusive or exclusive?
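
The inclusive behaviour the question expects can be sketched in base R, without arules. The helper names item_support and frequent_items are hypothetical and exist only to illustrate the >= comparison on single items; this is not how eclat is implemented internally.

```r
# Relative support of each single item: the fraction of transactions containing it.
item_support <- function(transactions) {
  n <- length(transactions)
  # Count each item at most once per transaction.
  counts <- table(unlist(lapply(transactions, unique)))
  support <- as.vector(counts) / n
  names(support) <- names(counts)
  support
}

# Keep the items whose support is greater than or equal to the threshold.
frequent_items <- function(transactions, min_support) {
  support <- item_support(transactions)
  support[support >= min_support]
}

data <- list(t1 = "A", t2 = "B")
frequent_items(data, 0.5)  # A and B both have support 0.5 and are kept
```

Under this inclusive reading, a threshold of 0.5 keeps both items in the two-transaction example, which is the result the report expected from the first call.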

GauthierMagnin commented 2 years ago

If this can help, here are two other examples. Both use 4 transactions and a minimum support of 0.25. The first contains 4 different itemsets, and none is considered frequent. The second contains 3 different itemsets, and all are considered frequent. The itemsets {B} and {C} appear in both examples with the same support (0.25), yet they are considered frequent in only one of them.

First example:

data = list(t1 = "A",
            t2 = "B",
            t3 = "C",
            t4 = "D")
labels = c("A", "B", "C", "D")
transact = as(encode(data, labels), "transactions")

inspect(eclat(transact, parameter = list(support = 0.25)))
Eclat

parameter specification:
 tidLists support minlen maxlen            target  ext
    FALSE    0.25      1     10 frequent itemsets TRUE

algorithmic control:
 sparse sort verbose
      7   -2    TRUE

Absolute minimum support count: 1 

eclat - zero frequent items
set of 0 itemsets 

Second example:

data = list(t1 = "A",
            t2 = "B",
            t3 = "C",
            t4 = "A")
labels = c("A", "B", "C")
transact = as(encode(data, labels), "transactions")

inspect(eclat(transact, parameter = list(support = 0.25)))
Eclat

parameter specification:
 tidLists support minlen maxlen            target  ext
    FALSE    0.25      1     10 frequent itemsets TRUE

algorithmic control:
 sparse sort verbose
      7   -2    TRUE

Absolute minimum support count: 1 

create itemset ... 
set transactions ...[3 item(s), 4 transaction(s)] done [0.00s].
sorting and recoding items ... [3 item(s)] done [0.00s].
creating bit matrix ... [3 row(s), 4 column(s)] done [0.00s].
writing  ... [3 set(s)] done [0.00s].
Creating S4 object  ... done [0.00s].
    items support count
[1] {A}   0.50    2    
[2] {B}   0.25    1    
[3] {C}   0.25    1    

mhahsler commented 2 years ago

Thank you for the detailed bug report. The issue with eclat is now (hopefully) fixed in the development version on GitHub. Let me know if the results are still not as expected.

The fix will be part of the next CRAN release. Please use the GitHub version till then.

Regards, -MFH

GauthierMagnin commented 2 years ago

It seems to work as expected now. Thank you very much for the quick fix.

luciat-92 commented 1 year ago

Hello,

I have a similar issue related to the minimum support parameter, which results in an error. I am using arules version 1.7.5 and R 4.1.0. Specifically, the error occurs when the absolute minimum support count equals the highest transaction count of any item.

Here is an example:

nsamples_tot <- 2058
transactions <- matrix(0, nrow = 303, ncol = 179)
transactions[1:21,1] <- 1
transactions[10:20,2] <- 1

globalSupport <- 0.01
nsamples <- ceiling(globalSupport * nsamples_tot)
minSupport <- nsamples/nrow(transactions)
minSupport
[1] 0.06930693
eclat(transactions, parameter = list(support = minSupport))
Eclat

parameter specification:
 tidLists    support minlen maxlen            target  ext
    FALSE 0.06930693      1     10 frequent itemsets TRUE

algorithmic control:
 sparse sort verbose
      7   -2    TRUE

Absolute minimum support count: 21 

create itemset ... 
set transactions ...[2 item(s), 303 transaction(s)] done [0.00s].
sorting and recoding items ... [0 item(s)] done [0.00s].
Error in eclat(transactions, parameter = list(support = minSupport)) : 
  no items or transactions to work on

However, passing the value 0.06930693 directly as a literal works:

inspect(eclat(transactions, parameter = list(support = 0.06930693)))
Eclat

parameter specification:
 tidLists    support minlen maxlen            target  ext
    FALSE 0.06930693      1     10 frequent itemsets TRUE

algorithmic control:
 sparse sort verbose
      7   -2    TRUE

Absolute minimum support count: 20 

create itemset ... 
set transactions ...[2 item(s), 303 transaction(s)] done [0.00s].
sorting and recoding items ... [1 item(s)] done [0.00s].
creating sparse bit matrix ... [1 row(s), 303 column(s)] done [0.00s].
writing  ... [1 set(s)] done [0.00s].
Creating S4 object  ... done [0.00s].
    items support    count
[1] {1}   0.06930693 21   

But it returns no itemsets when the support is rounded to two decimal places (0.07), although the reported absolute minimum support count is 21.

inspect(eclat(transactions, parameter = list(support = 0.07)))
Eclat

parameter specification:
 tidLists support minlen maxlen            target  ext
    FALSE    0.07      1     10 frequent itemsets TRUE

algorithmic control:
 sparse sort verbose
      7   -2    TRUE

Absolute minimum support count: 21 

eclat - zero frequent items

If I understand correctly, the function should include the boundary case (>=).

Thanks for your help!

mhahsler commented 1 year ago

Number representation is, unfortunately, complicated (binary representation, floating-point representation, and rounding). When you say

minSupport <- nsamples/nrow(transactions)

then the result of the division may be rounded up at the last representable digit, and that is why you find no results.

You need to manually make sure that you always round down.

minSupport <- nsamples/nrow(transactions)
minSupport
[1] 0.06930693
sprintf("%.100f", minSupport)
[1] "0.0693069306930693129764620152855059131979942321777343750000000000000000000000000000000000000000000000"
# round down with 6 digits
dig <- 6
minSupport_rounded_down <- round(minSupport - .5*10^(-dig), digits = dig)
sprintf("%.100f", minSupport_rounded_down)
[1] "0.0693060000000000064890315343291149474680423736572265625000000000000000000000000000000000000000000000"

eclat(transactions, parameter = list(support = minSupport_rounded_down))
Eclat

parameter specification:
 tidLists  support minlen maxlen            target  ext
    FALSE 0.069306      1     10 frequent itemsets TRUE

algorithmic control:
 sparse sort verbose
      7   -2    TRUE

Absolute minimum support count: 20 

create itemset ... 
set transactions ...[2 item(s), 303 transaction(s)] done [0.00s].
sorting and recoding items ... [1 item(s)] done [0.00s].
creating sparse bit matrix ... [1 row(s), 303 column(s)] done [0.00s].
writing  ... [1 set(s)] done [0.00s].
Creating S4 object  ... done [0.00s].
set of 1 itemsets 

luciat-92 commented 1 year ago

Thank you very much for the help!

mhahsler commented 1 year ago

I have now added code to eclat and apriori to round down automatically at the C level. This will prevent this unexpected behavior in the future. The addition will be part of the next release.

Thanks again for the comprehensive code that shows the behavior!
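
For readers who want to see the idea behind such a guard, here is a minimal base R sketch. It is not the code added to arules; the helper name safe_min_count and the tolerance of 1e-9 are assumptions chosen for illustration. It converts a relative support into an absolute count while tolerating a product that lands a tiny amount above the intended integer.

```r
n_transactions <- 303
target_count <- 21

# Relative threshold derived from an absolute count, as in the example above.
min_support <- target_count / n_transactions

# A plain ceiling() can overshoot if min_support * n_transactions ends up
# slightly above the intended integer because of floating-point rounding.
naive_count <- ceiling(min_support * n_transactions)

# Hypothetical helper: subtract a small tolerance before taking the ceiling,
# so that a value within tol above an integer maps to that integer.
safe_min_count <- function(support, n, tol = 1e-9) {
  as.integer(ceiling(support * n - tol))
}

safe_min_count(min_support, n_transactions)  # 21
safe_min_count(0.07, n_transactions)         # 22, since 0.07 * 303 = 21.21
```

The tolerance only absorbs representation error. A threshold that is genuinely above the boundary, such as 0.07 here, still maps to the next higher count.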