rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Difference between fpgrowth and fpmax not documented #1030

Open emilianomm opened 1 year ago

emilianomm commented 1 year ago

Describe the documentation issue

Hi. I'm using the library to find association rules in a dataset. To do that, I'm passing the output of the three algorithms to the association_rules() function. The documentation says these are equivalent in terms of parameters and output, but I'm getting the following error only with the output from fpmax():

KeyError: 'frozenset({120})You are likely getting this error because the DataFrame is missing  antecedent and/or consequent  information. You can try using the  `support_only=True` option'

A minimal example of my implementation:

from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import fpgrowth
from mlxtend.frequent_patterns import fpmax

### Assume baskets_matrix is an ad hoc pandas df.

### This works OK
freq_items_1 = fpgrowth(baskets_matrix, min_support=0.1)
freq_items_2 = fpmax(baskets_matrix, min_support=0.1)

### This also works OK
AR_1 = association_rules(freq_items_1, metric="confidence", min_threshold=0.5)

### This raises the error
AR_2 = association_rules(freq_items_2, metric="confidence", min_threshold=0.5)

Since all other factors are the same, I have to assume that there is a difference in the output of fpgrowth and fpmax which is not clearly documented.

I also noticed that the documentation refers to the association_rules() function as generate_rules(), which leads to further confusion.

Suggest a potential improvement or addition

Could you clarify whether the outputs of the different algorithms are indeed different, or whether there is another issue here?

Also, I think it would be useful for anyone using the library to have these remarks added to the documentation.

Thanks in advance!

Jordenjj commented 1 year ago

As per the documentation: "FP-Max is a variant of FP-Growth, which focuses on obtaining maximal itemsets. An itemset X is said to maximal if X is frequent and there exists no frequent super-pattern containing X. In other words, a frequent pattern X cannot be sub-pattern of larger frequent pattern to qualify for the definition maximal itemset." That being said, I am getting the error too when using FP-Max.
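To make the quoted definition concrete, here is a minimal pure-Python sketch that filters a set of frequent itemsets down to the maximal ones. It does not use mlxtend; the helper name `maximal_itemsets` and the example items are hypothetical, chosen only for illustration.

```python
def maximal_itemsets(frequent):
    """Keep only the itemsets that have no frequent proper superset."""
    return {
        itemset
        for itemset in frequent
        # `<` on frozensets tests for a proper subset
        if not any(itemset < other for other in frequent)
    }


# Hypothetical frequent itemsets, as a full miner such as FP-Growth might report them.
frequent = {
    frozenset({"a"}),
    frozenset({"b"}),
    frozenset({"c"}),
    frozenset({"a", "b"}),
}

maximal = maximal_itemsets(frequent)
# {"a"} and {"b"} are dropped because {"a", "b"} is also frequent;
# {"c"} stays because no frequent superset contains it.
```

Under this definition, the singletons that are contained in a larger frequent itemset no longer appear in the result, which would be consistent with a lookup for a single item such as `frozenset({120})` failing on FP-Max output.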

josejub commented 1 year ago

Same here: mining frequent itemsets with fp-growth works fine, but with fp-max I get the same error. An example of my code:

# Assume negated is a one-hot encoded dataframe

max = fpmax(negated, min_support=0.3, use_colnames=True, max_len=5)
max_rules = association_rules(max, metric="confidence", min_threshold=0.85)  # Error appears here

# Works well

max = fpgrowth(negated, min_support=0.3, use_colnames=True, max_len=5)
max_rules = association_rules(max, metric="confidence", min_threshold=0.85)
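A likely explanation for the reports above is that confidence needs the support of the antecedent on its own, and a table containing only maximal itemsets does not carry it. The sketch below reproduces that situation in pure Python with made-up support values. It is not mlxtend's implementation; the `confidence` helper and both dictionaries are assumptions for illustration.

```python
def confidence(support, antecedent, consequent):
    """confidence(A -> C) = support(A | C) / support(A)."""
    return support[antecedent | consequent] / support[antecedent]


# Hypothetical support tables keyed by itemset.
all_frequent = {  # every frequent itemset, as from a full miner
    frozenset({120}): 0.5,
    frozenset({7}): 0.5,
    frozenset({120, 7}): 0.25,
}
only_maximal = {  # maximal itemsets only
    frozenset({120, 7}): 0.25,
}

antecedent, consequent = frozenset({120}), frozenset({7})

# Works: the support of the antecedent {120} is available.
conf_full = confidence(all_frequent, antecedent, consequent)

# Fails: {120} is not maximal, so its support is missing from the table.
try:
    confidence(only_maximal, antecedent, consequent)
    lookup_failed = False
except KeyError:
    lookup_failed = True
```

If this is indeed the cause, the `support_only=True` option named in the error message seems intended for exactly this incomplete-input case, at the price of not computing metrics such as confidence. Mining with fpgrowth instead keeps the subset supports that those metrics need.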