rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

How to check Association rules has how many hits on new records? #343

Open viciaky opened 6 years ago

viciaky commented 6 years ago

As the issue title says, I split the transaction records into train and test sets and derived association rules from the train set. How can I check how many records in the test set are hit by these rules? Is there a function in the package I can use?

rasbt commented 6 years ago

Since association rule mining is more of an "unsupervised" learning task, there's currently no API for handling training and test sets separately. However, looking up entries in a second dataset (or test dataset) that are covered by a rule should be possible. It's a bit clunky at the moment.

Let's say you have generated the following rules from a training dataset:

Generate rules


import pandas as pd
from mlxtend.preprocessing import OnehotTransactions
from mlxtend.frequent_patterns import apriori

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

oht = OnehotTransactions()
oht_ary = oht.fit(dataset).transform(dataset)
df_train = pd.DataFrame(oht_ary, columns=oht.columns_)

frequent_itemsets_train = apriori(df_train, min_support=0.6, use_colnames=True)

frequent_itemsets_train

[Screenshot: the frequent_itemsets_train data frame]

from mlxtend.frequent_patterns import association_rules

training_rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

training_rules

[Screenshot: the training_rules data frame]

Selecting rules

Now, let's say we are interested in the rule in the 2nd row (index position 1):

training_rules.values[1]
array([frozenset({'Eggs'}), frozenset({'Kidney Beans'}), 0.8, 1.0, 1.0], dtype=object)

Note that these are frozensets, which makes the algorithm more efficient during evaluation. To join the antecedent (first item) and consequent (second item) into one set, we can do something like the following:

# join sets
itemset = set(training_rules.values[1][0]) | set(training_rules.values[1][1]) 
itemset
{'Eggs', 'Kidney Beans'}

Looking up the itemset of interest in the test set

Next, let's assume we have a test set that is formatted similarly to the frequent_itemsets_train set from earlier. You may notice that the frequent itemsets are lists in these arrays (this could maybe be changed to frozensets in the future for efficiency). Thus, to look up the itemset from the previous step among the itemsets contained in the data frame, we need to convert the itemsets from list to set representations first:

frequent_itemsets_test['itemsets'] = frequent_itemsets_test['itemsets'].apply(lambda x: set(x))
frequent_itemsets_test

[Screenshot: frequent_itemsets_test with the itemsets column converted to sets]

Finally, we can identify the itemset that matches our rule as follows:

frequent_itemsets_test[frequent_itemsets_test['itemsets'] == itemset]

[Screenshot: the matching row of frequent_itemsets_test]

Since the support for this itemset is 0.6, 60% of the data instances in the test set match this pattern. To compare the training and test sets, you would ideally want to see the same support in both, I'd say.
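
As an alternative to going through frequent itemsets, the hits can also be counted directly on the raw test transactions, which avoids depending on the itemset clearing min_support in the test set. The following is only a minimal sketch in plain Python, not part of the mlxtend API: the helper name rule_hits, the test_transactions list, and the rules list are all hypothetical and made up for illustration.

```python
# Hypothetical test transactions (raw item lists, not one-hot encoded).
test_transactions = [
    ['Milk', 'Kidney Beans', 'Eggs'],
    ['Onion', 'Kidney Beans', 'Eggs', 'Yogurt'],
    ['Milk', 'Yogurt'],
    ['Corn', 'Kidney Beans'],
    ['Eggs', 'Kidney Beans', 'Nutmeg'],
]

# Rules as (antecedent, consequent) pairs of frozensets, mirroring the
# first two entries of each row in training_rules.values.
rules = [
    (frozenset({'Eggs'}), frozenset({'Kidney Beans'})),
    (frozenset({'Onion'}), frozenset({'Eggs'})),
]


def rule_hits(rules, transactions):
    """Count, per rule, the transactions containing antecedent and consequent."""
    baskets = [set(t) for t in transactions]
    results = []
    for antecedent, consequent in rules:
        itemset = antecedent | consequent
        # Transactions in which the rule applies (antecedent present).
        covered = sum(1 for b in baskets if antecedent <= b)
        # Transactions in which the whole rule holds.
        hits = sum(1 for b in baskets if itemset <= b)
        results.append({
            'antecedent': antecedent,
            'consequent': consequent,
            'hits': hits,
            'support': hits / len(baskets),
            # Confidence is undefined when the antecedent never occurs.
            'confidence': hits / covered if covered else None,
        })
    return results


results = rule_hits(rules, test_transactions)
for r in results:
    print(sorted(r['antecedent']), '->', sorted(r['consequent']),
          r['hits'], r['support'], r['confidence'])
```

The support and confidence computed this way on the test set can then be compared against the values the same rule had on the training set.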

I hope this helps!

(I am going to leave this issue open for now, since I think a cleaned-up version of this would be worth adding to the documentation in the future)

viciaky commented 6 years ago

Thank you very much! Yesterday I was trying to do something like what you described. This really helps me a lot. 👍

lucatoldo commented 5 years ago

The issue raised by @viciaky is very relevant, so more thought and an API to support benchmarking the results of association rules would be really useful.

Perhaps there is a typo in your code above: the line association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7) should instead be association_rules(frequent_itemsets_train, metric="confidence", min_threshold=0.7)

Furthermore, you might want to add a line showing how to generate frequent_itemsets_test['itemsets'].

Tanishk-Sharma commented 3 years ago

Thanks, @rasbt, for the instructions on how to look up the itemsets.

My issue: Looking up antecedents and consequents in an association_rules dataframe was a headache, even when using .astype('str').str.contains('(Something)') ...

The antecedents and consequents columns are of type object, and association_rule['antecedents'].astype('str') converts them to strings like frozenset({'Foo', 'Bar'})

What I did:

association_rule['antecedents'] = association_rule['antecedents'].apply(lambda x: set(x))
association_rule['consequents'] = association_rule['consequents'].apply(lambda x: set(x))

Is there a better way for accessing these now?

rasbt commented 3 years ago

Thanks for the feedback. The API is still unchanged and based on frozensets; that's for efficiency reasons. Your approach using .apply(lambda x: set(x)) seems to be the most reasonable right now.
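
For lookups specifically, it may help that frozensets already support membership, equality, and subset tests, so string conversion is not strictly needed. The following is a rough sketch in plain Python; the rule records and the basket are hypothetical stand-ins for rows of the data frame returned by association_rules.

```python
# Hypothetical rule records standing in for rows of an association_rules
# data frame (only the two frozenset columns are modeled here).
rules = [
    {'antecedents': frozenset({'Eggs'}),
     'consequents': frozenset({'Kidney Beans'})},
    {'antecedents': frozenset({'Onion', 'Eggs'}),
     'consequents': frozenset({'Kidney Beans'})},
    {'antecedents': frozenset({'Yogurt'}),
     'consequents': frozenset({'Kidney Beans'})},
]

# Membership test: rules whose antecedent contains a given item.
with_eggs = [r for r in rules if 'Eggs' in r['antecedents']]

# Equality test: a frozenset compares equal to a set with the same elements.
exact = [r for r in rules if r['antecedents'] == {'Eggs'}]

# Subset test: rules whose antecedent is fully contained in a basket.
basket = {'Eggs', 'Onion', 'Milk'}
applicable = [r for r in rules if r['antecedents'] <= basket]

print(len(with_eggs), len(exact), len(applicable))
```

On an actual data frame, the same predicates could presumably be passed to Series.apply to build a boolean mask, for example association_rule['antecedents'].apply(lambda s: 'Eggs' in s).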