Open viciaky opened 6 years ago
Since association rule mining is more of an "unsupervised" learning task, there's currently no API for handling training and testing separately. However, looking up entries in a second dataset (or test dataset) that are covered by a rule should be possible. It's a bit clunky at the moment.
Let's say you have generated the following rules from a training dataset:
import pandas as pd
from mlxtend.preprocessing import OnehotTransactions
from mlxtend.frequent_patterns import apriori
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
oht = OnehotTransactions()
oht_ary = oht.fit(dataset).transform(dataset)
df_train = pd.DataFrame(oht_ary, columns=oht.columns_)
frequent_itemsets_train = apriori(df_train, min_support=0.6, use_colnames=True)
frequent_itemsets_train
from mlxtend.frequent_patterns import association_rules
training_rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
training_rules
Now, let's say we are interested in looking at the rule in the 2nd row (index position 1):
training_rules.values[1]
array([frozenset({'Eggs'}), frozenset({'Kidney Beans'}), 0.8, 1.0, 1.0], dtype=object)
Note that these are frozensets to make the algorithm more efficient during evaluation. To join the antecedent (first item) and consequent (second item) into a single set, we can do something like the following:
# join sets
itemset = set(training_rules.values[1][0]) | set(training_rules.values[1][1])
itemset
{'Eggs', 'Kidney Beans'}
Next, let's assume we have a test set that is formatted similarly to the frequent_itemsets_train frame from earlier. You may notice that the frequent itemsets are stored as lists in these arrays (this could maybe be changed to frozensets in the future for efficiency). Thus, to look up the itemset from the previous step among the itemsets contained in the data frame, we need to convert the itemsets from list to set representations first:
frequent_itemsets_test['itemsets'] = frequent_itemsets_test['itemsets'].apply(lambda x: set(x))
frequent_itemsets_test
Finally, we can identify the itemset that matches our rule as follows:
frequent_itemsets_test[frequent_itemsets_test['itemsets'] == itemset]
Since the support for this itemset is 0.6, it means that 60% of the data instances in the test set match this pattern. When comparing training and test sets, you ideally want to see the same support in both, I'd say.
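The lookup above only finds itemsets that cleared the minimum support threshold on the test data. As a rough sketch of a more direct check, the support of a rule's itemset can be computed straight from a one-hot encoded test frame. The `test_dataset` transactions and the hand-rolled encoding below are assumptions made for illustration; they are not part of the original example.

```python
import pandas as pd

# Hypothetical test transactions (made up for this sketch)
test_dataset = [['Milk', 'Eggs', 'Kidney Beans'],
                ['Eggs', 'Kidney Beans', 'Yogurt'],
                ['Onion', 'Kidney Beans'],
                ['Milk', 'Eggs'],
                ['Eggs', 'Kidney Beans', 'Corn']]

# One-hot encode by hand: one boolean column per item, one row per transaction
items = sorted({item for transaction in test_dataset for item in transaction})
df_test = pd.DataFrame(
    [[item in transaction for item in items] for transaction in test_dataset],
    columns=items,
)

# Itemset taken from the rule of interest (antecedent joined with consequent)
itemset = {'Eggs', 'Kidney Beans'}

# A transaction matches if every item of the itemset is present;
# the mean of the boolean mask is the fraction of matching transactions
test_support = df_test[sorted(itemset)].all(axis=1).mean()
print(test_support)
```

The resulting value can then be compared with the support reported for the same rule on the training data.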
I hope this helps!
(I am going to leave this issue open for now, since I think a cleaned-up version of this would be a worthwhile addition to the documentation in the future.)
Thank you very much! Yesterday I was trying to do something like what you described. This really helps me a lot. 👍
The issue raised by @viciaky is very relevant, so more thought, and an API to support benchmarking the results of association rules, would be really useful.
Perhaps in your code above there is a typo: the following line
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
should be instead
association_rules(frequent_itemsets_train, metric="confidence", min_threshold=0.7)
Furthermore, you might want to add a line showing how to generate frequent_itemsets_test['itemsets'].
Thanks, @rasbt, for the instructions on how to look up the itemsets.
My issue: looking up antecedents and consequents in an association_rules dataframe was a headache, even when using .astype('str').str.contains('(Something)') ...
The antecedents and consequents columns are of type object, and association_rule['antecedents'].astype('str') converts them to strings such as "frozenset({'Foo', 'Bar'})".
What I did:
association_rule['antecedents'] = association_rule['antecedents'].apply(lambda x: set(x))
association_rule['consequents'] = association_rule['consequents'].apply(lambda x: set(x))
Is there a better way for accessing these now?
Thanks for the feedback. The API is still unchanged and based on frozensets, for efficiency reasons. Your approach using .apply(lambda x: set(x)) seems to be the most reasonable right now.
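For lookups specifically, one possible alternative to string matching is to test set membership on the frozensets themselves. This is a minimal sketch and not an mlxtend feature: the `rules` frame below is built by hand with only the antecedents and consequents columns, and `wanted` is a hypothetical name for the items being searched for.

```python
import pandas as pd

# Hand-built stand-in for an association_rules result (illustrative values only)
rules = pd.DataFrame({
    'antecedents': [frozenset({'Eggs'}),
                    frozenset({'Onion', 'Eggs'}),
                    frozenset({'Milk'})],
    'consequents': [frozenset({'Kidney Beans'}),
                    frozenset({'Kidney Beans'}),
                    frozenset({'Kidney Beans'})],
})

# Items that must all appear in the antecedent (hypothetical query)
wanted = {'Eggs'}

# frozensets support subset tests directly, so no conversion to str or set is needed
mask = rules['antecedents'].apply(lambda antecedent: wanted.issubset(antecedent))
matches = rules[mask]
print(matches)
```

Because the comparison runs on the set objects, it does not depend on how a frozenset happens to be printed.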
As the issue title says, I split the transaction records into train and test sets and derived association rules from the train set. How can I check how many records in the test set have been hit by these rules? Is there any function in the package I can use?
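Without a dedicated function, one way to approximate this is to count matches by hand. The sketch below is plain Python and makes several assumptions: the rules are given as (antecedent, consequent) pairs of frozensets (with a real rules dataframe these could be read from its antecedents and consequents columns), `test_transactions` is made-up data, and `rule_hits` is a hypothetical helper, not part of mlxtend.

```python
# Hypothetical rules as (antecedent, consequent) pairs
rules = [
    (frozenset({'Eggs'}), frozenset({'Kidney Beans'})),
    (frozenset({'Onion'}), frozenset({'Eggs'})),
]

# Hypothetical test transactions
test_transactions = [
    ['Milk', 'Eggs', 'Kidney Beans'],
    ['Eggs', 'Kidney Beans', 'Yogurt'],
    ['Onion', 'Kidney Beans'],
    ['Milk', 'Eggs'],
    ['Eggs', 'Kidney Beans', 'Corn'],
]


def rule_hits(rule, transactions):
    """Return (covered, hit) counts for one rule.

    covered: transactions containing the antecedent
    hit: covered transactions that also contain the consequent
    """
    antecedent, consequent = rule
    covered = [set(t) for t in transactions if antecedent <= set(t)]
    hit = [t for t in covered if consequent <= t]
    return len(covered), len(hit)


for rule in rules:
    covered, hit = rule_hits(rule, test_transactions)
    # Confidence on the test set; undefined if the antecedent never occurs
    confidence = hit / covered if covered else float('nan')
    print(sorted(rule[0]), '->', sorted(rule[1]), covered, hit, confidence)

# Number of test transactions fully matched by at least one rule
hit_by_any = sum(
    any((antecedent | consequent) <= set(t) for antecedent, consequent in rules)
    for t in test_transactions
)
print(hit_by_any)
```

The per-rule ratio of hit to covered is the rule's confidence measured on the test set, which can be compared with the confidence obtained on the training set.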