rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

mlxtend fpgrowth and association rules with the existence of missing values #1004

Open zazass8 opened 1 year ago

zazass8 commented 1 year ago

Describe the workflow you want to enable

I am trying to apply the fpgrowth algorithm to a dataset I am working on. I know that fpgrowth only accepts input in binary format (e.g., True/False or 1s/0s). My dataset is in binary format, but it contains some missing values. I can't simply drop the rows with missing values, because that would also discard useful binary data points in the other columns of those rows.
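To illustrate, here is a toy example of the kind of input I mean (the item names are made up); the current input validation rejects it because of the NaNs:

```python
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth

# One-hot/binary transaction data with missing entries.
df = pd.DataFrame(
    {
        "bread": [1, 0, 1, np.nan, 1],
        "milk": [1, 1, np.nan, 0, 1],
        "eggs": [0, 1, 1, 1, np.nan],
    }
)

try:
    fpgrowth(df, min_support=0.5, use_colnames=True)
except ValueError as err:
    print(err)  # values other than 0/1/True/False are rejected
```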

Describe your proposed solution

What I thought of doing is to compute the support and confidence metrics based on the frequency of the items that exist in the binary map, while ignoring the missing values. I believe this is doable, and I found how it's done in this paper: https://weber.itn.liu.se/~aidvi/courses/06/dm/papers/AssociationRules/RAR(98).pdf. The formulas for support and confidence are somewhat different: essentially, the count of missing values is subtracted from the denominator of the support/confidence. The paper also introduces a new metric called 'representativity', an additional criterion that helps determine which association rules are the strongest.

Describe alternatives you've considered, if relevant

I tried to make a few edits to the open-source code you have on GitHub, but I got lost while reading it; to be honest, the code is quite abstract and uses a lot of recursion.

If it is possible for you to help me make these edits, I would very much appreciate it.

Additional context

zuari1993 commented 1 year ago

Hi, it would be great if you could explain the exact formula mentioned in the paper and the exact changes to be made. I would be more than happy to take this up and make the changes.

I just don't have the time to read through the paper and identify the needed changes. Thanks.

zazass8 commented 1 year ago

Okay, so the old definition of support was the proportion of transactions containing an item out of all transactions. Now it is given as

$$\mathrm{support}(X) = \frac{|B_X|}{|B| - |Dis(X)|}$$

where $|B_X|$ is the count of item X, $|B|$ is the count of transactions in the whole database, and $|Dis(X)|$ is the count of null values in the column of item X.

For confidence, given a rule X → Y, the old definition was the proportion of the count of X and Y occurring together over the count of X. The new definition is

$$\mathrm{confidence}(X \rightarrow Y) = \frac{|B_{XY}|}{|B_X| - |Dis(Y) \cap B_X|}$$

where $|B_{XY}|$ is the count of X and Y occurring together, $|B_X|$ is the count of X, and $|Dis(Y) \cap B_X|$ is the count of transactions where the value of item Y is null AND item X is present.
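To make the two formulas concrete, here is a small standalone sketch in pandas (single-item antecedents/consequents only; the function names are mine, not mlxtend's):

```python
import pandas as pd

def support_ignoring_nan(df, item):
    # support(X) = |B_X| / (|B| - |Dis(X)|)
    b_x = (df[item] == 1).sum()        # transactions containing X
    dis_x = df[item].isna().sum()      # transactions where X is unknown
    return b_x / (len(df) - dis_x)

def confidence_ignoring_nan(df, antecedent, consequent):
    # confidence(X -> Y) = |B_XY| / (|B_X| - |Dis(Y) ∩ B_X|)
    in_x = df[antecedent] == 1
    b_xy = (in_x & (df[consequent] == 1)).sum()
    dis_y_in_x = (in_x & df[consequent].isna()).sum()
    return b_xy / (in_x.sum() - dis_y_in_x)
```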

What I have done so far:

1) fpcommon.py:

In the valid_input_check(df) method, I commented out everything from the line that defines 'all_bools' to the end of the method, so that the API also accepts DataFrames containing NaN values; i.e., the code enforcing the boolean check is disabled (see the sketch below).
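Instead of deleting the check outright, a relaxed version could look roughly like this (my own sketch, not the actual mlxtend code):

```python
import numpy as np

def valid_input_check_allowing_nan(df):
    # Accept 0/1/True/False plus NaN (missing); reject anything else.
    values = df.values.astype(float)   # True/False become 1.0/0.0
    known = values[~np.isnan(values)]
    if not np.isin(known, (0.0, 1.0)).all():
        raise ValueError(
            "DataFrame entries must be 0/1/True/False, or NaN for missing."
        )
```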

In the setup_fptree(df, min_support) method, I added code right above the line item_support = item_support.reshape(-1) that creates a copy of the original array and replaces the null values with a 1 and everything else with a NaN, to make it easier to compute the support for each item. Keep in mind that this is per item, and that I return this disabled array in the last line of the function (roughly as sketched below).
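As a standalone sketch of the idea (not the exact lines in setup_fptree):

```python
import numpy as np
import pandas as pd

# A binary frame with NaNs, as in the toy example earlier.
df = pd.DataFrame({"bread": [1, 0, np.nan], "milk": [1, np.nan, 0]})
X = df.values.astype(float)

# "disabled": 1 where the original entry was NaN, NaN everywhere else.
disabled = np.where(np.isnan(X), 1, np.nan)

# Per-item support with the missing entries removed from the denominator:
# support(X) = |B_X| / (|B| - |Dis(X)|)
nan_counts = np.nansum(disabled, axis=0)            # |Dis(X)| per column
item_support = np.nansum(X, axis=0) / (X.shape[0] - nan_counts)
```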

In the generate_itemsets(generator, disabled, num_itemsets, colname_map) method, I take the disabled array from before as an additional argument and made some modifications to the for loop. This method generates the itemsets that exceed the minimum support threshold AFTER the FP-tree is constructed. My change: if the set consists of a single item, compute its support as in setup_fptree; otherwise, for sets with more than one item, count the transactions in which at least one of the items is null, and deduct that count from the denominator to obtain the final support (sketched below).
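The multi-item computation I have in mind, as a standalone sketch (the helper name and bookkeeping are mine):

```python
import numpy as np

def itemset_support(X, disabled, itemset):
    # X: float array of 1/0/NaN; disabled: 1 where missing, NaN elsewhere;
    # itemset: a tuple of column indices.
    cols = list(itemset)
    # Transactions in which at least one item of the set is missing:
    any_missing = (disabled[:, cols] == 1).any(axis=1)
    # Transactions in which every item of the set is present (== 1):
    all_present = np.nan_to_num(X[:, cols], nan=0.0).min(axis=1) == 1
    count = np.sum(all_present & ~any_missing)
    # Deduct the missing-transaction count from the denominator:
    return count / (X.shape[0] - np.sum(any_missing))
```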

2) fpgrowth.py:

Here, I just pass the disabled array as an additional argument to the setup_fptree and generate_itemsets calls.

3) association_rules.py:

In this file I haven't really done much, because I wasn't sure it would be correct. All I did was edit the definitions of confidence and lift: I included the disabled counts for C and A as arguments and subtracted them from the denominators. For lift, as you know, the formula is similar to the one for confidence, except it is additionally divided by |B_Y|. For the new definition, I thought of using |Dis(X) ∩ B_Y| there, which is the counterpart of |B_X| in the confidence definition.

But I wasn't really sure how to do it.
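Here is only a rough sketch of the definitions I was attempting (all names are mine, and the lift reading in particular is a guess):

```python
def confidence_adj(b_xy, b_x, dis_y_in_bx):
    # confidence(X -> Y) = |B_XY| / (|B_X| - |Dis(Y) ∩ B_X|)
    return b_xy / (b_x - dis_y_in_bx)

def lift_adj(b_xy, b_x, b_y, n_rows, dis_y_in_bx, dis_x_in_by):
    # One possible reading: the consequent support that confidence is
    # divided by gets the analogous correction, |Dis(X) ∩ B_Y|, in its
    # denominator.
    support_y = b_y / (n_rows - dis_x_in_by)
    return confidence_adj(b_xy, b_x, dis_y_in_bx) / support_y
```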

The very last thing I did was to add a dis variable inside the final for loops, which I haven't properly defined anywhere yet. It is meant to hold the disabled counts for each case, but I wasn't sure how to define it. I then append it to the rule_supports lists accordingly at the very end.
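For what it's worth, one possible definition of those counts, matching the |Dis(Y) ∩ B_X| term from the confidence formula (purely a suggestion):

```python
import pandas as pd

def dis_count(df, antecedent, consequent):
    # Number of transactions where every antecedent item is present but
    # at least one consequent item is missing, i.e. |Dis(C) ∩ B_A|.
    in_a = (df[list(antecedent)] == 1).all(axis=1)
    missing_c = df[list(consequent)].isna().any(axis=1)
    return int((in_a & missing_c).sum())
```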

All of the above is code I haven't really tested yet, which means it may be wrong in places. It's really confusing to work through this while also accounting for the missing values.

But the task sounds simple: apply fpgrowth while excluding the count of missing values.

It gets very complicated when itemsets with multiple items are generated. And we have to apply the correction everywhere, because the algorithm is supposed to filter out itemsets whose support is lower than the given threshold. The same goes for generating the rules.

But anyway, I believe we can get this through. Let me know if you have any further questions, and I would really appreciate your time and effort in helping me with this.

rasbt commented 1 year ago

I can see that dealing with itemsets that have missing values may be a common problem, and I am open to modifications. I currently don't have time to look at the paper in detail, but based on your description, it sounds like the general fpgrowth algorithm remains the same, and the change is primarily in how the existing metrics are computed, plus the new "representativity" metric?

Before getting to that, I think a key consideration is also how to represent missing values. Given that the current version works with bool data types, we probably have to change certain things. I don't think pandas' plain NumPy-backed bool dtype can represent NaN values, so maybe we would have to change it to an integer representation, which could have performance implications.

I think the first step here is to think about what the input datatype and interface would look like. Tricky 🤔
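For reference, NumPy's plain bool dtype has no slot for missing values, while pandas' nullable "boolean" extension dtype does accept pd.NA (with its own trade-offs), e.g.:

```python
import numpy as np
import pandas as pd

print(np.array([True, False]).dtype)  # bool; no missing-value slot

# pandas' nullable BooleanDtype is backed by a mask, not a plain bool array:
s = pd.Series([True, False, pd.NA], dtype="boolean")
print(s.isna().sum())  # 1
```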

zazass8 commented 1 year ago

Yes, exactly: the algorithm remains the same apart from how the metrics are computed. And yes, it includes the representativity rule as well.

As I mentioned above, I commented out some lines in the valid_input_check(df) function where it checks whether the data in the df are of boolean or binary type. I believe that commenting that out should be fine? But as the author of the code you know this better, and maybe your suggestion is better.