tommyod / Efficient-Apriori

An efficient Python implementation of the Apriori algorithm.
MIT License

run time #30

Closed maxsombra closed 4 years ago

maxsombra commented 4 years ago

In Python 3.7, with a file of 939 rows and 50 characteristics, the algorithm takes more than 2 hours, and I have seen this algorithm run much faster with more data. Do you know what the problem is?

These are the libraries I have installed:

[screenshot of installed packages]

tommyod commented 4 years ago

Try setting min_support to a high value (e.g. 0.99); then it should run fast. The downside is that few rules will be returned. Decrease the value gradually and re-run until you strike a balance between runtime and the number of rules returned. You can also experiment with min_confidence.

The problem is likely that there are many frequent itemsets in your particular data set; it has little to do with the packages you have installed.
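
For what it's worth, a quick sweep like the sketch below is one way to find that balance. The transactions here are toy data, purely to illustrate the pattern:

import time
from efficient_apriori import apriori

# Toy transactions purely for illustration; substitute your own data
transactions = [
    ("eggs", "bacon", "soup"),
    ("eggs", "bacon", "apple"),
    ("soup", "bacon", "banana"),
] * 300

# Start high and lower min_support until runtime vs. number of rules is acceptable
for min_support in (0.99, 0.9, 0.7, 0.5, 0.3):
    start = time.perf_counter()
    itemsets, rules = apriori(transactions, min_support=min_support, min_confidence=0.5)
    elapsed = time.perf_counter() - start
    print(f"min_support={min_support}: {len(rules)} rules in {elapsed:.2f} s")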

Please report back if it works! :+1:

pythobot commented 4 years ago

Hi Tommy,

We have been testing your library, iterating through different possible values for min_support and min_confidence to see if the runtime changes, without success. This was our last test: min_support=0.001, min_confidence=0.001, max_length=30.

Execution still takes a long time for a dataset of 930 transactions and 181 columns.

However, we are comparing against an old version of your library (==0.4.5 according to pip) in a Python 3.4.3 virtualenv that was installed a year ago and has sat untouched until today, when we needed to run apriori again. In that environment the library runs fine and fast (10 seconds at most).

When setting up a new environment, however, and installing and testing versions one by one, execution is still pretty slow. We have tried Python 3.4, 3.6, and 3.7 without success.

Somehow only the version we installed back then still performs well; all the others seem to have degraded performance really significantly.

We have tested on various computers, the fastest being a fresh Ubuntu 18.04 install with 30 GB of RAM and a 12-core processor, and it still takes a pretty long time on what seems to be the small dataset described above.

Here comes a strange finding: when installing efficient-apriori==0.4.5 today, execution is still slow; it does not behave exactly like the old install. The code for version 0.4.5 at https://github.com/tommyod/Efficient-Apriori/tree/7ea0a1c2265249c8cb9a66d31055099f40231bf2/efficient_apriori looks completely different from what we have in our old install, and we can't seem to find that code anywhere in the repo's history.

Any thoughts? Thanks!

tommyod commented 4 years ago

Thank you for the detailed report. This is a bit strange, as I cannot recall any changes that would degrade performance. I'll look into it. Do you have the opportunity to share the data set? (Perhaps with values mapped to integers to obfuscate it.)

tommyod commented 4 years ago

"when installing efficient-apriori==0.4.5 today, execution is still slow; it does not behave exactly like the old install."

This is also very weird. Do you have the old install on a computer? Can you email the source code to me? My email is found here.

tommyod commented 4 years ago

There are some choices to be made with missing data, and for making the algorithm run fast enough. Here are my thoughts.

Missing values

I took a look at your adult data, and there you do know all the variable values for each transaction. But what if you don't actually know all the values?

You have two options, depending on the nature of the missing data: either drop the missing values, so that each transaction only contains the observed items (Option 1), or encode the missing values explicitly, so that missingness itself can show up in the rules (Option 2).

Here's a way to encode the missing values as explained in Option 2:

import pandas as pd
from efficient_apriori import apriori

# The data set has missing values
df = pd.read_csv("data.csv") # Data has shape approximately (1000, 60)

# Get the column names as strings
columns = list(df.columns)

transactions = []  # Prepare a list of transactions

# Loop over every row in the DataFrame to collect transactions
for i in range(len(df)):

    # Encode values as e.g., 'col1=N/A' and 'col2=male'.
    # We include column names to differentiate between values such as
    # 'col1=N/A' and 'col2=N/A', since they might mean different things
    # in the analysis (and might not be missing at random)
    values = list(str(t).strip() for t in df.iloc[i, :].fillna("N/A").values)
    transaction = tuple(f"{column}={value}" for column, value in zip(columns, values))

    # Add transaction to transactions
    transactions.append(transaction)

itemsets, rules = apriori(transactions, min_support=0.8,
                          min_confidence=0.5, max_length=6, verbosity=1)
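
For contrast, here's a sketch of Option 1, simply dropping the missing values so that each transaction only contains the observed items (this assumes the same hypothetical data.csv as above):

import pandas as pd
from efficient_apriori import apriori

df = pd.read_csv("data.csv")  # Same data set as above
columns = list(df.columns)

transactions = []
for i in range(len(df)):
    row = df.iloc[i, :]
    # Skip missing entries entirely, so transactions have varying lengths
    transaction = tuple(
        f"{column}={str(value).strip()}"
        for column, value in zip(columns, row.values)
        if pd.notna(value)
    )
    transactions.append(transaction)

itemsets, rules = apriori(transactions, min_support=0.8,
                          min_confidence=0.5, max_length=6, verbosity=1)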

Running time

This data has many columns, and by encoding every missing value as above, each transaction has 60 observations. This results in a combinatorial explosion in the algorithm, so it's very important that min_support is high enough to prune the search as the algorithm works. (Above I wrote that min_support should be low to make the algorithm run fast. This was wrong, and I have edited my post.)
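
To get a feel for the explosion, here's a back-of-the-envelope upper bound on the number of candidate itemsets of each size when nothing is pruned. The distinct-item count of 200 is only an assumption for illustration; the real number depends on how many values each of the 60 columns takes:

import math

n_distinct_items = 200  # assumed for illustration only
for k in range(1, 7):   # up to max_length = 6, as in the example above
    print(f"Possible itemsets of size {k}: {math.comb(n_distinct_items, k):,}")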

Here's a comparison of running time and min_support for this particular data set, which has approx 1000 transactions with 60 observations each. I set min_confidence = 0.5 and max_length = 6 in the plot below.

[plot: running time vs. min_support]

If the algorithm runs for a long time, it will return a lot of junk. To be able to interpret the results and extract meaningful insights, you want a few meaningful rules. This typically means setting min_support and min_confidence to reasonably high values, which also means a shorter running time.
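
Once the call returns, one way to whittle the output down to a few interpretable rules is to filter and sort them, e.g. by confidence and lift. Continuing from the snippet above (the 0.8 and 1.5 cut-offs are arbitrary examples, not recommendations):

# Keep only reasonably strong rules and inspect the top ten by lift
strong_rules = [r for r in rules if r.confidence > 0.8 and r.lift > 1.5]
for rule in sorted(strong_rules, key=lambda r: r.lift, reverse=True)[:10]:
    print(rule)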

I hope this helps. Let me know if you have any further questions @arnaldoanez @maxsombra.

pythobot commented 4 years ago

We were in fact testing with only the non-missing observations included, but the values we tried for min_support were just too low (less than 0.05), which is why it was taking so long; we hadn't noticed that this was simply too broad.

Thanks for the clarification; we managed to run it successfully now.

Cheers

Et9797 commented 1 year ago

I've run efficient-apriori to analyze millions of customer orders containing shop products, with min_support and min_confidence values as low as 0.0005 and 0.01, respectively. Generally it didn't take longer than 10 minutes to run. A min_support of 0.0001, however, seems to be the lower limit, as the Python process then always gets killed for some reason. 32 GB RAM, Ubuntu 22.