vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

value_counts sometimes gives correct results and sometimes not #2301

Closed ashsharma96 closed 1 year ago

ashsharma96 commented 1 year ago

Hey @JovanVeljanoski, hope you are doing well. While working with vaex I found that value_counts does not always give correct results, even though the vaex dataframe has data in it. Sometimes the wrong results appear on the first try, and sometimes only after running the same code or function 3-4 times. Sometimes it gives the correct results on the first try and seems to work fine, but after I restart the Jupyter kernel and try again it gives incorrect results. Here is the code I am running:

import vaex

def remove_duplicates(df, grouping_cols: list):
    # Tag every row with its position, then find the first position in each group.
    df["index"] = vaex.vrange(0, df.shape[0])
    df_group = df.groupby(grouping_cols, agg=vaex.agg.min("index"))
    # After the join, only the first row of each group has a non-missing index_min.
    df = df.join(df_group[["index_min"]], left_on="index", right_on="index_min")
    df = df[df.index_min.notna()]
    df = df.drop(["index", "index_min"])
    df = df.extract()
    return df

def calculateNewEngegementClassification(unique):
    # Count how often each engagement type occurs and return a plain dict.
    eng_count = unique['eng_type_new1'].value_counts().to_dict()
    return eng_count

df_new = vaex.open('error_data.hdf5')
dfInter = remove_duplicates(df_new, ['tm_cid'])
df = calculateNewEngegementClassification(dfInter)
df

Correct Results:

{
    "One_Brow": 2341,
    "One_Trans_AtRisk": 2385,
    "One_Trans_Lost": 219,
    "One_Trans_Potential": 228,
    "One_Trans_high_AtRisk": 159,
    "One_Trans_high_Lost": 1,
    "One_Trans_high_Potential": 29,
    "Rep_Brow": 1,
    "Rep_Trans_AtRisk": 76,
    "Rep_Trans_Lost": 67,
    "Rep_Trans_breakaway": 116,
    "Rep_Trans_high_AtRisk": 2,
    "Rep_Trans_high_breakaway": 12,
    "Rep_Trans_high_loyal": 521,
    "Rep_Trans_loyal": 296
}

Incorrect Results:

{
    'One_Trans_AtRisk': 2385,
    'One_Brow': 2341,
    'Rep_Trans_high_loyal': 521,
    'Rep_Trans_loyal': 296,
    'One_Trans_Potential': 228,
    'One_Trans_high_AtRisk': 159,
    'Rep_Trans_breakaway': 116,
    'Rep_Trans_high_AtRisk': 2,
    'Rep_Brow': 1,
    'Rep_Trans_AtRisk': 1,
    'One_Trans_high_Potential': 1,
    'One_Trans_high_Lost': 1,
    'Rep_Trans_Lost': 1,
    'One_Trans_Lost': 1,
    'Rep_Trans_high_breakaway': 1
}
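
Comparing the two dictionaries key by key shows which categories disagree. The following is a minimal sketch in plain Python that uses only the numbers posted above; `diff_counts` is a hypothetical helper name and is not part of vaex.

```python
# Counts copied from the "Correct Results" and "Incorrect Results" listings above.
correct = {
    "One_Brow": 2341, "One_Trans_AtRisk": 2385, "One_Trans_Lost": 219,
    "One_Trans_Potential": 228, "One_Trans_high_AtRisk": 159,
    "One_Trans_high_Lost": 1, "One_Trans_high_Potential": 29, "Rep_Brow": 1,
    "Rep_Trans_AtRisk": 76, "Rep_Trans_Lost": 67, "Rep_Trans_breakaway": 116,
    "Rep_Trans_high_AtRisk": 2, "Rep_Trans_high_breakaway": 12,
    "Rep_Trans_high_loyal": 521, "Rep_Trans_loyal": 296,
}
incorrect = {
    "One_Trans_AtRisk": 2385, "One_Brow": 2341, "Rep_Trans_high_loyal": 521,
    "Rep_Trans_loyal": 296, "One_Trans_Potential": 228,
    "One_Trans_high_AtRisk": 159, "Rep_Trans_breakaway": 116,
    "Rep_Trans_high_AtRisk": 2, "Rep_Brow": 1, "Rep_Trans_AtRisk": 1,
    "One_Trans_high_Potential": 1, "One_Trans_high_Lost": 1,
    "Rep_Trans_Lost": 1, "One_Trans_Lost": 1, "Rep_Trans_high_breakaway": 1,
}


def diff_counts(expected, actual):
    """Map each key whose counts differ to an (expected, actual) pair."""
    keys = sorted(set(expected) | set(actual))
    return {
        key: (expected.get(key), actual.get(key))
        for key in keys
        if expected.get(key) != actual.get(key)
    }


mismatches = diff_counts(correct, incorrect)
for name, (want, got) in mismatches.items():
    print(f"{name}: expected {want}, got {got}")
```

In the posted output, five categories differ, and each of them is reported with a count of 1 in the incorrect run.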

*Note: If the error doesn't show up on the first or second try, run the code at least 5-6 times; I didn't notice it at first either. The data is attached: error_data.zip. @JovanVeljanoski @maartenbreddels Can you please check whether there is an issue in your value_counts? pandas works fine. Regards,

JovanVeljanoski commented 1 year ago

Thanks for the report. I will take a look at the first opportunity. Can you update your post above to include answers to the questions we ask, such as the version? Those are important for tracking the issue.

JovanVeljanoski commented 1 year ago

I ran your example over 20 times on the latest version, under Linux, and I can't reproduce your issue. If you can provide more details, that would be great. Otherwise we can't debug what we can't reproduce.

This is what I am running:

import vaex

correct = {
    "One_Brow": 2341,
    "One_Trans_AtRisk": 2385,
    "One_Trans_Lost": 219,
    "One_Trans_Potential": 228,
    "One_Trans_high_AtRisk": 159,
    "One_Trans_high_Lost": 1,
    "One_Trans_high_Potential": 29,
    "Rep_Brow": 1,
    "Rep_Trans_AtRisk": 76,
    "Rep_Trans_Lost": 67,
    "Rep_Trans_breakaway": 116,
    "Rep_Trans_high_AtRisk": 2,
    "Rep_Trans_high_breakaway": 12,
    "Rep_Trans_high_loyal": 521,
    "Rep_Trans_loyal": 296
    }

for i in range(10):

    def remove_duplicates(df, grouping_cols: list):
        df["index"] = vaex.vrange(0, df.shape[0])
        df_group = df.groupby(grouping_cols, agg=vaex.agg.min("index"))
        df = df.join(df_group[["index_min"]], left_on="index", right_on="index_min")
        df = df[df.index_min.notna()]
        df = df.drop(["index", "index_min"])
        df = df.extract()
        return df

    def calculateNewEngegementClassification(unique):
        eng_count = unique['eng_type_new1'].value_counts().to_dict()
        return eng_count

    df_new = vaex.open('./2301-value-counts-data/error_data.hdf5')
    dfInter = remove_duplicates(df_new,['tm_cid'])
    res = calculateNewEngegementClassification(dfInter)
    print('is it correct:', res == correct)

I tried it both in Jupyter (restarting the kernel between tries) and as a plain Python script.
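
For an intermittent failure it can help to record which attempts went wrong rather than printing a boolean per run. The sketch below assumes the pipeline is wrapped in a zero-argument callable; `failing_attempts` and `make_fake_run` are hypothetical names, and the fake run only stands in for the real vaex code so the example works without the data file.

```python
from typing import Callable, Dict, Iterable, List


def failing_attempts(
    run: Callable[[], Dict[str, int]],
    expected: Dict[str, int],
    attempts: int = 20,
) -> List[int]:
    """Return the 1-based attempt numbers whose result differs from expected."""
    return [i for i in range(1, attempts + 1) if run() != expected]


def make_fake_run(expected: Dict[str, int], bad_calls: Iterable[int]):
    """Build a stand-in pipeline that returns a wrong count on the given calls."""
    bad = set(bad_calls)
    state = {"calls": 0}

    def run() -> Dict[str, int]:
        state["calls"] += 1
        if state["calls"] in bad:
            # Mimic the reported symptom: one category collapses to a count of 1.
            return {**expected, "One_Trans_Lost": 1}
        return dict(expected)

    return run


expected = {"One_Trans_Lost": 219, "One_Brow": 2341}
failed = failing_attempts(make_fake_run(expected, bad_calls=[3, 5]), expected, attempts=6)
print("failing attempts:", failed)
```

With the real pipeline, `run` would open the file, call `remove_duplicates`, and return the `value_counts` dictionary. State carried between calls in one process may differ from a fresh kernel, so an empty list here does not rule out the restart case described in the report.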

ashsharma96 commented 1 year ago

@JovanVeljanoski Here is the vaex version I'm using: {'vaex': '4.9.1', 'vaex-core': '4.9.1', 'vaex-viz': '0.5.1', 'vaex-hdf5': '0.12.1', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.1', 'vaex-jupyter': '0.7.0', 'vaex-ml': '0.17.0'}

JovanVeljanoski commented 1 year ago

That is quite a bit behind. Many issues have been fixed since then. Please update to the latest version. Also, please answer the questions in the issue template; otherwise we can't help.

ashsharma96 commented 1 year ago

@JovanVeljanoski Thank you for the quick reply. Sure, I'll keep this in mind next time. Are there any other details you need from my side?

JovanVeljanoski commented 1 year ago

Yes, everything that we ask in the template.

maartenbreddels commented 1 year ago

Yeah, in 4.12 we fixed an issue in value_counts, see https://github.com/vaexio/vaex/blob/master/CHANGELOG.md#vaex-core-4120
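
A quick way to check whether an installation predates that release is to compare version numbers numerically, not as strings. This is a minimal sketch: `parse_version` is a hypothetical helper that handles only plain `X.Y.Z` strings, and the version dictionary is copied (in part) from the comment earlier in this thread.

```python
def parse_version(text):
    """Turn a plain 'X.Y.Z' string into a tuple of ints for numeric comparison."""
    return tuple(int(part) for part in text.split("."))


# Subset of the versions reported earlier in this thread.
installed = {"vaex": "4.9.1", "vaex-core": "4.9.1"}
# vaex-core release named in the changelog link above.
fixed_in = "4.12.0"

needs_upgrade = parse_version(installed["vaex-core"]) < parse_version(fixed_in)
print("upgrade needed:", needs_upgrade)

# Plain string comparison gets this wrong, because "9" sorts after "1".
print("string comparison says older:", installed["vaex-core"] < fixed_in)
```

For versions with pre-release or other suffixes, a dedicated parser such as the one in the `packaging` library is a safer choice than this helper.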