pyjanitor-devs / pyjanitor

Clean APIs for data cleaning. Python implementation of R package Janitor
https://pyjanitor-devs.github.io/pyjanitor
MIT License

Problems with equalities in conditional_join #1342

Closed · dugarte-vox closed this 8 months ago

dugarte-vox commented 9 months ago

Brief Description

I don't know if I'm doing this correctly, but I want to apply two equality conditions and two inequality conditions (on dates). With a single equality condition no error is raised, but when I apply both equality conditions I get an error.


Minimally Reproducible Code

import pandas as pd
import janitor
import random

df1 = pd.DataFrame({
    'index1': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C'],
    'index2': [1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1],
    'date': pd.to_datetime([
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2021-10-01', '2021-11-01'
    ])
})
df1['value'] = [random.randrange(1, 50, 1) for i in range(df1.shape[0])]

df2 = pd.DataFrame({
    'index1': ['A', 'B', 'B', 'C'],
    'index2': [1, 1, 2, 1],
    'date_min': pd.to_datetime(['2024-01-01', '2024-01-01', '2024-01-01', '2021-10-01']),
    'date_max': pd.to_datetime(['2024-01-30', '2024-01-30', '2024-01-30', '2021-10-30']),
})

df1.conditional_join(
    df2,         
    ('index1', 'index1', '=='), 
    ('index2', 'index2', '=='),
    ('date', 'date_min', '>='), 
    ('date', 'date_max', '<='), 
)

Error Messages

AttributeError: 'NoneType' object has no attribute 'size'
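Until the fix lands, the same four-condition join can be expressed in plain pandas as an equality merge followed by a range filter. This is a workaround sketch using the reproduction's own data, not pyjanitor's implementation:

```python
import random

import pandas as pd

# The same df1/df2 as in the reproduction above.
df1 = pd.DataFrame({
    'index1': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C'],
    'index2': [1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1],
    'date': pd.to_datetime([
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2021-10-01', '2021-11-01',
    ]),
})
df1['value'] = [random.randrange(1, 50) for _ in range(len(df1))]

df2 = pd.DataFrame({
    'index1': ['A', 'B', 'B', 'C'],
    'index2': [1, 1, 2, 1],
    'date_min': pd.to_datetime(['2024-01-01', '2024-01-01', '2024-01-01', '2021-10-01']),
    'date_max': pd.to_datetime(['2024-01-30', '2024-01-30', '2024-01-30', '2021-10-30']),
})

# Merge on the two equality keys, then filter on the date range.
out = (
    df1.merge(df2, on=['index1', 'index2'], how='inner')
       .query('date_min <= date <= date_max')
       .drop(columns=['date_min', 'date_max'])
)
print(len(out))  # 7 rows match for this data
```

The intermediate merge is only as large as the equality keys allow, so this stays manageable unless the key groups themselves are huge.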

dugarte-vox commented 9 months ago

I tried combining both key columns into a single auxiliary column and I still get the same error.

df1['index_aux'] = (df1['index1']+'-'+df1['index2'].astype(str)).astype(str).copy()
df2['index_aux'] = (df2['index1']+'-'+df2['index2'].astype(str)).astype(str).copy()

df1.conditional_join(
    df2,         
    ('index_aux', 'index_aux', '=='),
    ('date', 'date_min', '>='), 
    ('date', 'date_max', '<='), 
)
samukweku commented 8 months ago

Apologies for the late reply, @dugarte-vox. I'll have a look at this.

samukweku commented 8 months ago

What pandas version are you using @dugarte-vox ?

samukweku commented 8 months ago

@dugarte-vox I believe this issue pops up when using pandas version >= 2.2, and it has been fixed in pyjanitor. A new release with the fix will be out soon. In the meantime, see if you can use a pandas version < 2.2.

samukweku commented 8 months ago

@ericmjl ok to do a release?

ericmjl commented 8 months ago

Ok to do so! I trust your judgment on whether to do a patch or minor 😄.

ericmjl commented 8 months ago

@samukweku I think we should increase the cadence of releases; each PR merge can probably be considered a candidate for a new release, I think. Been trialling that with llamabot and I think it’ll be ok.

samukweku commented 8 months ago

Llamabot! Can't wait to C it in action

samukweku commented 8 months ago

It will likely be a minor release.

dugarte-vox commented 8 months ago

What pandas version are you using @dugarte-vox ?

Sorry for the late response. I'm currently using pandas 2.2.1. I've tried with a lower version of pandas (2.0.3) as you suggested and the example is working.

samukweku commented 8 months ago

@ericmjl any suggestions on this failure? https://github.com/pyjanitor-devs/pyjanitor/actions/runs/8342259392

ericmjl commented 8 months ago

Yes, I think I know what's happening.

Could you do a hot fix push to main? It would be to change this line in GH Actions:

python setup.py sdist bdist_wheel

To this:

pip install -U build && python -m build -w -s

I added the pip install of build just in case. We should be using pyproject.toml now, is that right?

If not, the hotfix should be to add, one line above the setup.py line:

pip install -U setuptools

And that should do the trick.

If I were in your shoes, I would push to main directly, since this change is infra-related.

samukweku commented 8 months ago

We haven't migrated to pyproject.toml yet; need your infra expertise on the PR.

ericmjl commented 8 months ago

No problem! I will take a look at it.

ericmjl commented 8 months ago

All done, @samukweku!

samukweku commented 8 months ago

@dugarte-vox a new version of pyjanitor has been released. It should now work on pandas >= 2.2. Test it and let us know if there are any issues. By the way, out of curiosity, what is your use case for conditional_join that regular pandas could not solve?

dugarte-vox commented 8 months ago

Thanks for the help!

By the way, out of curiosity, what is your use case for conditional_join that regular pandas could not solve?

Basically, I have a function that completes missing dates on a DataFrame with groups. This is an extract of that code:

import pandas as pd
import janitor
import random

df = pd.DataFrame({
    'index1': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C'],
    'index2': [1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1],
    'date': pd.to_datetime([
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2021-10-01', '2021-11-01'
    ])
})
df['value'] = [random.randrange(1, 50, 1) for _ in range(df.shape[0])]

# Create min/max dates per group
df_group_dates = df.groupby(['index1', 'index2']).agg({'date': ['min', 'max']})
df_group_dates.columns = ["_".join(col) for col in df_group_dates.columns]
df_group_dates = df_group_dates.reset_index()

# Create a combination between all groups and all possible dates
date_min = df['date'].min()
date_max = df['date'].max()
date_range = pd.date_range(start=date_min, end=date_max, freq='D')
df_dates = pd.merge(
    df[['index1', 'index2']].drop_duplicates(),
    pd.DataFrame({'date': date_range}),
    how='cross'
)

# Join combinations with min/max dates per group
df_dates = pd.merge(
    df_dates, 
    df_group_dates[['index1', 'index2', 'date_min', 'date_max']],
    how='left',
    on=['index1', 'index2']
)
# Filter out date values outside the min/max range
df_dates = df_dates.loc[
    (df_dates['date'] >= df_dates['date_min']) &                      
    (df_dates['date'] <= df_dates['date_max']),
    ['date', 'index1', 'index2']
]

# Merge with original DataFrame
df = pd.merge(df, df_dates, on=['date', 'index1', 'index2'], how='right')

As you can see, when I merge df_dates with df_group_dates I create a much larger intermediate DataFrame that is only filtered afterwards. This sometimes causes a memory error when the DataFrame is big.
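To put a number on that memory pressure: the cross merge materialises one row per (group, date) pair before the range filter runs. With hypothetical sizes (not from this issue), the back-of-the-envelope count looks like this:

```python
# Hypothetical sizes, to illustrate why the cross merge can exhaust
# memory before the filter ever runs:
n_groups = 50_000      # distinct (index1, index2) pairs
n_days = 5 * 365       # a five-year daily date range

intermediate_rows = n_groups * n_days
print(f"{intermediate_rows:,}")  # 91,250,000 rows materialised, then filtered
```

A range-aware join avoids ever building those rows that the filter would discard.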

Now, I can replace:

# Join combinations with min/max dates per group
df_dates = pd.merge(
    df_dates, 
    df_group_dates[['index1', 'index2', 'date_min', 'date_max']],
    how='left',
    on=['index1', 'index2']
)
# Filter out date values outside the min/max range
df_dates = df_dates.loc[
    (df_dates['date'] >= df_dates['date_min']) &                      
    (df_dates['date'] <= df_dates['date_max']),
    ['date', 'index1', 'index2']
]

with:

# Join combinations with min/max date ranges per group 
df_dates = df_dates.conditional_join(
    df_group_dates,
    ('index1', 'index1', '=='),
    ('index2', 'index2', '=='),
    ('date', 'date_min', '>='),
    ('date', 'date_max', '<='),
    how='inner',
).drop(
    [('right', col) for col in ['index1', 'index2', 'date_min', 'date_max']],
    axis=1,
).droplevel(0, axis=1)

I haven't run any tests to see if this is more efficient, but at least the code is shorter and more readable.

samukweku commented 8 months ago

Great response, @dugarte-vox. It seems, though, that complete may be a good fit here for you; it makes missing rows explicit:

# pip install pyjanitor
(df
 .complete(
     {'date': lambda f: pd.date_range(f.date.min(), f.date.max(), freq='D')},
     by=['index1', 'index2'])
)

The by parameter ensures the columns are completed per group. Under the hood it is just a for loop over the groups, which may help with memory issues, but may not be great performance-wise. It is on my to-do list to improve the performance for groups; I just haven't had time to think it through, and honestly I wasn't sure anybody was using the function.
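That per-group loop can be sketched in plain pandas as reindexing each group onto its own full date range. This is a hypothetical illustration of the idea, not pyjanitor's actual implementation of complete:

```python
import pandas as pd

# Small example data (hypothetical, for illustration only).
df = pd.DataFrame({
    'index1': ['A', 'A', 'C'],
    'index2': [1, 1, 1],
    'date': pd.to_datetime(['2024-01-01', '2024-01-03', '2021-10-01']),
    'value': [10, 20, 30],
})

pieces = []
for keys, group in df.groupby(['index1', 'index2'], sort=False):
    # Build the full daily range for this group only.
    full = pd.date_range(group['date'].min(), group['date'].max(), freq='D')
    piece = (group.set_index('date')
                  .reindex(full)
                  .rename_axis('date')
                  .reset_index())
    # Restore the group keys on the newly inserted rows.
    piece['index1'] = keys[0]
    piece['index2'] = keys[1]
    pieces.append(piece)

completed = pd.concat(pieces, ignore_index=True)
print(len(completed))  # 4 rows: 2024-01-02 was made explicit for group A/1
```

Because each iteration only holds one group's date range, peak memory stays proportional to the largest group rather than to groups × dates.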

Another option using complete, which may be more performant, is a variant of your solution, where you build a large DataFrame and filter afterwards (if memory allows):

grp = df.groupby(['index1', 'index2'], sort=False)
(df
 .assign(
     date_min=grp.date.transform('min'),
     date_max=grp.date.transform('max'))
 .complete(
     ('index1', 'index2', 'date_min', 'date_max'),
     {'date': lambda f: pd.date_range(f.date.min(), f.date.max(), freq='D')})
 .query('date_min <= date <= date_max')
)

Appreciate the feedback, and if you have suggestions on how to make the functions better, the dev team is always glad and eager to hear them.