Closed dugarte-vox closed 8 months ago
I tried joining both columns into an auxiliary one and I still get the same error.
df1['index_aux'] = (df1['index1']+'-'+df1['index2'].astype(str)).astype(str).copy()
df2['index_aux'] = (df2['index1']+'-'+df2['index2'].astype(str)).astype(str).copy()
df1.conditional_join(
df2,
('index_aux', 'index_aux', '=='),
('date', 'date_min', '>='),
('date', 'date_max', '<='),
)
Apologies for the late reply @dugarte-vox I'll have a look at this
What pandas version are you using @dugarte-vox ?
@dugarte-vox I believe this issue pops up when using pandas version >= 2.2, and has been fixed in pyjanitor. A new release will be out soon ,with the fix. In the mean time see if you can pandas version < 2.2.
@ericmjl ok to do a release?
Ok to do so! I trust your judgment on whether to do a patch or minor 😄.
@samukweku I think we should increase the cadence of releases; each PR merge can probably be considered a candidate for a new release, I think. Been trialling that with llamabot and I think it’ll be ok.
Llamabot! Can't wait to C it in action
Likely be a minor release
What pandas version are you using @dugarte-vox ?
Sorry for the late response. I'm currently using pandas 2.2.1. I've tried with a lower version of pandas (2.0.3) as you suggested and the example is working.
@ericmjl any suggestions on this failure? https://github.com/pyjanitor-devs/pyjanitor/actions/runs/8342259392
Yes, I think I know what's happening.
Could you do a hot fix push to main? It would be to change this line in GH Actions:
python setup.py sdist bdist_wheel
To this:
pip install -U build && python -m build -w -s
I added the pip install of build just in case. We should be using pyproject.toml now, is that right?
If not, the hotfix should be to add, one line above the setup.py line:
pip install -U setuptools
And that should do the trick.
If I were in your shoes, I would push to main directly, since this change is infra-related.
We haven't migrated to pyproject.yaml yet, need your infra expertise on the PR
No problem! I will take a look at it.
All done, @samukweku!
@dugarte-vox a new version of pyjanitor has been released. It should be fine on pandas > 2. Test and let's know if there are any issues. By the way, out of curiosity, what is your use case for conditional_join that regular pandas could not solve?
Thanks for the help!
By the way, out of curiosity, what is your use case for conditional_join that regular pandas could not solve?
Basically, I have a function that completes missing dates on a DataFrame with groups. This is an extract of that code:
import pandas as pd
import janitor
import random
df = pd.DataFrame({
'index1': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C'],
'index2': [1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1],
'date': pd.to_datetime([
'2024-01-01', '2024-01-10', '2024-04-01',
'2024-01-01', '2024-01-10', '2024-04-01',
'2024-01-01', '2024-01-10', '2024-04-01',
'2021-10-01', '2021-11-01'
])
})
df['value'] = [random.randrange(1, 50, 1) for _ in range(df.shape[0])]
# Create min/max dates per group
df_group_dates = df.groupby(['index1', 'index2']).agg({'date': ['min', 'max']})
df_group_dates.columns = ["_".join(col) for col in df_group_dates.columns]
df_group_dates = df_group_dates.reset_index()
# Create a combination between all groups and all possible dates
date_min = df['date'].min()
date_max = df['date'].max()
date_range = pd.date_range(start=date_min, end=date_max, freq='D')
df_dates = pd.merge(
df[['index1', 'index2']].drop_duplicates(),
pd.DataFrame({'date': date_range}),
how='cross'
)
# Join combinations with min/max dates per group
df_dates = pd.merge(
df_dates,
df_group_dates[['index1', 'index2', 'date_min', 'date_max']],
how='left',
on=['index1', 'index2']
)
# Filter out date values outside the min/max range
df_dates = df_dates.loc[
(df_dates['date'] >= df_dates['date_min']) &
(df_dates['date'] <= df_dates['date_max']),
['date', 'index1', 'index2']
]
# Merge with original DataFrame
df = pd.merge(df, df_dates, on=['date', 'index1', 'index2'], how='right')
As you can see when I merge df_dates with df_group_dates I create a bigger DataFrame that is later filtered. This sometimes causes a memory error when the DataFrame is big.
Now, I can replace:
# Join combinations with min/max dates per group
df_dates = pd.merge(
df_dates,
df_group_dates[['index1', 'index2', 'date_min', 'date_max']],
how='left',
on=['index1', 'index2']
)
# Filter out date values outside the min/max range
df_dates = df_dates.loc[
(df_dates['date'] >= df_dates['date_min']) &
(df_dates['date'] <= df_dates['date_max']),
['date', 'index1', 'index2']
]
with:
# Join combinations with min/max date ranges per group
df_dates = df_dates.conditional_join(
df_group_dates,
('index1', 'index1', '=='),
('index2', 'index2', '=='),
('date', 'date_min', '>='),
('date', 'date_max', '<='),
how = 'inner'
).drop([('right', col) for col in ['index1', 'index2', 'date_min', 'date_max']], axis=1).droplevel(0, axis=1)
I haven't done any test to see if this is more efficient but, at least the code is smaller and more readable.
great response @dugarte-vox. It seems though that complete may fit in here for you - it makes missing rows explicit:
# pip install pyjanitor
(df
.complete(
{'date':lambda f: pd.date_range(f.date.min(), f.date.max(), freq='D')},
by=['index1','index2'])
)
The by
parameter ensures the columns are completed
per group. Under the hood it is just a for loop on the groups, which may help with memory issues, but may not be great performance wise. it is on my todo list to improve the performance for groups, i just havent had time to think it through, and really i wasnt sure anybody was using the function.
Another option, using complete
, which may be performant, is to use a variant of your solution, where you build a large dataframe and post filter (if memory allows):
grp = df.groupby(['index1','index2'],sort=False)
(df
.assign(
date_min=grp.date.transform('min'),
date_max=grp.date.transform('max'))
.complete(
('index1','index2','date_min','date_max'),
{'date':lambda f: pd.date_range(f.date.min(), f.date.max(), freq='D')})
.query('date_min<=date<=date_max')
)
appreciate the feedback, and if you have suggestions on how to make the functions better, the dev team are glad and always eager to get feedback.
Brief Description
I don't know if I'm doing this correctly but, I want to apply 2 equalities and 2 inequalities (dates). If I apply one equality it doesn't return an error but, when I apply both equalities an error is returned.
System Information
Minimally Reproducible Code
Error Messages
AttributeError: 'NoneType' object has no attribute 'size'