Open Hvass-Labs opened 5 years ago
So the core request is allowing upsampling in with a MultiIndex and level=
? If so, could you update your post to include a copy-pastable example with
And probably the rest of the post can be trimmed / removed (things like workarounds using groupby).
Is this something you could fix, because it would make it so much easier to upsample DataFrames with a MultiIndex?
It's likely not on anyones critical path. We can keep this issue open though in case someone interested in fixing it comes along.
Thanks for the quick reply. I realize the post was a bit long, but it was important to explain the reason behind the issue. It took me a long time to prepare the post so it was easy to understand the problem, and it will take me even more time to "trim it down", and considering you don't think it's a high-priority issue, I would prefer to spend my time on more important and productive things. Besides, the post contains all the information you ask for in a clearly laid out manner, if you take 5 minutes to read it.
Small comment: You have 400 open issues related to MultiIndex
which is about 13% of all your open issues for Pandas, so I think it would be wise of you to make MultiIndex
a high-priority issue, which includes issues about resampling. The problem seems to be that MultiIndex
was perhaps a bit poorly designed and implemented from the beginning, and then it was patched numerous times to try and fix various small problems. It might be wise to consider if MultiIndex
ought to be re-implemented from scratch, since it is causing so many problems.
OK. Was
So the core request is allowing upsampling in with a MultiIndex and level=?
A fair summary?
Well ... Allowing upsampling of a MultiIndex
with df.resample(level='something')
is my suggestion for an elegant solution, but I actually think I might have misunderstood the intended semantics of the level
arg in resample
(the docs are unclear to me). The problem is also bigger than just that, because it involves the semantics of groupby()
as shown in the examples. It is not so easy to summarize this entire problem into a single line, so I think my title summarizes it reasonably well, and I hope that the detailed explanation and examples will be of assistance to the core developer who will one day try and solve this problem, which is the reason I spent the time writing this.
The following is the workaround I will probably use. It is a wrapper function for upsampling either a Pandas DataFrame
or Series
, with either a DatetimeIndex
or a MultiIndex
. As you can see, it is a mess because Pandas has unclear / inconsistent / complicated semantics for upsampling a MultiIndex
.
def upsample(df, rule='D', method='ffill',
group_index=TICKER, date_index=REPORT_DATE):
"""
Upsample a Pandas DataFrame or Series with either a DatetimeIndex or
MultiIndex. This can be used to upsample financial data for a single
company, or upsample data for multiple companies in a single DataFrame.
:param df:
Pandas DataFrame or Series.
:param rule:
Upsampling frequency e.g. 'D' for daily.
This is passed directly to the Pandas resampler which has more options:
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects
:param method:
String for the method of filling in empty values. Valid options:
'ffill' is forward-fill with last known values.
'linear' is linear interpolation between known values.
:param group_index:
If `df` has a MultiIndex then group companies using this data-column.
By default this is TICKER but it could also be e.g. SIMFIN_ID if
you are using that as an index in your DataFrame.
:param date_index:
If `df` has a MultiIndex then use this data-column as the dates.
By default this is REPORT_DATE but it could also be e.g. PUBLISH_DATE
if you are using that as an index in your DataFrame.
:return:
Upsampled DataFrame or Series.
"""
assert isinstance(df, (pd.DataFrame, pd.Series))
assert isinstance(df.index, (pd.DatetimeIndex, pd.MultiIndex))
assert method == 'ffill' or method == 'linear'
# Pandas 0.25.1 does not support upsampling a DataFrame with a MultiIndex
# using the normal resample() function, so we must handle the two cases
# differently.
# If the DataFrame has a DatetimeIndex.
if isinstance(df.index, pd.DatetimeIndex):
# Normal upsampling using Pandas.
if method == 'ffill':
# Fill with last-known value.
df_upsampled = df.resample(rule).ffill()
elif method == 'linear':
# Fill with linearly interpolated values.
df_upsampled = df.resample(rule).interpolate(method='linear')
# If the DataFrame has a MultiIndex.
elif isinstance(df.index, pd.MultiIndex):
# Pandas has very complicated semantics for resampling a DataFrame
# with a MultiIndex. The best way is apparently to group the DataFrame
# by companies (e.g. using TICKER) which creates an individual
# DataFrame for each company, and then apply the resampling to each
# of those DataFrames. It is further complicated by the need to reset
# and set the index. Pandas is quite poorly designed in this regard
# and its resampling API has already been changed several times.
# Helper-function for resampling a DataFrame for a single company.
def _resample(df):
if method == 'ffill':
# Fill with last-known value.
return df.set_index(date_index).resample(rule).ffill()
elif method == 'linear':
# Fill with linearly interpolated values.
return df.set_index(date_index).resample(rule).interpolate(method='linear')
# Group the original DataFrame by companies and apply the resampling to each.
df_upsampled = df.reset_index(level=date_index).groupby(level=group_index).apply(_resample)
return df_upsampled
We can now upsample a DataFrame
with a DatetimeIndex
, using the example data from above:
upsample(df=df.loc['MSFT'])
Output:
Revenue Net Income (Common)
Report Date
2008-06-30 60420000000 17681000000
2008-07-01 60420000000 17681000000
2008-07-02 60420000000 17681000000
2008-07-03 60420000000 17681000000
2008-07-04 60420000000 17681000000
... ... ...
2018-06-26 96571000000 25489000000
2018-06-27 96571000000 25489000000
2018-06-28 96571000000 25489000000
2018-06-29 96571000000 25489000000
2018-06-30 110360000000 16571000000
[3653 rows x 2 columns]
We can also upsample a DataFrame
with a MultiIndex
:
upsample(df=df)
Output:
Revenue Net Income (Common)
Ticker Report Date
AAPL 2007-09-30 24578000000 3495000000
2007-10-01 24578000000 3495000000
2007-10-02 24578000000 3495000000
2007-10-03 24578000000 3495000000
2007-10-04 24578000000 3495000000
... ... ...
MSFT 2018-06-26 96571000000 25489000000
2018-06-27 96571000000 25489000000
2018-06-28 96571000000 25489000000
2018-06-29 96571000000 25489000000
2018-06-30 110360000000 16571000000
[11691 rows x 2 columns]
Thanks for the workaround! However, a native solution would be great. The same problem exists for asfreq()... etc.
The solution I am using for now, and which might be of use to others, is to have my own helper-function:
def apply(df, func, group_index=TICKER):
"""
Apply a function to a Pandas DataFrame or Series with either a
DatetimeIndex or MultiIndex. This is useful when you don't know
whether a DataFrame contains data for a single or multiple stocks.
You write your function to work for a DataFrame with a single stock,
and this function lets you apply it to both DataFrames with a single
or multiple stocks. The function automatically uses Pandas groupby to
split-apply-merge on DataFrames with multiple stocks.
:param df:
Pandas DataFrame or Series assumed to have either a DatetimeIndex
or a MultiIndex with 2 indices, one of which is a DatetimeIndex
and the other is given by the arg `group_index`.
:param func:
Function to apply on a per-stock or per-group basis.
The function is assumed to be of the form:
def func(df_grp):
# df_grp is a Pandas DataFrame with data for a single stock.
# Perform some calculation on df_grp and create another
# Pandas DataFrame or Series with the result and return it.
# For example, we can calculate the cumulative sum:
return df_grp.cumsum()
:param group_index:
If `df` has a MultiIndex then group data using this index-column.
By default this is TICKER but it could also be e.g. SIMFIN_ID if
you are using that as an index in your DataFrame.
:return:
Pandas DataFrame or Series with the result of applying `func`.
"""
assert isinstance(df, (pd.DataFrame, pd.Series))
assert isinstance(df.index, (pd.DatetimeIndex, pd.MultiIndex))
# If the DataFrame has a DatetimeIndex.
if isinstance(df.index, pd.DatetimeIndex):
df_result = func(df)
# If the DataFrame has a MultiIndex.
elif isinstance(df.index, pd.MultiIndex):
# Helper-function for a DataFrame with a single group.
def _apply_group(df_grp):
# Remove group-index (e.g. TICKER) from the MultiIndex.
df_grp = df_grp.reset_index(group_index, drop=True)
# Perform the operation on this group.
df_grp_result = func(df_grp)
return df_grp_result
# Split the DataFrame into sub-groups and perform
# the operation on each sub-group and glue the
# results back together into a single DataFrame.
df_result = df.groupby(group_index).apply(_apply_group)
return df_result
It is then fairly simple to make functions that work for both DatetimeIndex
and MultiIndex
, for example:
def asfreq(df, freq, method=None, group_index=TICKER, **kwargs):
# Function to apply on a DataFrame with a single stock.
def _asfreq(df_grp):
return df_grp.asfreq(freq=freq, method=method, **kwargs)
# Apply the function and use groupby if DataFrame has multiple stocks.
df_result = apply(df=df, func=_asfreq, group_index=group_index)
return df_result
I have a bunch of these for resample
, reindex
, etc.
Same here.
I think the current pandas solution is quite complicated.
I'm not sure I'm seeing this problem or a different problem. I'm trying to get use groupby.resample on a multi-index with a datetime index but return a period index using resample 'kind' argument.
It works fine for the resample returning index as is:
factor_history.groupby(level=0).resample("Q", level=1).last()
Out[8]:
date ... operatingCashFlow_ttm_m
symbol fillingDate ...
ACIW 1993-12-31 1993-12-31 ... NaN
1994-03-31 1994-03-31 ... NaN
1994-06-30 1994-06-30 ... NaN
1994-09-30 1994-09-30 ... NaN
1994-12-31 1994-12-31 ... NaN
... ... ...
ZUO 2021-06-30 2021-04-30 ... 0.059599
2021-09-30 2021-07-31 ... 0.037492
2021-12-31 2021-10-31 ... 0.034160
2022-03-31 2022-01-31 ... 0.053891
2022-06-30 2022-04-30 ... 0.042874
[8760 rows x 209 columns]
But, when I add the kind='period" arg, I get this error:
`factor_history.groupby(level=0).resample("Q", level=1, kind='period').last()
Traceback (most recent call last):
File "C:\Users\lee\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3444, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-9-b31b466d8c26>", line 1, in <module>
factor_history.groupby(level=0).resample("Q", level=1, kind='period').last()
File "C:\Users\lee\Anaconda3\lib\site-packages\pandas\core\resample.py", line 987, in f
return self._downsample(_method, min_count=min_count)
File "C:\Users\lee\Anaconda3\lib\site-packages\pandas\core\resample.py", line 1057, in _apply
return self._wrap_result(result)
File "C:\Users\lee\Anaconda3\lib\site-packages\pandas\core\resample.py", line 1218, in _wrap_result
result.index = result.index.to_period(self.freq)
AttributeError: 'MultiIndex' object has no attribute 'to_period'`
Hi! I would like this issue to be assigned to me. Thanks!
I have the same problem when I want to upsample a MultiIndex dataframe with pd.Grouper().
I have a panel data with two index columns : YearMonth means the date ,ticker means the company.
:
ticker YearMonth
000001 1996-05-31 -0.316089
1996-06-30 -0.298841
1996-07-31 -0.288128
1996-08-31 -0.280339
1996-09-30 -0.253262
...
605388 2020-12-31 -0.079680
605399 2020-09-30 -0.073434
2020-10-31 -0.075319
2020-11-30 -0.074510
2020-12-31 -0.064089
Name: tk, Length: 556120, dtype: float64
Now, I want to change the frequency of the data. But I find thatpd.Grouper
just work for down sample but upsample.
sample.groupby(['ticker',pd.Grouper(level = 'YearMonth',freq = 'Q')]).last()
it will give me the right output :
ticker YearMonth
000001 1996-06-30 -0.298841
1996-09-30 -0.253262
1996-12-31 -0.260421
1997-03-31 -0.251454
1997-06-30 -0.131571
...
605369 2020-12-31 -0.094853
605388 2020-09-30 -0.094796
2020-12-31 -0.079680
605399 2020-09-30 -0.073434
2020-12-31 -0.064089
Name: tk, Length: 189240, dtype: float64
sample.groupby(['ticker',pd.Grouper(level = 'YearMonth',freq = 'D')]).ffill()
it just return the raw data unchanged
ticker YearMonth
000001 1996-05-31 -0.316089
1996-06-30 -0.298841
1996-07-31 -0.288128
1996-08-31 -0.280339
1996-09-30 -0.253262
...
605388 2020-12-31 -0.079680
605399 2020-09-30 -0.073434
2020-10-31 -0.075319
2020-11-30 -0.074510
2020-12-31 -0.064089
Name: tk, Length: 556120, dtype: float64
Summary
Thanks for making Pandas I have used it in a lot of projects! But now I have a problem.
I have spent nearly 3 days trying to figure out how to resample / upsample a Pandas MultiIndex elegantly and correctly. I have read and tried numerous posts on StackOverflow and GitHub. My conclusion is that I don't think this is supported very well in Pandas. Let me explain what I want to do.
Background
I am currently building a Python API in collaboration with www.simfin.com that makes it very easy to download and use financial data (share-prices, fundamentals, etc.) for free. This will enable people to conduct and share financial research very easily. It works by downloading bulk-data in CSV files from the SimFin server and loading them in Pandas. The fundamental data such as Income Statements and Balance Sheets is usually indexed by the Ticker and Report Date which creates a Pandas DataFrame with a MultiIndex.
Data Example
Let us say we have a Pandas DataFrame
df
with this data:Resample a single Ticker (DatetimeIndex)
Let us first resample for a single ticker:
This works and the result is:
Resample multiple Tickers (MultiIndex)
Let us now try and resample for all tickers in the DataFrame. The
resample()
function takes an argumentlevel
which is supposed to work with a MultiIndex DataFrame:But this apparently doesn't work for upsampling e.g. annual data to daily data, because we get this error message:
One solution is to use
groupby()
(adapted from e.g. #13699):This works, but it now has duplicated the Ticker both as an index and as a column:
We can avoid one of them by adding the arg
group_keys=False
:This works, but now the Ticker is a data-column instead of an index:
To get the original MultiIndex back with both Ticker and Report Date, we need to do:
Which produces the desired result:
But this is so complicated that nobody can be expected to remember how to do it. So I would have to make a small helper-function that does all of this. But because the resampling method (pad, interpolate, etc.) is invoked through a function call on the groupby-object, my helper-function would get big and awkward if I want to allow different methods of resampling.
Conclusion
It appears that upsampling a MultiIndex is not supported very well in Pandas, unless I have misunderstood how it is supposed to work.
I think that by far the most elegant solution would be if the
resample()
function supported thelevel
argument for upsampling, because the syntax and semantics would be very similar for upsampling DatetimeIndex and MultiIndex:I have taken a look at the Pandas source-code, but it is complicated and so sparsely documented, that it would take me forever to figure out how everything is connected and how it works, so I don't think I will be able to fix this myself. Is this something you could fix, because it would make it so much easier to upsample DataFrames with a MultiIndex?
Thanks!