pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.86k stars 18.01k forks source link

Problem with MultiIndex upsampling / resampling #28313

Open Hvass-Labs opened 5 years ago

Hvass-Labs commented 5 years ago

Summary

Thanks for making Pandas I have used it in a lot of projects! But now I have a problem.

I have spent nearly 3 days trying to figure out how to resample / upsample a Pandas MultiIndex elegantly and correctly. I have read and tried numerous posts on StackOverflow and GitHub. My conclusion is that I don't think this is supported very well in Pandas. Let me explain what I want to do.

Background

I am currently building a Python API in collaboration with www.simfin.com that makes it very easy to download and use financial data (share-prices, fundamentals, etc.) for free. This will enable people to conduct and share financial research very easily. It works by downloading bulk-data in CSV files from the SimFin server and loading them in Pandas. The fundamental data such as Income Statements and Balance Sheets is usually indexed by the Ticker and Report Date which creates a Pandas DataFrame with a MultiIndex.

Data Example

Let us say we have a Pandas DataFrame df with this data:

                         Revenue  Net Income (Common)
Ticker Report Date                                   
AAPL   2007-09-30    24578000000           3495000000
       2008-09-30    37491000000           6119000000
       2009-09-30    42905000000           8235000000
       2010-09-30    65225000000          14013000000
       2011-09-30   108249000000          25922000000
       2012-09-30   156508000000          41733000000
       2013-09-30   170910000000          37037000000
       2014-09-30   182795000000          39510000000
       2015-09-30   233715000000          53394000000
       2016-09-30   215639000000          45687000000
       2017-09-30   229234000000          48351000000
       2018-09-30   265595000000          59531000000
AMZN   2007-12-31    14835000000            476000000
       2008-12-31    19166000000            645000000
       2009-12-31    24509000000            902000000
       2010-12-31    34204000000           1152000000
       2011-12-31    48077000000            631000000
       2012-12-31    61093000000            -39000000
       2013-12-31    74452000000            274000000
       2014-12-31    88988000000           -241000000
       2015-12-31   107006000000            596000000
       2016-12-31   135987000000           2371000000
       2017-12-31   177866000000           3033000000
       2018-12-31   232887000000          10073000000
MSFT   2008-06-30    60420000000          17681000000
       2009-06-30    58437000000          14569000000
       2010-06-30    62484000000          18760000000
       2011-06-30    69943000000          23150000000
       2012-06-30    73723000000          16978000000
       2013-06-30    77849000000          21863000000
       2014-06-30    86833000000          22074000000
       2015-06-30    93580000000          12193000000
       2016-06-30    91154000000          20539000000
       2017-06-30    96571000000          25489000000
       2018-06-30   110360000000          16571000000

Resample a single Ticker (DatetimeIndex)

Let us first resample for a single ticker:

df.loc['MSFT'].resample('D').pad()

This works and the result is:

                 Revenue  Net Income (Common)
Report Date                                  
2008-06-30   60420000000          17681000000
2008-07-01   60420000000          17681000000
2008-07-02   60420000000          17681000000
2008-07-03   60420000000          17681000000
2008-07-04   60420000000          17681000000

Resample multiple Tickers (MultiIndex)

Let us now try and resample for all tickers in the DataFrame. The resample() function takes an argument level which is supposed to work with a MultiIndex DataFrame:

df.resample('D', level='Report Date').pad()

But this apparently doesn't work for upsampling e.g. annual data to daily data, because we get this error message:

ValueError: Upsampling from level= or on= selection is not supported, use .set_index(...) to explicitly set index to datetime-like

One solution is to use groupby() (adapted from e.g. #13699):

df.reset_index('Ticker').groupby('Ticker').resample('D').pad()

This works, but it now has duplicated the Ticker both as an index and as a column:

                   Ticker      Revenue  Net Income (Common)
Ticker Report Date                                         
AAPL   2007-09-30    AAPL  24578000000           3495000000
       2007-10-01    AAPL  24578000000           3495000000
       2007-10-02    AAPL  24578000000           3495000000
       2007-10-03    AAPL  24578000000           3495000000
       2007-10-04    AAPL  24578000000           3495000000

We can avoid one of them by adding the arg group_keys=False:

df.reset_index('Ticker').groupby('Ticker', group_keys=False).resample('D').pad()

This works, but now the Ticker is a data-column instead of an index:

            Ticker      Revenue  Net Income (Common)
Report Date                                         
2007-09-30    AAPL  24578000000           3495000000
2007-10-01    AAPL  24578000000           3495000000
2007-10-02    AAPL  24578000000           3495000000
2007-10-03    AAPL  24578000000           3495000000
2007-10-04    AAPL  24578000000           3495000000

To get the original MultiIndex back with both Ticker and Report Date, we need to do:

df.reset_index('Ticker').groupby('Ticker', group_keys=False).resample('D').pad().reset_index().set_index(['Ticker', 'Report Date'])

Which produces the desired result:

                        Revenue  Net Income (Common)
Ticker Report Date                                  
AAPL   2007-09-30   24578000000           3495000000
       2007-10-01   24578000000           3495000000
       2007-10-02   24578000000           3495000000
       2007-10-03   24578000000           3495000000
       2007-10-04   24578000000           3495000000

But this is so complicated that nobody can be expected to remember how to do it. So I would have to make a small helper-function that does all of this. But because the resampling method (pad, interpolate, etc.) is invoked through a function call on the groupby-object, my helper-function would get big and awkward if I want to allow different methods of resampling.

Conclusion

It appears that upsampling a MultiIndex is not supported very well in Pandas, unless I have misunderstood how it is supposed to work.

I think that by far the most elegant solution would be if the resample() function supported the level argument for upsampling, because the syntax and semantics would be very similar for upsampling DatetimeIndex and MultiIndex:

# DatetimeIndex
df.loc['MSFT'].resample('D').pad()

# MultiIndex
df.resample('D', level='Report Date').pad()

I have taken a look at the Pandas source-code, but it is complicated and so sparsely documented, that it would take me forever to figure out how everything is connected and how it works, so I don't think I will be able to fix this myself. Is this something you could fix, because it would make it so much easier to upsample DataFrames with a MultiIndex?

Thanks!

INSTALLED VERSIONS ------------------ commit : None python : 3.6.8.final.0 python-bits : 64 OS : Linux OS-release : 4.15.0-60-generic machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 0.25.1 numpy : 1.16.4 pytz : 2019.2 dateutil : 2.8.0 pip : 19.1.1 setuptools : 41.0.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 2.10.1 IPython : 7.5.0 pandas_datareader: None bs4 : None bottleneck : None fastparquet : None gcsfs : None lxml.etree : None matplotlib : 3.1.0 numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pytables : None s3fs : None scipy : 1.3.1 sqlalchemy : None tables : None xarray : None xlrd : None xlwt : None xlsxwriter : None
TomAugspurger commented 5 years ago

So the core request is allowing upsampling in with a MultiIndex and level=? If so, could you update your post to include a copy-pastable example with

And probably the rest of the post can be trimmed / removed (things like workarounds using groupby).

Is this something you could fix, because it would make it so much easier to upsample DataFrames with a MultiIndex?

It's likely not on anyones critical path. We can keep this issue open though in case someone interested in fixing it comes along.

Hvass-Labs commented 5 years ago

Thanks for the quick reply. I realize the post was a bit long, but it was important to explain the reason behind the issue. It took me a long time to prepare the post so it was easy to understand the problem, and it will take me even more time to "trim it down", and considering you don't think it's a high-priority issue, I would prefer to spend my time on more important and productive things. Besides, the post contains all the information you ask for in a clearly laid out manner, if you take 5 minutes to read it.

Small comment: You have 400 open issues related to MultiIndex which is about 13% of all your open issues for Pandas, so I think it would be wise of you to make MultiIndex a high-priority issue, which includes issues about resampling. The problem seems to be that MultiIndex was perhaps a bit poorly designed and implemented from the beginning, and then it was patched numerous times to try and fix various small problems. It might be wise to consider if MultiIndex ought to be re-implemented from scratch, since it is causing so many problems.

TomAugspurger commented 5 years ago

OK. Was

So the core request is allowing upsampling in with a MultiIndex and level=?

A fair summary?

Hvass-Labs commented 5 years ago

Well ... Allowing upsampling of a MultiIndex with df.resample(level='something') is my suggestion for an elegant solution, but I actually think I might have misunderstood the intended semantics of the level arg in resample (the docs are unclear to me). The problem is also bigger than just that, because it involves the semantics of groupby() as shown in the examples. It is not so easy to summarize this entire problem into a single line, so I think my title summarizes it reasonably well, and I hope that the detailed explanation and examples will be of assistance to the core developer who will one day try and solve this problem, which is the reason I spent the time writing this.

Hvass-Labs commented 5 years ago

The following is the workaround I will probably use. It is a wrapper function for upsampling either a Pandas DataFrame or Series, with either a DatetimeIndex or a MultiIndex. As you can see, it is a mess because Pandas has unclear / inconsistent / complicated semantics for upsampling a MultiIndex.

def upsample(df, rule='D', method='ffill',
             group_index=TICKER, date_index=REPORT_DATE):
    """
    Upsample a Pandas DataFrame or Series with either a DatetimeIndex or
    MultiIndex. This can be used to upsample financial data for a single
    company, or upsample data for multiple companies in a single DataFrame.

    :param df:
        Pandas DataFrame or Series.

    :param rule:
        Upsampling frequency e.g. 'D' for daily.

        This is passed directly to the Pandas resampler which has more options:
        https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects

    :param method:
        String for the method of filling in empty values. Valid options:
        'ffill' is forward-fill with last known values.
        'linear' is linear interpolation between known values.

    :param group_index:
        If `df` has a MultiIndex then group companies using this data-column.
        By default this is TICKER but it could also be e.g. SIMFIN_ID if
        you are using that as an index in your DataFrame.

    :param date_index:
        If `df` has a MultiIndex then use this data-column as the dates.
        By default this is REPORT_DATE but it could also be e.g. PUBLISH_DATE
        if you are using that as an index in your DataFrame.

    :return:
        Upsampled DataFrame or Series.
    """

    assert isinstance(df, (pd.DataFrame, pd.Series))
    assert isinstance(df.index, (pd.DatetimeIndex, pd.MultiIndex))
    assert method == 'ffill' or method == 'linear'

    # Pandas 0.25.1 does not support upsampling a DataFrame with a MultiIndex
    # using the normal resample() function, so we must handle the two cases
    # differently.

    # If the DataFrame has a DatetimeIndex.
    if isinstance(df.index, pd.DatetimeIndex):
        # Normal upsampling using Pandas.

        if method == 'ffill':
            # Fill with last-known value.
            df_upsampled = df.resample(rule).ffill()
        elif method == 'linear':
            # Fill with linearly interpolated values.
            df_upsampled = df.resample(rule).interpolate(method='linear')

    # If the DataFrame has a MultiIndex.
    elif isinstance(df.index, pd.MultiIndex):
        # Pandas has very complicated semantics for resampling a DataFrame
        # with a MultiIndex. The best way is apparently to group the DataFrame
        # by companies (e.g. using TICKER) which creates an individual
        # DataFrame for each company, and then apply the resampling to each
        # of those DataFrames. It is further complicated by the need to reset
        # and set the index. Pandas is quite poorly designed in this regard
        # and its resampling API has already been changed several times.

        # Helper-function for resampling a DataFrame for a single company.
        def _resample(df):
            if method == 'ffill':
                # Fill with last-known value.
                return df.set_index(date_index).resample(rule).ffill()
            elif method == 'linear':
                # Fill with linearly interpolated values.
                return df.set_index(date_index).resample(rule).interpolate(method='linear')

        # Group the original DataFrame by companies and apply the resampling to each.
        df_upsampled = df.reset_index(level=date_index).groupby(level=group_index).apply(_resample)

    return df_upsampled

We can now upsample a DataFrame with a DatetimeIndex, using the example data from above:

upsample(df=df.loc['MSFT'])

Output:

                  Revenue  Net Income (Common)
Report Date                                   
2008-06-30    60420000000          17681000000
2008-07-01    60420000000          17681000000
2008-07-02    60420000000          17681000000
2008-07-03    60420000000          17681000000
2008-07-04    60420000000          17681000000
...                   ...                  ...
2018-06-26    96571000000          25489000000
2018-06-27    96571000000          25489000000
2018-06-28    96571000000          25489000000
2018-06-29    96571000000          25489000000
2018-06-30   110360000000          16571000000

[3653 rows x 2 columns]

We can also upsample a DataFrame with a MultiIndex:

upsample(df=df)

Output:

                         Revenue  Net Income (Common)
Ticker Report Date                                   
AAPL   2007-09-30    24578000000           3495000000
       2007-10-01    24578000000           3495000000
       2007-10-02    24578000000           3495000000
       2007-10-03    24578000000           3495000000
       2007-10-04    24578000000           3495000000
...                          ...                  ...
MSFT   2018-06-26    96571000000          25489000000
       2018-06-27    96571000000          25489000000
       2018-06-28    96571000000          25489000000
       2018-06-29    96571000000          25489000000
       2018-06-30   110360000000          16571000000

[11691 rows x 2 columns]
KeBachmann commented 5 years ago

Thanks for the workaround! However, a native solution would be great. The same problem exists for asfreq()... etc.

Hvass-Labs commented 5 years ago

The solution I am using for now, and which might be of use to others, is to have my own helper-function:

def apply(df, func, group_index=TICKER):
    """
    Apply a function to a Pandas DataFrame or Series with either a
    DatetimeIndex or MultiIndex. This is useful when you don't know
    whether a DataFrame contains data for a single or multiple stocks.

    You write your function to work for a DataFrame with a single stock,
    and this function lets you apply it to both DataFrames with a single
    or multiple stocks. The function automatically uses Pandas groupby to
    split-apply-merge on DataFrames with multiple stocks.

    :param df:
        Pandas DataFrame or Series assumed to have either a DatetimeIndex
        or a MultiIndex with 2 indices, one of which is a DatetimeIndex
        and the other is given by the arg `group_index`.

    :param func:
        Function to apply on a per-stock or per-group basis.
        The function is assumed to be of the form:

            def func(df_grp):
                # df_grp is a Pandas DataFrame with data for a single stock.
                # Perform some calculation on df_grp and create another
                # Pandas DataFrame or Series with the result and return it.
                # For example, we can calculate the cumulative sum:
                return df_grp.cumsum()

    :param group_index:
        If `df` has a MultiIndex then group data using this index-column.
        By default this is TICKER but it could also be e.g. SIMFIN_ID if
        you are using that as an index in your DataFrame.

    :return:
        Pandas DataFrame or Series with the result of applying `func`.
    """

    assert isinstance(df, (pd.DataFrame, pd.Series))
    assert isinstance(df.index, (pd.DatetimeIndex, pd.MultiIndex))

    # If the DataFrame has a DatetimeIndex.
    if isinstance(df.index, pd.DatetimeIndex):
        df_result = func(df)

    # If the DataFrame has a MultiIndex.
    elif isinstance(df.index, pd.MultiIndex):
        # Helper-function for a DataFrame with a single group.
        def _apply_group(df_grp):
            # Remove group-index (e.g. TICKER) from the MultiIndex.
            df_grp = df_grp.reset_index(group_index, drop=True)

            # Perform the operation on this group.
            df_grp_result = func(df_grp)

            return df_grp_result

        # Split the DataFrame into sub-groups and perform
        # the operation on each sub-group and glue the
        # results back together into a single DataFrame.
        df_result = df.groupby(group_index).apply(_apply_group)

    return df_result

It is then fairly simple to make functions that work for both DatetimeIndex and MultiIndex, for example:

def asfreq(df, freq, method=None, group_index=TICKER, **kwargs):

    # Function to apply on a DataFrame with a single stock.
    def _asfreq(df_grp):
        return df_grp.asfreq(freq=freq, method=method, **kwargs)

    # Apply the function and use groupby if DataFrame has multiple stocks.
    df_result = apply(df=df, func=_asfreq, group_index=group_index)

    return df_result

I have a bunch of these for resample, reindex, etc.

andreas-vester commented 4 years ago

Same here.

I think the current pandas solution is quite complicated.

leeprevost commented 2 years ago

I'm not sure I'm seeing this problem or a different problem. I'm trying to get use groupby.resample on a multi-index with a datetime index but return a period index using resample 'kind' argument.

It works fine for the resample returning index as is:

factor_history.groupby(level=0).resample("Q", level=1).last()
Out[8]: 
                         date  ... operatingCashFlow_ttm_m
symbol fillingDate             ...                        
ACIW   1993-12-31  1993-12-31  ...                     NaN
       1994-03-31  1994-03-31  ...                     NaN
       1994-06-30  1994-06-30  ...                     NaN
       1994-09-30  1994-09-30  ...                     NaN
       1994-12-31  1994-12-31  ...                     NaN
                       ...  ...                     ...
ZUO    2021-06-30  2021-04-30  ...                0.059599
       2021-09-30  2021-07-31  ...                0.037492
       2021-12-31  2021-10-31  ...                0.034160
       2022-03-31  2022-01-31  ...                0.053891
       2022-06-30  2022-04-30  ...                0.042874
[8760 rows x 209 columns]

But, when I add the kind='period" arg, I get this error:

`factor_history.groupby(level=0).resample("Q", level=1, kind='period').last()
Traceback (most recent call last):
  File "C:\Users\lee\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3444, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-9-b31b466d8c26>", line 1, in <module>
    factor_history.groupby(level=0).resample("Q", level=1, kind='period').last()
  File "C:\Users\lee\Anaconda3\lib\site-packages\pandas\core\resample.py", line 987, in f
    return self._downsample(_method, min_count=min_count)
  File "C:\Users\lee\Anaconda3\lib\site-packages\pandas\core\resample.py", line 1057, in _apply
    return self._wrap_result(result)
  File "C:\Users\lee\Anaconda3\lib\site-packages\pandas\core\resample.py", line 1218, in _wrap_result
    result.index = result.index.to_period(self.freq)
AttributeError: 'MultiIndex' object has no attribute 'to_period'`
GivyBoy commented 2 years ago

Hi! I would like this issue to be assigned to me. Thanks!

XaineChaeson commented 1 year ago

I have the same problem when I want to upsample a MultiIndex dataframe with pd.Grouper(). I have a panel data with two index columns : YearMonth means the date ,ticker means the company.
:

ticker  YearMonth 
000001  1996-05-31   -0.316089
        1996-06-30   -0.298841
        1996-07-31   -0.288128
        1996-08-31   -0.280339
        1996-09-30   -0.253262
                        ...   
605388  2020-12-31   -0.079680
605399  2020-09-30   -0.073434
        2020-10-31   -0.075319
        2020-11-30   -0.074510
        2020-12-31   -0.064089
Name: tk, Length: 556120, dtype: float64

Now, I want to change the frequency of the data. But I find thatpd.Grouper just work for down sample but upsample.

sample.groupby(['ticker',pd.Grouper(level = 'YearMonth',freq = 'Q')]).last() it will give me the right output :

ticker  YearMonth 
000001  1996-06-30   -0.298841
        1996-09-30   -0.253262
        1996-12-31   -0.260421
        1997-03-31   -0.251454
        1997-06-30   -0.131571
                        ...   
605369  2020-12-31   -0.094853
605388  2020-09-30   -0.094796
        2020-12-31   -0.079680
605399  2020-09-30   -0.073434
        2020-12-31   -0.064089
Name: tk, Length: 189240, dtype: float64

sample.groupby(['ticker',pd.Grouper(level = 'YearMonth',freq = 'D')]).ffill() it just return the raw data unchanged

ticker  YearMonth 
000001  1996-05-31   -0.316089
        1996-06-30   -0.298841
        1996-07-31   -0.288128
        1996-08-31   -0.280339
        1996-09-30   -0.253262
                        ...   
605388  2020-12-31   -0.079680
605399  2020-09-30   -0.073434
        2020-10-31   -0.075319
        2020-11-30   -0.074510
        2020-12-31   -0.064089
Name: tk, Length: 556120, dtype: float64