pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.68k stars 17.92k forks source link

BUG: Incorrect output in groupby(...).resample(...).ohlc() #53804

Open btodac opened 1 year ago

btodac commented 1 year ago

Pandas version checks

Reproducible Example

import pandas as pd
data = [0,1,2,3,4,5,6,7,8,9,10,11]
index = pd.date_range(
    pd.Timestamp("2000-01-01"),
    pd.Timestamp("2000-01-01") + pd.Timedelta(minutes=3),
    freq='16S',
    )

s = pd.Series(data,index)
s.resample('T').ohlc()

Issue Description

open high low close 2000-01-01 00:00:00 0 3 0 3 2000-01-01 00:01:00 4 7 4 7 2000-01-01 00:02:00 8 11 8 11

The open values for the second and third values are from the future: 2000-01-01 00:01:04 4 and 2000-01-01 00:02:08 8

Expected Behavior

The output should be: open high low close 2000-01-01 00:00:00 0 3 0 3 2000-01-01 00:01:00 3 7 4 7 2000-01-01 00:02:00 7 11 8 11 For all but the first open value the previous close value should be used

Installed Versions

INSTALLED VERSIONS ------------------ commit : 37ea63d540fd27274cad6585082c91b1283f963d python : 3.10.11.final.0 python-bits : 64 OS : Linux OS-release : 5.4.0-148-generic Version : #165-Ubuntu SMP Tue Apr 18 08:53:12 UTC 2023 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_GB.UTF-8 LOCALE : en_GB.UTF-8 pandas : 2.0.1 numpy : 1.24.3 pytz : 2023.3 dateutil : 2.8.2 setuptools : 67.7.2 pip : 23.1.2 Cython : None pytest : None hypothesis : None sphinx : 7.0.1 blosc : None feather : None xlsxwriter : None lxml.etree : 4.9.2 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.13.2 pandas_datareader: None bs4 : 4.12.2 bottleneck : None brotli : fastparquet : None fsspec : None gcsfs : None matplotlib : 3.7.1 numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : 1.10.1 snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : 2.3.1 pyqt5 : None
btodac commented 1 year ago

I think the offending file is _libs/groupby.pyx. The ohlc function should look like:

@cython.wraparound(False)
@cython.boundscheck(False)
def group_ohlc(
    int64float_t[:, ::1] out,
    int64_t[::1] counts,
    ndarray[int64float_t, ndim=2] values,
    const intp_t[::1] labels,
    Py_ssize_t min_count=-1,
    const uint8_t[:, ::1] mask=None,
    uint8_t[:, ::1] result_mask=None,
) -> None:
    """
    Only aggregates on axis=0
    """
    cdef:
        Py_ssize_t i, N, K, lab
        int64float_t val, old_val
        uint8_t[::1] first_element_set
        bint isna_entry, uses_mask = mask is not None

    assert min_count == -1, "'min_count' only used in sum and prod"

    if len(labels) == 0:
        return

    N, K = (<object>values).shape

    if out.shape[1] != 4:
        raise ValueError("Output array must have 4 columns")

    if K > 1:
        raise NotImplementedError("Argument 'values' must have only one dimension")

    if int64float_t is float32_t or int64float_t is float64_t:
        out[:] = np.nan
    else:
        out[:] = 0

    first_element_set = np.zeros((<object>counts).shape, dtype=np.uint8)
    if uses_mask:
        result_mask[:] = True

    with nogil:
        for i in range(N):
            lab = labels[i]
            if lab == -1:
                continue

            counts[lab] += 1
            val = values[i, 0]

            if i == 0:
                old_val = val

            if uses_mask:
                isna_entry = mask[i, 0]
            else:
                isna_entry = _treat_as_na(val, False)

            if isna_entry:
                continue

            if not first_element_set[lab]:
                out[lab, 0] = old_val
                out[lab, 1] = max(old_val, val)
                out[lab, 2] = min(old_val, val)
                out[lab, 3] = val
                first_element_set[lab] = True
                if uses_mask:
                    result_mask[lab] = False
            else:
                out[lab, 1] = max(out[lab, 1], val)
                out[lab, 2] = min(out[lab, 2], val)
                out[lab, 3] = val
                old_val = val

This introduces a new variable called old_val that stores either the initial val if i==0 or the last val. Then at the start of the next element old_val is used to initialise the element. One issue that may occur is if a timestamp of val occurs at the exact time of the element label. I'm not sure what the common standard for this is (a<b or a<=b). However, this prevents the future data leaking back in time.

btodac commented 1 year ago

Looking at this solution with fresh eyes shows it has issues. If a data series represents a single continuous data period, e.g. the minute by minute data for a single day and the ohlc for a the day is generated it works correctly. If the data represents multiple days of minute by minute data then it will show the incorrect opening value for each day after the first day. Can this be corrected in the core/groupby/groupby.py function of ohlc? Using something like

    c = result['Close']
    c.index = c.index + pd.tseries.frequencies.to_offset(freq)
    index = c.index.intersection(result.index)
    result.loc[index,'Open'] = c.loc[index]

Where freq is the resampling rule and is less than a day. This ensures discontinuities in the data greater than freq are handled correctly, but if freq is one day or greater the current method should be applied.

rhshadrach commented 1 year ago

Thanks for the report @btodac; can you give this issue a more informative title?