Open btodac opened 1 year ago
I think the offending file is `_libs/groupby.pyx`. The `group_ohlc` function should look like:
```cython
@cython.wraparound(False)
@cython.boundscheck(False)
def group_ohlc(
    int64float_t[:, ::1] out,
    int64_t[::1] counts,
    ndarray[int64float_t, ndim=2] values,
    const intp_t[::1] labels,
    Py_ssize_t min_count=-1,
    const uint8_t[:, ::1] mask=None,
    uint8_t[:, ::1] result_mask=None,
) -> None:
    """
    Only aggregates on axis=0
    """
    cdef:
        Py_ssize_t i, N, K, lab
        int64float_t val, old_val
        uint8_t[::1] first_element_set
        bint isna_entry, uses_mask = mask is not None

    assert min_count == -1, "'min_count' only used in sum and prod"

    if len(labels) == 0:
        return

    N, K = (<object>values).shape

    if out.shape[1] != 4:
        raise ValueError("Output array must have 4 columns")

    if K > 1:
        raise NotImplementedError("Argument 'values' must have only one dimension")

    if int64float_t is float32_t or int64float_t is float64_t:
        out[:] = np.nan
    else:
        out[:] = 0

    first_element_set = np.zeros((<object>counts).shape, dtype=np.uint8)
    if uses_mask:
        result_mask[:] = True

    with nogil:
        for i in range(N):
            lab = labels[i]
            if lab == -1:
                continue

            counts[lab] += 1
            val = values[i, 0]
            if i == 0:
                old_val = val

            if uses_mask:
                isna_entry = mask[i, 0]
            else:
                isna_entry = _treat_as_na(val, False)

            if isna_entry:
                continue

            if not first_element_set[lab]:
                out[lab, 0] = old_val
                out[lab, 1] = max(old_val, val)
                out[lab, 2] = min(old_val, val)
                out[lab, 3] = val
                first_element_set[lab] = True
                if uses_mask:
                    result_mask[lab] = False
            else:
                out[lab, 1] = max(out[lab, 1], val)
                out[lab, 2] = min(out[lab, 2], val)
                out[lab, 3] = val
            old_val = val
```
This introduces a new variable, `old_val`, that stores either the initial `val` (when `i == 0`) or the last `val` seen. At the start of each group's first element, `old_val` is used to initialise that element. One issue that may occur is a timestamp of `val` falling exactly on the bin label's time; I'm not sure what the common convention for this is (`a < b` or `a <= b`). However, this change prevents future data leaking back in time.
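The intent of the Cython change may be easier to see in a pure-Python sketch (the function name and simplified signature are illustrative, not pandas API; NA handling is reduced to a NaN check, and the group count/mask bookkeeping is omitted):

```python
import math

def group_ohlc_prev_close(values, labels, ngroups):
    """Aggregate per-group OHLC, seeding each group's open (and its
    initial high/low) with the value seen just before the group starts."""
    out = [[math.nan] * 4 for _ in range(ngroups)]  # [open, high, low, close]
    first_element_set = [False] * ngroups
    old_val = values[0] if values else math.nan
    for val, lab in zip(values, labels):
        if lab == -1 or math.isnan(val):
            continue
        if not first_element_set[lab]:
            out[lab][0] = old_val            # open = last value before this group
            out[lab][1] = max(old_val, val)  # high
            out[lab][2] = min(old_val, val)  # low
            out[lab][3] = val                # close
            first_element_set[lab] = True
        else:
            out[lab][1] = max(out[lab][1], val)
            out[lab][2] = min(out[lab][2], val)
            out[lab][3] = val
        old_val = val
    return out
```

Note that, exactly as in the proposed Cython, the carried-over value is also folded into the bar's high/low via `max(old_val, val)` / `min(old_val, val)`, so it affects more than just the open.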
Looking at this solution with fresh eyes shows it has issues. If the data series represents a single continuous period, e.g. minute-by-minute data for a single day with the OHLC generated for that day, it works correctly. But if the data represents multiple days of minute-by-minute data, it will show the incorrect opening value for each day after the first. Can this instead be corrected in the `ohlc` function in core/groupby/groupby.py? Using something like:

```python
c = result['Close']
c.index = c.index + pd.tseries.frequencies.to_offset(freq)
index = c.index.intersection(result.index)
result.loc[index, 'Open'] = c.loc[index]
```

where `freq` is the resampling rule and is less than a day. This ensures discontinuities in the data greater than `freq` are handled correctly; but if `freq` is one day or greater, the current method should be applied.
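The suggested post-processing can be demonstrated end to end. The series construction below is illustrative (one value every 16 seconds, resampled to 1-minute bars), and `.copy()` is added so the index change does not touch the original frame:

```python
import pandas as pd

# Illustrative data: values 0..11 at a 16-second spacing.
s = pd.Series(range(12),
              index=pd.date_range("2000-01-01", periods=12, freq="16s"))
freq = "1min"
result = s.resample(freq).ohlc()  # opens are [0, 4, 8] here

# Shift each bar's close forward by one bar and use it as the next bar's open.
c = result["close"].copy()
c.index = c.index + pd.tseries.frequencies.to_offset(freq)
index = c.index.intersection(result.index)
result.loc[index, "open"] = c.loc[index]
```

After the correction the opens become the previous closes (`[0, 3, 7]`), while the first bar, which has no predecessor, keeps its original open.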
Thanks for the report @btodac; can you give this issue a more informative title?
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
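The example code is missing from the scraped report. The output shown below is consistent with a construction like the following; the 16-second frequency is inferred from the timestamps 00:01:04 and 00:02:08 mentioned later, so treat this as a reconstruction rather than the reporter's exact code:

```python
import pandas as pd

# Values 0..11, one every 16 seconds, resampled into 1-minute OHLC bars.
s = pd.Series(range(12),
              index=pd.date_range("2000-01-01", periods=12, freq="16s"))
df = s.resample("1min").ohlc()
print(df)
```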
Issue Description
```
                     open  high  low  close
2000-01-01 00:00:00     0     3    0      3
2000-01-01 00:01:00     4     7    4      7
2000-01-01 00:02:00     8    11    8     11
```
The open values for the second and third rows come from the future: `2000-01-01 00:01:04 → 4` and `2000-01-01 00:02:08 → 8`.
Expected Behavior
The output should be:
```
                     open  high  low  close
2000-01-01 00:00:00     0     3    0      3
2000-01-01 00:01:00     3     7    4      7
2000-01-01 00:02:00     7    11    8     11
```
For all but the first row, the previous close value should be used as the open.
Installed Versions