pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.73k stars 17.95k forks source link

BUG: TimeDeltaIndex slicing with strings incorrectly includes too much data #33603

Open pbotros opened 4 years ago

pbotros commented 4 years ago

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
fs = 50000
i = pd.to_timedelta(np.arange(900 * fs, dtype=np.int) / fs * 1e9, unit='ns')
df = pd.DataFrame({'dummy': np.arange(len(i))}, index=i)
assert np.isclose(len(df['710s':'711s']) / fs, 2.0)
assert np.isclose(len(df['710s':'719s']) / fs, 10.0)
assert np.isclose(len(df['610s':'620s']) / fs, 11.0)
assert np.isclose(len(df['710s':'720.00000001s']) / fs, 10.00002)
assert np.isclose(len(df['710s':'720s']) / fs, 11.0)  # fails! Slices 70 seconds of data??

Problem description

Slicing a dataframe with a TimeDeltaIndex with the particular right bound '720s' seems to be incorrectly parsed, not returning the time slice as expected. As can be seen in the above example, other bounds work as expected, but using '720s' as the right bound returns 60 more seconds of data than it should have.

Expected Output

Slicing between '710s' and '720s' should return 11 seconds of data, as slicing '610s' and '620s' does.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : None python : 3.7.3.final.0 python-bits : 64 OS : Linux OS-release : 5.0.0-29-generic machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 1.0.3 numpy : 1.18.2 pytz : 2019.3 dateutil : 2.8.1 pip : 9.0.1 setuptools : 46.1.3 Cython : None pytest : 4.3.1 hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : 0.999999999 pymysql : 0.9.3 psycopg2 : None jinja2 : 2.10.3 IPython : None pandas_datareader: None bs4 : None bottleneck : None fastparquet : None gcsfs : None lxml.etree : None matplotlib : 3.2.1 numexpr : None odfpy : None openpyxl : 2.4.11 pandas_gbq : None pyarrow : 0.13.0 pytables : None pytest : 4.3.1 pyxlsb : None s3fs : None scipy : 1.4.1 sqlalchemy : 1.3.12 tables : None tabulate : 0.8.6 xarray : None xlrd : 1.2.0 xlwt : 1.3.0 xlsxwriter : None numba : 0.48.0
mroeschke commented 4 years ago

Actually I think this is the intended behavior.

I think your example shows partial string indexing for a TimedeltaIndex. So since the right edge of '720s' is 12 minutes, the result returns all results with a minute component of 12 minutes. Likewise since '620s' is 10 minutes 20 seconds, the result returns all results with a minute component of 10 and seconds component of 20.

I definitely think this can be better documented here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#indexing

mattbit commented 4 years ago

@mroeschke I think this is related to #21186. While I do see the point in having partial string indexing for datetimes, in the case of TimedeltaIndex it can create some strange and counterintuitive behaviour.

Some example to explain better:

# Create a timeseries with 10 Hz timedelta index (one sample each 0.1 s)
# i.e. index contains values ['00:00:00', '00:00:00.1', '00:00:00.2', …] and
# the series values represent the sample number
idx = pd.timedelta_range(0, '10s', freq='100ms')
ts = pd.Series(np.arange(len(idx)), index=idx)

# I want to get a specific sample, at '00:00:03'
ts.loc['3s']  # returns the value at '00:00:03' (i.e. sample 30)
assert ts.loc['3s'] == 30  # indeed

# Now I want to get all samples until at '00:00:03' 
ts.loc[:'3s']  # this returns all values until '00:00:03.90' (i.e. sample 39)
assert ts.loc[:'3s'][-1] == 30  # this fails, because the last element is not 30 but 39

df.loc[:'3.000s']  # this again returns all values until '00:00:03.90'
assert ts.loc[:'3.000s'][-1] == 30  # fails, again

df.loc[:'3.001s']  # this instead returns all values until '00:00:03'
assert ts.loc[:'3.001s'][-1] == 30  # success!

# The paradox: selecting until '3.000s' returns more than selecting until '3.001s' (!)
len(ts.loc[:'3.000s']) > len(ts.loc[:'3.001s'])  # True

# Using `pandas.Timedelta` objects solves the ambiguity
ts.loc[:pd.Timedelta('3s')]  # returns all values until '00:00:03'
ts.loc[:pd.Timedelta('3s')][-1] == 30  # True

This has to do with the resolution parsed from the timedelta string. Maybe for timedelta indices it would make more sense to always use the resolution of the index? Or provide an alternative implementation (e.g. FixedResolutionTimedeltaIndex) allowing for this use case?

pbotros commented 4 years ago

Sorry for the delay here - revisiting this. I agree with @mattbit that while it's the intended behavior, it's not very intuitive. I'd say that always using the resolution of the index makes sense - presumingly that's how non-range indexing (e.g. ts.loc['3s'] in the above example) works. Unfortunately though, this does change the behavior for existing implementations.

@mroeschke thoughts on the gravity of changing this behavior? Alternatively @mattbit's idea of a new type of Index can make the transition opt-in, but then that's one more class that people will have to think about going forward.

mroeschke commented 4 years ago

I'd be -1 to adding an entirely separate FixedResolutionTimedeltaIndex.

I'd be +0 to changing the behavior of partial string indexing. Would probably need more buy in from other core maintainers before deprecating and changing this behavior.

pbotros commented 4 years ago

Sounds good - thanks @mroeschke . Any particular core maintainers you have in mind and could add to this issue (or are they notified already)?

mroeschke commented 4 years ago

@pandas-dev/pandas-core for thoughts about changing partial string indexing for TimedeltaIndex

jorisvandenbossche commented 4 years ago

I think it would be helpful for the discussion if someone could write up some small examples with the current behaviour and proposed future behaviour with rationale for the change, and how it compares to how it works without string parsing (actual Timedelta objects) / with datetimeindex partial string indexing. (eg the examples from https://github.com/pandas-dev/pandas/issues/33603#issuecomment-620703496 might be a good start, but with including output and proposed behaviour, it might be easier to understand what this is about without requiring everyone to run the code themselves).

abrammer commented 3 years ago

I think I came just across this issue, that as above provides very counter intuitive results. Reproducible example adapted straight from the docs on slicing with partial timedelta strings. I guess the solution is to not use strings, and rather use Timedelta objects (even if they're created from the same string)

edit, I see this is pretty much the same as the above comment. so 12 months later, still confusing people.

import numpy as np
import pandas as pd
s = pd.Series(
        np.arange(100),
        index=pd.timedelta_range("1 days", periods=100, freq="h"),
)

def compare_indexing(end_time):
    print(end_time)
    print(len(s[:end_time]))
    print(sum(s.index <= pd.Timedelta(end_time)))
    assert len(s[:end_time]) == sum(s.index <= pd.Timedelta(end_time))

end_time = "1d 23 hours"
compare_indexing(end_time)

end_time = "2d 0.1 hours"  # some how works...
compare_indexing(end_time) 

end_time = "2d 0 hours"
compare_indexing(end_time)

for my own curiosity I dug into this a little bit.
From https://github.com/pandas-dev/pandas/blob/2cb96529396d93b46abab7bbc73a208e708c642e/pandas/core/indexes/timedeltas.py#L220 the parsed resolution string seems to be the cause for confusion.

parsed = Timedelta(label)
print('resolution:', parsed.resolution_string)

47H
resolution: H
49H
resolution: H
48H 
resolution: D

The resolution is evaluated against the value being 0 or not, so not sure how that'd be "fixed", https://github.com/pandas-dev/pandas/blob/2cb96529396d93b46abab7bbc73a208e708c642e/pandas/_libs/tslibs/timedeltas.pyx#L953

jbrockmendel commented 2 years ago

I think whats happening here is that we are not actually getting the resolution of the string, just the Timedelta constructed from it. By contrast with DatetimeIndex, the parsing code also returns information about the string's specificity.

randolf-scholz commented 1 year ago

The current behaviour seems self-inconsistent:

from datetime import timedelta
import numpy as np
import pandas as pd

# create 24h range with 1 min spacing
td = pd.timedelta_range(start="0h", end="24h", freq="1min").to_series()

# last timedelta included in slice
print(td[: timedelta(hours=1)].iloc[-1])          # 01:00
print(td[: np.timedelta64(1, "h")].iloc[-1])      # 01:00
print(td[: pd.Timedelta(1, "h")].iloc[-1])        # 01:00
print(td[: pd.Timedelta("1h")].iloc[-1])          # 01:00
print(td[:"1h"].iloc[-1])                         # 01:59 ✘

# with loc
print(td.loc[: timedelta(hours=1)].iloc[-1])      # 01:00
print(td.loc[: np.timedelta64(1, "h")].iloc[-1])  # 01:00
print(td.loc[: pd.Timedelta(1, "h")].iloc[-1])    # 01:00
print(td.loc[: pd.Timedelta("1h")].iloc[-1])      # 01:00
print(td.loc[:"1h"].iloc[-1])                     # 01:59 ✘

One would expect that if string values are passed they should simply be cast via pd.Timedelta.


@jorisvandenbossche The documentation tries to give some rationale for the behaviour. (or here)

In [109]: s["1 day":"2 day"]
Out[109]: 
1 days 00:00:00     0
1 days 01:00:00     1
1 days 02:00:00     2
1 days 03:00:00     3
1 days 04:00:00     4
                   ..
2 days 19:00:00    43
2 days 20:00:00    44
2 days 21:00:00    45
2 days 22:00:00    46
2 days 23:00:00    47
Freq: H, Length: 48, dtype: int64

And it is kind of neat that one can write something like dt["1 day":"1 day"] to get all values that belong to "day 1". But that is also kind of the issue, because here the lower bound and the upper bound are effectively translated to different values.

Another detail is that the slicing with a datetime.timedelta, pd.Timedelta and np.timedelta64 gives a an inclusive upper bound, which makes it difficult to select day ranges without getting the "0:00" value of the following day.

There is probably a good reason for this behaviour but the design decisions for this are likely deeply buried somewhere. In principle I think it would make more sense if time ranges were half-open intervals, just like it is with ints and floats.

deepers commented 5 months ago

I just came across this issue via xarray's DataArray.sel(), and I agree that it is counter-intuitive, especially that strings and their corresponding Timedelta objects produce different results. I think that Timedeltas are different to Timestamps, in that they represent an unambiguous time interval, so selecting by a slice should return all index elements (inclusive) between the two Timedeltas. This is the current behavior for Timedeltas but not strings.