pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: KeyError from unexpected DatetimeIndex partial str indexing #40357

Open Alpima-Quant opened 3 years ago

Alpima-Quant commented 3 years ago

Code Sample, a copy-pastable example

import pandas as pd
df = pd.DataFrame({'a': [0, 1]}, index=[pd.Timestamp('2002-12-30T20'), pd.Timestamp('2003-01-03T20')])
df['T 1.75 1/3'] = [1, 2]

Problem description

When using __setitem__ or __getitem__ with a str key on a pandas DataFrame where the index has ._supports_partial_string_indexing = True, pandas first tries to convert the str key to a slice using index._get_string_slice(key).

It seems to me that pandas interprets too loosely what counts as one of these partial string slices (or perhaps when and how one might use them).

In my use case I have time series data for some US treasury bills, where the names look like f"T {coupon} {date}", e.g. "T 1.75 1/3", and I want to assign a new column to a DataFrame holding data on this instrument. However, df['T 1.75 1/3'] = value, as in the example above, raises "ValueError: cannot set using a slice indexer with a different length than the value".
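Until the underlying behaviour changes, a possible workaround is to avoid the bare `df[key] = value` path and name the column axis explicitly. The sketch below is not from the original report; it assumes that `.loc[:, label]` and `DataFrame.assign` look the label up on the columns only, so the string is never tried as a partial date slice on the rows.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [0, 1]},
    index=[pd.Timestamp("2002-12-30T20"), pd.Timestamp("2003-01-03T20")],
)
label = "T 1.75 1/3"

# Option 1: .loc with an explicit row slice, so the label can only refer to a column.
via_loc = df.copy()
via_loc.loc[:, label] = [1, 2]

# Option 2: assign returns a new DataFrame with the extra column appended.
via_assign = df.assign(**{label: [1, 2]})

print(via_loc.columns.tolist())
print(via_assign.columns.tolist())
```

Both options leave the original `df` untouched, which makes it easy to compare the result against the failing `df[label] = [1, 2]` call.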

Expected Output

A new column with label "T 1.75 1/3" and values [1, 2] is assigned to the pandas DataFrame df.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : f2c8480af2f25efdbd803218b9d87980f416563e
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.128-microsoft-standard
Version : #1 SMP Tue Jun 23 12:58:10 UTC 2020
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.3
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 53.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.21.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

rhshadrach commented 3 years ago

Thanks for the report! Doing this on master, one also gets the message:

FutureWarning: Indexing a DataFrame with a datetimelike index using a single string to slice the rows, like frame[string], is deprecated and will be removed in a future version. Use frame.loc[string] instead.

It appears that once this is removed, this issue will be resolved as well.
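For reference, the split that the warning points towards can be sketched as follows: row selection by partial date string goes through `.loc`, while plain `frame[key]` is left for column labels. This is an illustrative example using the DataFrame from the report, not code from the pandas source.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [0, 1]},
    index=[pd.Timestamp("2002-12-30T20"), pd.Timestamp("2003-01-03T20")],
)

# Partial string indexing on the rows: every timestamp that falls in 2003.
rows_2003 = df.loc["2003"]

# Plain column lookup: 'a' is an existing column label.
column_a = df["a"]

print(rows_2003)
print(column_a.tolist())
```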

jbrockmendel commented 3 years ago

Part of this is also that the Timestamp parser (via the dateutil parser) is interpreting 'T 1.75 1/3' as a datetime when it probably shouldn't:

>>> pd.Timestamp('T 1.75 1/3')
Timestamp('2003-01-01 00:00:00')
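To check ahead of time whether a column name is at risk of being read as a date, one option is a small guard that tries the same `pd.Timestamp` constructor. The helper name `parses_as_timestamp` is hypothetical and not part of pandas, and the result for a borderline label such as 'T 1.75 1/3' may differ between pandas and dateutil versions, so no output is claimed for it here.

```python
import pandas as pd


def parses_as_timestamp(label: str) -> bool:
    """Return True if pd.Timestamp accepts `label` as a date string.

    Hypothetical helper for illustration; it only reports what the
    installed pandas/dateutil versions do with the given string.
    """
    try:
        pd.Timestamp(label)
    except (ValueError, TypeError):
        # pandas raises ValueError (or a subclass) for strings it cannot parse.
        return False
    return True


for candidate in ["2003-01-03", "closing_price", "T 1.75 1/3"]:
    print(repr(candidate), parses_as_timestamp(candidate))
```

Labels for which this returns True are the ones that could be mistaken for a partial date slice on a DataFrame with a DatetimeIndex.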