pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

reindex-like has inconsistent behaviour and exceptions if columns don't match #31002

Open sanderland opened 4 years ago

sanderland commented 4 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd

df_1 = pd.DataFrame(data=[1, 2, 3], columns=['a'])
df_2 = pd.DataFrame(data=[4, 5, 6], index=[0.5, 1.5, 2.5], columns=['b'])
df_2.reindex_like(df_1, method="bfill")    # backfills as if columns matched
df_2.reindex_like(df_1, method="ffill")    # all NaN
df_2.reindex_like(df_1, method="nearest")  # raises an exception

Problem description

The different fill methods appear to treat non-matching columns differently. I came across this when reindexing one single-column DataFrame like another whose column name didn't match (the names were essentially irrelevant).

Workaround

Before reindexing, set df_2.columns = df_1.columns.
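
A minimal sketch of this workaround, reusing the frames from the code sample above. The name `df_2_aligned` is only illustrative, and the copy is an added precaution so that the original `df_2` is left untouched:

```python
import pandas as pd

df_1 = pd.DataFrame(data=[1, 2, 3], columns=["a"])
df_2 = pd.DataFrame(data=[4, 5, 6], index=[0.5, 1.5, 2.5], columns=["b"])

# Work on a copy so the original df_2 keeps its own column name.
df_2_aligned = df_2.copy()
df_2_aligned.columns = df_1.columns

# With matching column labels, only the row index needs filling.
backfilled = df_2_aligned.reindex_like(df_1, method="bfill")
```

Once the labels match, the fill method only has to act on the row index, so the result no longer depends on how the column names happen to sort.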

Expected Output

The documentation says:

Its row and column indices are used to define the new indices of this object.

Whether this means column name or column position is not entirely clear, but either way the current behaviour is inconsistent. I would prefer all of the methods to work like bfill.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.0.0-37-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.3
numpy : 1.17.0
pytz : 2018.9
dateutil : 2.8.0
pip : 19.0.3
setuptools : 41.0.1
Cython : 0.29.6
pytest : 4.3.1
hypothesis : None
sphinx : 1.8.5
blosc : None
feather : None
xlsxwriter : 1.1.5
lxml.etree : 4.3.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 7.4.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.3.2
matplotlib : 3.0.3
numexpr : 2.6.9
odfpy : None
openpyxl : 2.6.1
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : 1.3.1
tables : 3.5.1
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.5

sanderland commented 4 years ago

Tracked it down to _reindex_axes in frame.py: the same fill method is used for reindexing both the index and the columns, so the column labels are also matched by their string ordering. With bfill, the target column 'a' is filled from the following label 'b' (reversing the column names in the example makes bfill fail and ffill succeed). Either way, the nearest method raises an exception.

    def _reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy):
        frame = self

        columns = axes["columns"]
        if columns is not None:
            frame = frame._reindex_columns(
                columns, method, copy, level, fill_value, limit, tolerance
            )

        index = axes["index"]
        if index is not None:
            frame = frame._reindex_index(
                index, method, copy, level, fill_value, limit, tolerance
            )

        return frame
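
The ordering effect can be reproduced on the column labels alone. This is a sketch under the assumption that column reindexing ends up resolving labels through `Index.get_indexer`; the variable names are illustrative, and `dtype=object` is set explicitly so the labels are compared as plain Python strings:

```python
import pandas as pd

# Column labels of df_2 (source) and df_1 (target) from the example.
source_columns = pd.Index(["b"], dtype=object)
target_columns = pd.Index(["a"], dtype=object)

# bfill looks for the first source label >= "a", which is "b" at position 0.
bfill_positions = source_columns.get_indexer(target_columns, method="bfill")

# ffill looks for the last source label <= "a"; there is none, so -1 (missing).
ffill_positions = source_columns.get_indexer(target_columns, method="ffill")

# nearest needs a distance between labels; record whatever error is raised, if any.
try:
    source_columns.get_indexer(target_columns, method="nearest")
    nearest_error = None
except Exception as exc:
    nearest_error = type(exc).__name__
```

A position of 0 means column 'a' is taken from 'b', while -1 means no match, which lines up with the bfill and ffill results reported in the example.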
Dr-Irv commented 4 years ago

PR is welcome. Issue here seems to be twofold:

  1. The docs aren't clear that the method argument really only makes sense when applied to the index, not the columns, and that it will only be used on columns whose names match.
  2. There is a bug with bfill and nearest when the column names don't match: we return a result with backfilled values (or raise, in the case of nearest), but there is no data to backfill with, so we should just return a Series of NaN.
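
One way to picture the behaviour described in these two points is to apply the fill method to the row index only and then select columns strictly by name. This is only a sketch of the expected result, not the proposed fix; `reindex_rows_then_columns` is a hypothetical helper and not part of pandas:

```python
import pandas as pd

df_1 = pd.DataFrame(data=[1, 2, 3], columns=["a"])
df_2 = pd.DataFrame(data=[4, 5, 6], index=[0.5, 1.5, 2.5], columns=["b"])


def reindex_rows_then_columns(source, other, method):
    """Hypothetical helper: fill along the index only, then match columns by name."""
    # The fill method is applied to the row labels only.
    rows_filled = source.reindex(index=other.index, method=method)
    # Columns are matched by exact name, so unmatched names become NaN.
    return rows_filled.reindex(columns=other.columns)


results = {
    method: reindex_rows_then_columns(df_2, df_1, method)
    for method in ("bfill", "ffill", "nearest")
}
```

With the example frames, every method then yields a single column 'a' containing only NaN, because df_2 has no column of that name.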
sqali commented 1 year ago

Hi @Dr-Irv ,

I am new to open source and would love to work on this issue.

sqali commented 1 year ago

take

Dr-Irv commented 1 year ago

It's now all yours!