pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

DOC: `DataFrame.reindex` columns filling #40690

Open markoshorro opened 3 years ago

markoshorro commented 3 years ago

Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Caladan', 'Corrin', 'Ix'])
"""
   Caladan  Corrin  Ix
a        0       1   2
c        3       4   5
d        6       7   8
"""

frame.reindex(index=['a', 'b', 'c', 'd'], columns=['Caladan', 'CZ', 'Ix'], method='ffill')

"""
comment: this is OK
   Caladan  CZ  Ix
a        0 NaN   2
b        0 NaN   2
c        3 NaN   5
d        6 NaN   8
"""

frame.reindex(index=['a', 'b', 'c', 'd'], columns=['Caladan', 'DZ', 'Ix'], method='ffill')
"""
comment: OK this is weird...
   Caladan  DZ  Ix
a        0   1   2
b        0   1   2
c        3   4   5
d        6   7   8
"""

frame.reindex(index=['a', 'b', 'c', 'd'], columns=['Caladan', 'TZ', 'Ix'], method='ffill')
"""
comment: wat?
   Caladan  TZ  Ix
a        0   2   2
b        0   2   2
c        3   5   5
d        6   8   8
"""

Problem description

When DataFrame.reindex is called with new column labels and a fill method, a column whose name does not exist in the original frame should not be filled, yet it is. Moreover, the result differs depending on the name of the new column, which is even stranger.

Expected Output

The new column is not filled, and the behavior is the same regardless of the name of the new column(s).

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 7d32926db8f7541c356066dcadabf854487738de
python           : 3.8.5.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.4.72-microsoft-standard-WSL2
Version          : #1 SMP Wed Oct 28 23:40:43 UTC 2020
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.2.3
numpy            : 1.20.1
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 45.2.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.3
IPython          : 7.13.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.4
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.6.1
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

mzeitlin11 commented 3 years ago

Thanks for the report, @markoshorro. What's happening is somewhat strange, but I believe it is intended (though it definitely could be better documented). The fill method is applied along the columns as well as the rows, so a new column label is filled from the nearest existing label that sorts before it. For the first case, CZ < Caladan < Corrin < Ix, so with forward filling specified, CZ becomes a NaN column since there is nothing before it to forward fill from. For the second case, Caladan < Corrin < DZ < Ix, so the values from Corrin are forward filled into DZ. For the final case, Caladan < Corrin < Ix < TZ, so the values from Ix are forward filled into TZ.
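To make the ordering argument concrete, here is a minimal pure-Python sketch of which existing label a forward fill would copy from. The helper `ffill_source` is hypothetical and written only for illustration; it is not pandas' internal implementation. It assumes plain Python string comparison, under which uppercase letters sort before lowercase ones (which is why CZ sorts before Caladan).

```python
from bisect import bisect_right

# Existing column labels of the frame, already in sorted (monotonic) order.
existing = ["Caladan", "Corrin", "Ix"]


def ffill_source(label, labels=existing):
    """Return the existing label a forward fill would copy from, or None.

    Hypothetical helper for illustration only; not pandas' own code.
    """
    # Index of the last existing label that is <= the requested label.
    pos = bisect_right(labels, label) - 1
    return labels[pos] if pos >= 0 else None


for new in ["CZ", "DZ", "TZ"]:
    print(new, "->", ffill_source(new))
```

Under these assumptions, CZ has no source label (so the column is NaN), DZ takes its values from Corrin, and TZ takes its values from Ix, matching the three outputs in the report.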

markoshorro commented 3 years ago

Thanks, @mzeitlin11, for your prompt and correct response. (Just for clarity, I believe you have a small typo in the "(...) second case, Caladan < Corrin < DZ < Ix, (...)" instead of "CZ"; just a minor issue :-)).

Now I understand. My first thought was that `method` in `reindex` should fill gaps in new rows, but only for existing columns. The documentation describes the `method` parameter as:

Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.

So I assumed that gaps in new columns would not be affected. It could just be my misunderstanding, but I believe some further clarification would be helpful for users. I will prepare a small pull request addressing this issue.
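For readers who want the behavior described under Expected Output, one possible workaround is to split the call in two: reindex the rows with a fill method, then reindex the columns without one. This is only a sketch and was not proposed in this thread; the variable names `rows_filled` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["Caladan", "Corrin", "Ix"],
)

# Step 1: forward fill along the rows only, so the new row "b" copies row "a".
rows_filled = frame.reindex(index=["a", "b", "c", "d"], method="ffill")

# Step 2: reindex the columns without a fill method, so unknown labels stay NaN.
result = rows_filled.reindex(columns=["Caladan", "DZ", "Ix"])
print(result)
```

With this split, the new DZ column stays NaN whatever its name is, while the existing columns are still forward filled into row b.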

mzeitlin11 commented 3 years ago

> Thanks, @mzeitlin11, for your prompt and correct response. (Just for clarity, I believe you have a small typo in the "(...) second case, Caladan < Corrin < DZ < Ix, (...)" instead of "CZ"; just a minor issue :-)).

Will edit, thanks!

Yep, a pull request to improve the documentation or add an example would be very welcome (this is definitely confusing, and I'm not even 100% sure this behavior is what it should be).

markoshorro commented 3 years ago

I will dig a bit before writing anything based on your examples, but I will definitely improve the current documentation to clarify these cases.