pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.8k stars 17.98k forks source link

wide_to_long with NaNs in i argument #31976

Open MalteMax opened 4 years ago

MalteMax commented 4 years ago

Code and output:

import numpy as np
import pandas as pd
df1 = pd.DataFrame({'id' : ['a', np.NaN, 'c'],
                   'A_1999' : [1, 2, 3],
                   'A_2000' : [5, 6, 7]})

df1
    id  A_1999  A_2000
0    a       1       5
1  NaN       2       6
2    c       3       7

df2 = pd.wide_to_long(df1,
                      stubnames = ['A'],
                      sep = '_',
                      i = 'id',
                      j = 'year').reset_index()

df2
  id  year  A
0  a  1999  1
1  a  1999  2
2  c  1999  3
3  a  2000  5
4  a  2000  6
5  c  2000  7

Problem description

df1 has one NaN in the id column. When applying wide_to_long to df1, wide_to_long seems to fill the NaNs from df1 with the preceding value from the df1 dataframe (namely, a).

Why might this be a problem: instances of id == 'a' in df1 are not the same as in df2 (because the NaNs from df1 have been replaced by a in df2):

df1.loc[df1['id'] == 'a',]

  id  A_1999  A_2000
0  a       1       5

versus:

df2.loc[df2['id'] == 'a',]

  id  year  A
0  a  1999  1
1  a  1999  2
3  a  2000  5
4  a  2000  6

Expected Output

I would expect wide_to_long to either

  1. throw an error that the column that is passed to its i argument contains NaNs, or
  2. preserve the NaN case in the second dataframe (not sure whether this is desirable), i.e.:
    id  year  A
0    a  1999  1
1  NaN  1999  2
2    c  1999  3
3    a  2000  5
4  NaN  2000  6
5    c  2000  7

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : None pandas : 1.0.1 numpy : 1.18.1 pytz : 2019.3 dateutil : 2.8.1 pip : 20.0.2 setuptools : 45.2.0.post20200210 Cython : 0.29.15 pytest : 5.3.5 hypothesis : 5.4.1 sphinx : 1.8.5 blosc : None feather : None xlsxwriter : 1.2.7 lxml.etree : 4.5.0 html5lib : 1.0.1 pymysql : None psycopg2 : 2.8.3 (dt dec pq3 ext lo64) jinja2 : 2.11.1 IPython : 7.12.0 pandas_datareader: None bs4 : 4.8.2 bottleneck : 1.3.1 fastparquet : None gcsfs : None lxml.etree : 4.5.0 matplotlib : 3.1.3 numexpr : 2.7.1 odfpy : None openpyxl : 3.0.3 pandas_gbq : None pyarrow : None pytables : None pytest : 5.3.5 pyxlsb : None s3fs : None scipy : 1.4.1 sqlalchemy : 1.3.13 tables : 3.6.1 tabulate : None xarray : None xlrd : 1.2.0 xlwt : 1.3.0 xlsxwriter : 1.2.7 numba : 0.43.1
ljluestc commented 4 weeks ago

import numpy as np
import pandas as pd

# Create the original DataFrame with NaN
df1 = pd.DataFrame({'id': ['a', np.NaN, 'c'],
                    'A_1999': [1, 2, 3],
                    'A_2000': [5, 6, 7]})

# Store the original NaN ids for reference
nan_ids = df1[df1['id'].isna()]

# Apply wide_to_long transformation
df2 = pd.wide_to_long(df1,
                      stubnames=['A'],
                      sep='_',
                      i='id',
                      j='year').reset_index()

# Restore NaNs in the id column of df2 where they were originally in df1
for index, row in nan_ids.iterrows():
    # Find the corresponding rows in df2 to set id back to NaN
    df2.loc[(df2['year'] == 1999) & (df2['A'] == row['A_1999']), 'id'] = np.NaN
    df2.loc[(df2['year'] == 2000) & (df2['A'] == row['A_2000']), 'id'] = np.NaN

# Display the transformed DataFrame
print(df2)