pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.8k stars 17.98k forks

BUG: read_csv does not comply with the to_csv format (no support of df.columns.names) #49862

Open GergelyMincsovicsPhilips opened 1 year ago

GergelyMincsovicsPhilips commented 1 year ago

Pandas version checks

Reproducible Example

import pandas as pd
mi1 = pd.MultiIndex.from_arrays([[1, 2]], names=['s'])
mi3 = pd.MultiIndex.from_arrays([[1, 2], [3, 4], [5, 6]], names=['x', 'y', 'z'])
df = pd.DataFrame(index=mi3, columns=mi1)
df.to_csv("data.csv")
df2 = pd.read_csv("data.csv", index_col=[0, 1, 2], header=[0])
print(df)
print(df2)

Issue Description

read_csv does not support df.columns.names, so df != df2.

Expected Behavior

df==df2

Installed Versions

python3\Lib\site-packages\_distutils_hack\__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit : 91111fd99898d9dcaa6bf6bedb662db4108da6e6
python : 3.11.0.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 141 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 1.5.1
numpy : 1.23.4
pytz : 2022.6
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

topper-123 commented 1 year ago

In the line pd.read_csv("data.csv", index_col=[0,1,2], header=[0]) pandas has no way to know that the second line should be the index names instead of values in the index.
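To make the point above concrete, here is a minimal sketch that prints what to_csv writes for the frame from the report and then reads it back with the same read_csv arguments. It uses an in-memory string instead of data.csv purely for convenience; nothing else is changed from the reproducible example.

```python
import io

import pandas as pd

# The frame from the report: a three-level index and one named column level.
mi1 = pd.MultiIndex.from_arrays([[1, 2]], names=["s"])
mi3 = pd.MultiIndex.from_arrays([[1, 2], [3, 4], [5, 6]], names=["x", "y", "z"])
df = pd.DataFrame(index=mi3, columns=mi1)

# With no path argument, to_csv returns the CSV text as a string.
text = df.to_csv()
print(text)
# s,,,1,2
# x,y,z,,
# 1,3,5,,
# 2,4,6,,

# Same arguments as in the report: one header line, three index columns.
df2 = pd.read_csv(io.StringIO(text), index_col=[0, 1, 2], header=[0])

# The "x,y,z,," line is parsed as a data row, so df2 has one row more than df
# and its index names are not the original ['x', 'y', 'z'].
print(len(df), len(df2))
print(list(df2.index.names))
```

The first line carries the column level name "s" followed by the column labels, and the second line carries the index names. With header=[0] only the first line is treated as a header, which is why the second line ends up in the index values.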

Having said that, maybe there could be an index_col_names parameter that you'd set to [1] to get the names in this case. That should probably imply that the rest of that line must be empty, so we don't lose information, e.g.:

x,y,z,,

would give the index names "x", "y" & "z", while

x,y,z,1,3

should probably raise, because there would be no way to place the values 1 & 3.
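Until something like index_col_names exists, the behavior described above can be approximated by hand. The following is only a sketch under stated assumptions: read_csv_restoring_names is a hypothetical helper (not a pandas API), it assumes exactly one column-header line followed by one index-name line, and it takes the CSV text as a string so it can be parsed twice.

```python
import io

import pandas as pd


def read_csv_restoring_names(text, n_index):
    """Hypothetical helper, not a pandas API: rebuild column and index names."""
    # Read the two header lines as raw strings; na_filter=False keeps "" as "".
    head = pd.read_csv(
        io.StringIO(text), header=None, nrows=2, dtype=str, na_filter=False
    )
    names_row = head.iloc[1]
    # Mirror the rule proposed above: cells after the index names must be empty.
    if (names_row.iloc[n_index:] != "").any():
        raise ValueError("index-name line has values outside the index columns")
    # Parse only the data rows, using the first n_index fields as the index.
    body = pd.read_csv(
        io.StringIO(text), header=None, skiprows=2, index_col=list(range(n_index))
    )
    body.index.names = names_row.iloc[:n_index].tolist()
    # Column labels come back as strings; their original type is not in the file.
    body.columns = pd.MultiIndex.from_arrays(
        [head.iloc[0, n_index:].tolist()], names=[head.iloc[0, 0]]
    )
    return body


# Usage with the frame from the report.
mi1 = pd.MultiIndex.from_arrays([[1, 2]], names=["s"])
mi3 = pd.MultiIndex.from_arrays([[1, 2], [3, 4], [5, 6]], names=["x", "y", "z"])
df = pd.DataFrame(index=mi3, columns=mi1)

df2 = read_csv_restoring_names(df.to_csv(), n_index=3)
print(list(df2.index.names), list(df2.columns.names))  # ['x', 'y', 'z'] ['s']
```

This restores the names but not a full round trip: the column labels are the strings "1" and "2" rather than integers, and the empty columns are read back as float NaN, so df.equals(df2) is still not expected to hold.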

GergelyMincsovicsPhilips commented 1 year ago

Yes, indeed. I expect that there should be some way of restoring a data frame written out using to_csv.

topper-123 commented 1 year ago

I think this could be a good addition to the read_csv function. Are you up for making a PR on this?

GergelyMincsovicsPhilips commented 1 year ago

Isn't this a PR? I'm not sure what additional info I could provide.

topper-123 commented 1 year ago

Hey, no, this is an issue (i.e. where bugs, enhancements etc. are discussed before a PR is submitted).

A PR (Pull Request) is actually submitted code to fix an issue. See the pane "Pull requests" for all open PRs.

GergelyMincsovicsPhilips commented 1 year ago

Oh OK, I thought it meant "problem report". Well, not in the short term, but I could help with testing.