pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.73k stars 17.94k forks source link

to_string() not easily reversible for multi-index DataFrames #25570

Open randolf-scholz opened 5 years ago

randolf-scholz commented 5 years ago

Code Sample, a copy-pastable example if possible

import pandas
import numpy as np
from itertools import product
df1 = pandas.DataFrame(product(['a', 'b'], range(3)), columns=['idx1', 'idx2'])
df2 = pandas.DataFrame(np.random.rand(6, 2), columns=['col1', 'col2'])
df  = pandas.concat([df1, df2], axis=1)
df.set_index(['idx1','idx2'], inplace=True)
with open('test.dat', 'w') as file:
    file.write(df.to_string())   
df_read = pandas.read_csv('test.dat')
print(df.values)
print(df_read.values)
df.equals(df_read)  # -> False

Problem description

I wanted to save a pandas DataFrame in human readable form, that is, as a text file with nice vertical alignment. The to_string function achieves precisely this, whereas to_csv does not.

I have saved data like this before, and it works just fine when one does not save the index. In this case it can be loaded via pandas.read_csv(file, sep=r'\s+'). I tried using the index_cols and header parameters but nothing seems to work properly.

I also made a StackExchange thread.

Expected Output

True

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.8.final.0 python-bits: 64 OS: Linux OS-release: 4.18.0-15-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.24.1 pytest: None pip: 19.0.3 setuptools: 40.8.0 Cython: None numpy: 1.16.2 scipy: 1.2.1 pyarrow: None xarray: None IPython: 7.3.0 sphinx: 1.8.4 patsy: 0.5.1 dateutil: 2.8.0 pytz: 2018.9 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: 3.0.2 openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml.etree: None bs4: None html5lib: None sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None gcsfs: None
TomAugspurger commented 5 years ago

What precise changes to the output (or reader?) are you proposing?

FWIW, I don't think that faithful round-tripping is a high priority for to_string. If that's your goal, there are plenty of better options.

randolf-scholz commented 5 years ago

@TomAugspurger

My main goal is to save a multi-index DataFrame in human readable form and be able to load it again.

TomAugspurger commented 5 years ago

That's a difficult task :) What exact changes are you proposing?

On Wed, Mar 6, 2019 at 9:14 AM randolf-scholz notifications@github.com wrote:

@TomAugspurger https://github.com/TomAugspurger

My main goal is to save a multi-index DataFrame in human readable form and be able to load it again.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/25570#issuecomment-470144790, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIiP0zs5kG1ZDfHfFroX9ovYhoREeks5vT9tTgaJpZM4bhD6_ .

randolf-scholz commented 5 years ago

That's a difficult task :) What exact changes are you proposing? On Wed, Mar 6, 2019 at 9:14 AM randolf-scholz @.***> wrote: @TomAugspurger https://github.com/TomAugspurger My main goal is to save a multi-index DataFrame in human readable form and be able to load it again. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#25570 (comment)>, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIiP0zs5kG1ZDfHfFroX9ovYhoREeks5vT9tTgaJpZM4bhD6_ .

To be honest I think that a faithful reverse to to_string is probably the best option. And the task should not be as difficult because all the information needed to reconstruct the DataFrame is contained in the string representation. In fact to me it seems all that one needs to do is

  1. Compute the column widths by analyzing the header
  2. Extract columns corresponding to the data & the index seperately
  3. Refill missing entries in the index

Which is essentially what is proposed in this answer: https://stackoverflow.com/a/55024872/9318372

WillAyd commented 5 years ago

Would #10415 fit your needs instead?

chris-b1 commented 5 years ago

Some previous discussion buried in this thread here - idea of a read_repr https://github.com/pandas-dev/pandas/issues/8323#issuecomment-56278302

randolf-scholz commented 5 years ago

Thanks @WillAyd and @chris-b1 for the suggestions. I am using the following script now to read the files:

df = pandas.read_fwf(f, header=[0,1])
cols = [x for x,_ in df.columns if 'Unnamed' not in x]
idxs = [y for _,y in df.columns if 'Unnamed' not in y]
df.columns = idxs + cols
df[idxs] = df[idxs].ffill()
df.set_index(idxs, inplace=True)

I do believe strongly though that the ability to read and write tables in human readable form should be a core-functionality of a module like pandas.

WillAyd commented 5 years ago

Makes sense. You could also specify index_col=[0, 1] and swap things around thereafter (might be easier).

I would side with @TomAugspurger assessment of the priority here as I (perhaps mistakenly) couldn't see this being useful outside of very small DataFrames when compared to the slew of other methods that exist.

With that said this is open source so let's see what others think. You are of course always welcome to submit a PR if you see an easy and scalable way to make it work