pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.71k stars 17.92k forks source link

Enabling auto-collapsing of Excel dumped from df #14083

Closed tkauffmann closed 1 year ago

tkauffmann commented 8 years ago

Dear all,

In my job, we use a lot of Pandas (though not always very professionally). One very useful case is to build a dataframe conatining data from several instruments for several items. I tend to use a lot of multi-indexing columns. In order to view (and modify) the columns, we like to save the dataframe in an Excel file, look at the data and possibly manually modify some values.

But with a lot of columns (espcially mult-indexed) we tend to have very big Excel files, that are not as readable as they should be. That is why I like to group some columns and enable the possibility to collapse them (and un-collapse them if I want to look at them).

I did not find a way to do it in Pandas ; but there it is possible in xlsxwriter. So I developed a short solution to automatically group all columns from the same multiindex column. If you guys think it is interesting to add the option to Pandas I'd be happy to try to let my work be useful to others -I could try to add it to Pandas though I'm new to GitHub. What do you guys think? Feel free to say no :)

Below is a toy example, and attached are the non-grouped and the grouped columns Excel sheets. Please don't be mad at the code, it's really quick&dirty way.

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
from openpyxl.cell import get_column_letter
import copy

def save_df(df,ofn='output.xlsx'):
    writer = pd.ExcelWriter(ofn, engine='xlsxwriter')
    df.to_excel(writer, 'Sheet1')
    writer.save()

def save_df_collapsing(df, ofn='output.xlsx'):
    writer = pd.ExcelWriter(ofn, engine='xlsxwriter')
    df.to_excel(writer, 'Sheet1')
    worksheet = writer.sheets['Sheet1']

    col_mins = []
    col_maxs = []

    for ind0 in df.columns.levels[0]:
        slice_ind = df.columns.get_loc(ind0)
        try:
            slice_ind = slice_ind.indices(100000)
            col_min = slice_ind[0] + 2
            col_max = slice_ind[1]
        except:
            col_min = np.min(np.nonzero(slice_ind)) + 2
            col_max = np.max(np.nonzero(slice_ind))
        col_mins.append(col_min)
        col_maxs.append(col_max)
        span = '%s:%s' % (get_column_letter(col_min), get_column_letter(col_max))
        worksheet.set_column(span, None, None, {'level': 1})

        temp = copy.copy(df[ind0])
        ind1s = np.unique([ind1[0] for ind1 in temp.columns])
        for ind1 in ind1s:
            slice_ind1 = df.columns.get_loc((ind0, ind1)).indices(100000)

            col_min1 = slice_ind1[0] + 2
            if col_min1 in col_mins:
                col_min1 += 1
            col_max1 = slice_ind1[1]
            if col_max1 in col_maxs:
                col_max1 -= 1
            # worksheet.set_column(col_min,col_max, None, None, {'level': 2})
            span1 = '%s:%s' % (get_column_letter(col_min1), get_column_letter(col_max1))
            worksheet.set_column(span1, None, None, {'level': 2, 'hidden': True})

    writer.save()

temperatures = np.arange(100)
columns = pd.MultiIndex.from_product([['Absolute','Relative'],['ValueA', 'ValueB', 'ValueC'], temperatures])
df = pd.DataFrame(columns=columns)
df.loc['Item1', ('Relative','ValueA')] = np.arange(100)
df.loc['Item1', ('Relative','ValueB')] = 2 * np.arange(100)
df.loc['Item1', ('Relative','ValueC')] = 3 * np.arange(100)

df.loc['Item1', ('Absolute','ValueA')] = np.arange(100) + 25
df.loc['Item1', ('Absolute','ValueB')] = 2 * np.arange(100) + 25
df.loc['Item1', ('Absolute','ValueC')] = 3 * np.arange(100) + 25

df.loc['Item2', ('Relative','ValueA')] = 10 * np.arange(100)
df.loc['Item2', ('Relative','ValueB')] = 20 * np.arange(100)
df.loc['Item2', ('Relative','ValueC')] = 30 * np.arange(100)

df.loc['Item2', ('Absolute','ValueA')] = 10 * np.arange(100) + 25
df.loc['Item2', ('Absolute','ValueB')] = 20 * np.arange(100) + 25
df.loc['Item2', ('Absolute','ValueC')] = 30 * np.arange(100) + 25
print df

save_df(df,'output_no_collapse.xlsx')
save_df_collapsing(df,'output_collapsed.xlsx')

output_collapsed.xlsx output_no_collapse.xlsx

mroeschke commented 1 year ago

Thanks for the suggestion but appears there hasn't been much traction or community interest in this feature so closing