Supporting Excel Spreadsheets?

pydata / xarray

N-D labeled arrays and datasets in Python

https://xarray.dev

Apache License 2.0

Supporting Excel Spreadsheets? #9385

Open TomNicholas opened 2 months ago

TomNicholas commented 2 months ago

from @ahuang11 in https://github.com/xarray-contrib/datatree/issues/342

I wonder if xarray-datatree should support reading Excel spreadsheets since a lot of the world still uses Excel, and I'm working with one right now.

I imagine the tree's leaves would contain each sheet, analogous to netCDF groups.

Perhaps it doesn't have to be limited to Excel, and would be able to read Google Sheets directly by passing a URL.

Just a random thought.

TomNicholas commented 2 months ago

That's an interesting idea... I think this would only be useful if the spreadsheet followed some specific schema though.

An experiment would be using pandas.read_excel to return multiple sheets as a dict of pd.DataFrame objects, followed by calling xarray.Dataset.from_dataframe for each dataframe, and then using DataTree.from_dict.

If that actually works out then maybe we could add it as an example to the IO page on xarray's documentation.

(Note also that this idea isn't really datatree-specific, because you could use pandas.read_excel(..., sheet_name='some_name') to read one sheet and create one xr.Dataset.)
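To illustrate the single-sheet case and why the schema matters, here is a minimal sketch. It uses an in-memory DataFrame as a stand-in for what pandas.read_excel(..., sheet_name='some_name') would return; the column names (time, station, temperature) are made up for the example.

```python
import pandas as pd
import xarray as xr

# Hypothetical stand-in for:
#   df = pd.read_excel("sheets.xlsx", sheet_name="some_name")
df = pd.DataFrame(
    {
        "time": [0, 0, 1, 1],
        "station": ["a", "b", "a", "b"],
        "temperature": [10.0, 11.0, 12.0, 13.0],
    }
)

# Without any schema, the default row index becomes a single "index"
# dimension and every column becomes a 1-D data variable.
ds_flat = xr.Dataset.from_dataframe(df)

# If we know which columns are coordinates, setting them as the index
# lets from_dataframe unstack them into separate dimensions.
ds_grid = xr.Dataset.from_dataframe(df.set_index(["time", "station"]))
```

In the second case temperature ends up as a 2-D variable over (time, station), which is only possible because we told pandas which columns to treat as labels.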

ahuang11 commented 2 months ago

pd.read_excel(..., sheet_name=None) already returns multiple sheets as a dict.

Also, I think some Excel spreadsheets are just nested CSVs in one file.

ahuang11 commented 2 months ago

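import pandas as pd
import xarray as xr
from datatree import DataTree

# sheet_name=None reads every sheet and returns a dict of {sheet name: DataFrame}
dfs = pd.read_excel("sheets.xlsx", sheet_name=None)
# Convert each sheet to a Dataset, keyed by its sheet name
ds_dict = {}
for sheet_name, df in dfs.items():
    ds_dict[sheet_name] = xr.Dataset.from_dataframe(df)
# Each sheet becomes one child node of the tree
dt = DataTree.from_dict(ds_dict)
dt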

TomNicholas commented 2 months ago

That's cool!

Illviljan commented 2 months ago

Using pd.read_excel directly is not lazy though. Creating a backend for it would make it lazy.
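As a starting point for what such a backend could look like, here is a minimal sketch built on xarray's BackendEntrypoint base class. The class name ExcelBackendEntrypoint and the sheet_name keyword are assumptions made for this example, not an existing xarray feature. Note that this skeleton still reads the sheet eagerly through pd.read_excel; getting real laziness would additionally need a lazily indexed array wrapper around the file, which is not shown here.

```python
import os

import pandas as pd
import xarray as xr
from xarray.backends import BackendEntrypoint


class ExcelBackendEntrypoint(BackendEntrypoint):
    """Hypothetical backend sketch: opens one Excel sheet as a Dataset."""

    description = "Open one sheet of an Excel file (illustrative sketch)"
    open_dataset_parameters = ["filename_or_obj", "drop_variables", "sheet_name"]

    def open_dataset(self, filename_or_obj, *, drop_variables=None, sheet_name=0):
        # Eager read: the whole sheet is loaded into memory here.
        df = pd.read_excel(filename_or_obj, sheet_name=sheet_name)
        ds = xr.Dataset.from_dataframe(df)
        if drop_variables:
            ds = ds.drop_vars(drop_variables, errors="ignore")
        return ds

    def guess_can_open(self, filename_or_obj):
        # Only claim paths that look like Excel files.
        try:
            suffix = os.path.splitext(filename_or_obj)[1]
        except TypeError:
            return False
        return suffix.lower() in {".xlsx", ".xls"}
```

With a class like this, something along the lines of xr.open_dataset("sheets.xlsx", engine=ExcelBackendEntrypoint, sheet_name="some_name") should work, since the engine argument also accepts a BackendEntrypoint subclass.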