pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Convert xarray dataset to dask dataframe or delayed objects #1093

Open · atmalagon opened this issue 7 years ago

atmalagon commented 7 years ago

It would be great to have a function like dask's to_delayed for converting xarray datasets into pandas DataFrames chunk by chunk.

http://stackoverflow.com/questions/40475884/how-to-convert-an-xarray-dataset-to-pandas-dataframes-inside-a-dask-dataframe
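
For reference, a minimal sketch of the dask.array to_delayed behavior this request mirrors (the array shape and chunking below are only illustrative):

import dask.array as da

# Array.to_delayed() returns one Delayed object per chunk; the request here
# is for an analogous chunk-wise conversion on xarray Datasets.
arr = da.ones((6, 4), chunks=(3, 2))   # a 2 x 2 grid of chunks
delayed_blocks = arr.to_delayed()      # numpy object array of Delayed, shape (2, 2)
print(delayed_blocks[0, 0].compute())  # materializes the underlying 3 x 2 block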

shoyer commented 7 years ago

CC @mrocklin @jcrist

This is a good use case for dask collection duck typing: https://github.com/dask/dask/pull/1068

jcrist commented 7 years ago

I'm not sure if I follow how this is a duck typing use case. I'd write this as a method, following your suggestion on SO:

Toward this end, it would be nice if xarray had something like dask.array's to_delayed method for converting a Dataset into an array of delayed datasets, which you could then lazily convert into DataFrame objects and do your computation.

Can you explain why you think this could benefit from collection duck typing?

shoyer commented 7 years ago

Can you explain why you think this could benefit from collection duck typing?

With that, we could use xarray's normal indexing operations to create new sub-datasets, wrap them with dask.delayed, and start chaining delayed method calls like to_dataframe. The duck typing is necessary so that dask.delayed knows how to pull the dask graph out of the input Dataset.
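
A minimal sketch of that pattern, assuming a dataset that is already chunked with dask (the example data and the single hard-coded slice are illustrative; a full version would derive the slices from the chunk structure, as in the helper below):

import dask
import numpy as np
import xarray as xr

# A small chunked dataset, purely for illustration.
ds = xr.Dataset({"t": (("x", "y"), np.arange(12).reshape(3, 4))}).chunk({"x": 1})

# Normal xarray indexing produces a sub-dataset; wrapping it in dask.delayed
# lets us chain delayed method calls such as to_dataframe.
sub = ds.isel(x=slice(0, 1))
delayed_df = dask.delayed(sub).to_dataframe()
print(delayed_df.compute())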

shoyer commented 7 years ago

The other component that would help for this is some utility function inside xarray to split a Dataset (or DataArray) into sub-datasets for each chunk. Something like:

import itertools

def split_by_chunks(dataset):
    """Yield (selection, sub-dataset) pairs, one per dask chunk."""
    chunk_slices = {}
    for dim, chunks in dataset.chunks.items():
        # Convert the chunk sizes along each dimension into index slices.
        slices = []
        start = 0
        for chunk in chunks:
            stop = start + chunk
            slices.append(slice(start, stop))
            start = stop
        chunk_slices[dim] = slices
    # The outer product of the per-dimension slices gives one selection
    # per chunk.
    for slices in itertools.product(*chunk_slices.values()):
        selection = dict(zip(chunk_slices.keys(), slices))
        yield (selection, dataset[selection])
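
For illustration, a sketch of how a helper like this could be combined with dask.delayed and dask.dataframe.from_delayed to build a dask DataFrame with one partition per chunk (the example dataset is made up; none of this is an existing xarray API):

import dask
import dask.dataframe as dd
import numpy as np
import xarray as xr

ds = xr.Dataset({"t": (("x", "y"), np.arange(12).reshape(3, 4))}).chunk({"x": 1})

# One lazy pandas DataFrame per chunk, using split_by_chunks from above.
partitions = [dask.delayed(sub).to_dataframe() for _, sub in split_by_chunks(ds)]

# Without an explicit meta, from_delayed computes one partition to infer it.
ddf = dd.from_delayed(partitions)
print(ddf.compute())
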
dcherian commented 5 years ago

I think this was closed by mistake. Is there a way to split up Dataset chunks into dask delayed objects where each object is a Dataset?

stale[bot] commented 3 years ago

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be closed automatically.