pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Is there a more efficient way to convert a subset of variables to a dataframe? #1086

Closed: naught101 closed this issue 11 months ago

naught101 commented 8 years ago

I have the following chunk of code that gets used a lot in my scripts:

> /data/documents/uni/phd/projects/pals_utils/pals_utils/data.py(291)pals_xr_to_df()
    289     # TODO: This is not suitable for gridded datasets:
    290     index_vars = {v: dataset.coords[v].values[0] for v in index_vars}
1-> 291     df = dataset.sel(**index_vars)[data_vars].to_dataframe()[data_vars]
    292 
    293     if qc:

It basically extracts a few data_vars from a dataset and converts them to a dataframe, limiting the axes to a single grid cell (this particular data only has one location anyway). The first [data_vars] call massively improves the efficiency by dropping most variables before converting to a dataframe; the second one gets rid of the x, y, and z columns in the dataframe. (Side issue: it would be nice to have a drop_dims= option in .to_dataframe() that dropped all dimensions of length 1.)

Here's an example of it in use:

ipdb> index_vars
{'y': 1.0, 'x': 1.0, 'z': 1.0}

ipdb> data_vars
['Qle']

ipdb> dataset
<xarray.Dataset>
Dimensions:           (time: 70128, x: 1, y: 1, z: 1)
Coordinates:
  * x                 (x) float64 1.0
  * y                 (y) float64 1.0
  * time              (time) datetime64[ns] 2002-01-01T00:30:00 ...
  * z                 (z) float64 1.0
Data variables:
    latitude          (y, x) float64 -35.66
    longitude         (y, x) float64 148.2
    elevation         (y, x) float64 1.2e+03
    reference_height  (y, x) float64 70.0
    NEE               (time, y, x) float64 1.597 1.651 1.691 1.735 1.778 ...
    Qh                (time, y, x) float64 -26.11 -25.99 -25.89 -25.78 ...
    Qle               (time, y, x) float64 5.892 5.898 5.864 5.826 5.788 ...
Attributes:
    Production_time: 2012-09-27 12:44:42
    Production_source: PALS automated netcdf conversion
    PALS_fluxtower_template_version: 1.0.2
    PALS_dataset_name: TumbaFluxnet
    PALS_dataset_version: 1.4
    Contact: palshelp@gmail.com

ipdb> dataset.sel(**index_vars)[data_vars].to_dataframe()[data_vars].head()
                          Qle
time                         
2002-01-01 00:30:00  5.891888
2002-01-01 01:00:00  5.898049
2002-01-01 01:30:00  5.863696
2002-01-01 02:00:00  5.825712
2002-01-01 02:30:00  5.787727

This particular line of code eventually calls pandas.tslib.array_to_timedelta64, which takes up a significant chunk of my script's run time. My line of code doesn't look like the best way to do things, and I'm wondering if there's a more efficient way to get the same resulting data. Any help would be greatly appreciated.

shoyer commented 8 years ago

The simplest thing to try is making use of .squeeze(), e.g., dataset[data_vars].squeeze().to_dataframe(). Does that have any better performance? At least it's a bit less typing.
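For example, here is a minimal self-contained sketch with a toy stand-in for your dataset (the random values and 4 timesteps are just placeholders):

import numpy as np
import pandas as pd
import xarray as xr

# a tiny stand-in for the single-site dataset in the ipdb session above
dataset = xr.Dataset(
    {"Qle": (("time", "y", "x"), np.random.rand(4, 1, 1))},
    coords={
        "time": pd.date_range("2002-01-01 00:30", periods=4, freq="30min"),
        "x": [1.0], "y": [1.0], "z": [1.0],
    },
)
data_vars = ["Qle"]

# squeeze() drops every remaining length-1 dimension, so the explicit
# .sel(**index_vars) is no longer needed; the trailing [data_vars] still
# drops the scalar coordinate columns that to_dataframe() keeps
df = dataset[data_vars].squeeze().to_dataframe()[data_vars]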

I'm not sure why pandas.tslib.array_to_timedelta64 is slow here, or even how it is being called in your example. I would need a complete example that I can run to debug that.

naught101 commented 8 years ago

Squeeze is pretty much identical in efficiency. Seems very slightly better (2-5%) on smaller datasets. (I still need to add the final [data_vars] to get rid of the extraneous index_var columns, but that doesn't affect performance much).

I'm not calling pandas.tslib.array_to_timedelta64 directly; to_dataframe is. The caller graph is below (sorry, I'm not sure of a better way to show this):

[caller_graph image]

shoyer commented 8 years ago

Try calling .load() before .to_dataframe
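For example (a sketch reusing the toy dataset and data_vars from the snippet a few comments up):

# .load() pulls the selected variables into memory up front, so any lazy
# decoding happens once here rather than inside to_dataframe()
df = dataset[data_vars].squeeze().load().to_dataframe()[data_vars]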

naught101 commented 8 years ago

Slightly slower (using %timeit in ipython)

shoyer commented 8 years ago

How did you construct this dataset?

naught101 commented 8 years ago

I loaded it from a netcdf file. There's an example you can play with at https://dl.dropboxusercontent.com/u/50684199/MitraEFluxnet.1.4_flux.nc

shoyer commented 8 years ago

Can you give me a copy/pastable script that has the slowness issue with that file?

naught101 commented 8 years ago

Not easily: most of my scripts require many of these datasets (up to 200; the linked one is one of the smallest, and some are up to 10 MB) in a specific directory structure, and they rely on a couple of private Python modules. I was just asking because I thought I might have been missing something obvious, but now I guess that isn't the case. It's probably not worth spending too much time on this; if it becomes a real problem for me, I will try to generate something self-contained that shows the problem. Until then, maybe it's best to assume that xarray/pandas are doing the best they can given the requirements, and close this for now.

shoyer commented 8 years ago

One thing that might hurt is that xarray (lazily) decodes times from each file separately, rather than decoding times all at once. But this hasn't been much of an issue before, even with hundreds of files, so I'm not sure what's going on here.
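If it would help to check whether time decoding is what's slow, one quick (if rough) comparison is to open the file with and without CF time decoding and time the conversion in both cases; this only isolates the decoding cost, it isn't a fix:

import xarray as xr

path = "MitraEFluxnet.1.4_flux.nc"  # the file linked above

ds_decoded = xr.open_dataset(path)                  # default: times decoded (lazily)
ds_raw = xr.open_dataset(path, decode_times=False)  # times left as raw numbers

# %timeit each of these; a large gap points at time decoding
df_decoded = ds_decoded.squeeze().to_dataframe()
df_raw = ds_raw.squeeze().to_dataframe()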

naught101 commented 8 years ago

Yeah, I'm loading each file separately with xr.open_dataset(), since it's not really a multi-file dataset (it's a lot of single-site datasets, some of which have different variables, and overlapping time dimensions). I don't think I can avoid loading them separately...

shoyer commented 8 years ago

Under the covers, open_mfdataset just uses open_dataset plus merge/concat, so this would be similar either way.

naught101 commented 8 years ago

So it would be more efficient to concat all of the datasets (subset for the relevant variables), and then just use a single .to_dataframe() call on the entire dataset? If so, that would require quite a bit of refactoring on my part, but it could be worth it.
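i.e. something like this rough sketch (file names are hypothetical, and I'd still need to check how the overlapping per-site time axes get aligned):

import xarray as xr

files = ["site_a_flux.nc", "site_b_flux.nc"]  # hypothetical per-site files
data_vars = ["Qle"]

# subset each dataset to the variables of interest, squeeze out the
# length-1 x/y/z dims, and stack everything along a new "site" dimension
per_site = [xr.open_dataset(f)[data_vars].squeeze() for f in files]
combined = xr.concat(per_site, dim="site")

# one conversion for the whole collection instead of one per file
df = combined.to_dataframe()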

shoyer commented 8 years ago

> So it would be more efficient to concat all of the datasets (subset for the relevant variables), and then just use a single .to_dataframe() call on the entire dataset? If so, that would require quite a bit of refactoring on my part, but it could be worth it.

Maybe? I'm not confident enough to advise you to go to that trouble.

naught101 commented 8 years ago

Ok, no worries. I'll try it if it gets desperate :)

Thanks for your help, shoyer!

andreall commented 4 years ago

I am running into the same problem. This might be a long shot, but @naught101, do you remember if you managed to convert to a dataframe in a more efficient way? Thanks,

dcherian commented 4 years ago

Can you make a reproducible example, @andreall?

andreall commented 4 years ago

Hi,

import xarray as xr
from pathlib import Path

dir_input = Path('.')
data_ww3 = xr.open_mfdataset(dir_input.glob('**/' + 'WW3_EUR-11_CCCma-CanESM2_r1i1p1_CLMcom-CCLM4-8-17_v1_6hr_*.nc'))

data_ww3 = data_ww3.isel(latitude=74, longitude=18)
df_ww3 = data_ww3[['hs', 't02', 't0m1', 't01', 'fp', 'dir', 'spr', 'dp']].to_dataframe()

You can download one file here: https://nasgdfa.ugr.es:5001/d/f/566168344466602780 (3.5 GB). I ran a profiler when opening 2 .nc files and it said the to_dataframe() call was the one taking most of the time.

[profiler screenshot]

I'm just wondering if there's a way to reduce the computing time. I need to open 95 files and it takes about 1.5 hours.

Thanks,

darothen commented 4 years ago

Hi @andreall, I'll leave @dcherian or another maintainer to comment on the xarray internals that might be pertinent for optimization here. However, just to throw it out there: for workflows like this, it can sometimes be easier to process each NetCDF file individually (subsetting your locations and whatnot) and convert it to CSV, then merge/concatenate those CSV files at the end. This sort of workflow can be parallelized a few different ways, and it is nice because you can parallelize across the number of files you need to process. A simple example based on your MRE:

import pandas as pd
import xarray as xr
from pathlib import Path
from joblib import delayed, Parallel

dir_input = Path('.')
fns = list(sorted(dir_input.glob('**/' + 'WW3_EUR-11_CCCma-CanESM2_r1i1p1_CLMcom-CCLM4-8-17_v1_6hr_*.nc')))

# Helper function to convert NetCDF to CSV with our processing
def _nc_to_csv(fn):
    data_ww3 = xr.open_dataset(fn)
    data_ww3 = data_ww3.isel(latitude=74, longitude=18)
    df_ww3 = data_ww3[['hs', 't02', 't0m1', 't01', 'fp', 'dir', 'spr', 'dp']].to_dataframe()

    out_fn = fn.with_suffix(".csv")  # fn is a Path, so swap the extension
    df_ww3.to_csv(out_fn)

    return out_fn

# Using joblib.Parallel to distribute my work across whatever resources I have
out_fns = Parallel(n_jobs=-1)(  # use all cores available here
    delayed(_nc_to_csv)(fn) for fn in fns
)

# Read the CSV files and merge them
dfs = [pd.read_csv(fn) for fn in out_fns]
df_ww3_all = pd.concat(dfs, ignore_index=True)

YMMV but this pattern often works for many types of processing applications.

andreall commented 4 years ago

Hi @darothen, thanks a lot! I hadn't thought of processing each file and then merging. I'll give it a try. Thanks,

stale[bot] commented 2 years ago

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

max-sixty commented 11 months ago

Is there anything actionable that's outstanding here? Otherwise we can close...