The simplest thing to try is making use of .squeeze(), e.g., dataset[data_vars].squeeze().to_dataframe(). Does that have any better performance? At least it's a bit less typing.
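For concreteness, a minimal sketch of that suggestion (dataset, data_vars and the file name are placeholders for whatever your script actually uses):
import xarray as xr
data_vars = ["Qh", "Qle"]                   # hypothetical variable names
dataset = xr.open_dataset("some_site.nc")   # hypothetical single-site file
# Subset first, then drop all length-1 dimensions before converting
df = dataset[data_vars].squeeze().to_dataframe()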
I'm not sure why pandas.tslib.array_to_timedelta64 is slow here, or even how it is being called in your example. I would need a complete example that I can run to debug that.
Squeeze is pretty much identical in efficiency. Seems very slightly better (2-5%) on smaller datasets. (I still need to add the final [data_vars] to get rid of the extraneous index_var columns, but that doesn't affect performance much.) I'm not calling pandas.tslib.array_to_timedelta64, to_dataframe is - the caller list is (sorry, I'm not sure of a better way to show this):
Try calling .load() before .to_dataframe.
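i.e., roughly (a sketch, with the same placeholder names as above):
import xarray as xr
data_vars = ["Qh", "Qle"]                   # hypothetical variable names
dataset = xr.open_dataset("some_site.nc")   # hypothetical single-site file
subset = dataset[data_vars].squeeze()
subset.load()                               # pull the lazy arrays into memory first
df = subset.to_dataframe()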
Slightly slower (using %timeit in ipython).
How did you construct this dataset?
I loaded it from a netcdf file. There's an example you can play with at https://dl.dropboxusercontent.com/u/50684199/MitraEFluxnet.1.4_flux.nc
Can you give me a copy/pastable script that has the slowness issue with that file?
Not easily - most scripts require multiple (up to 200, of which the linked one is one of the smallest, some are up to 10Mb) of these datasets in a specific directory structure, and rely on a couple of private python modules. I was just asking because I thought I might have been missing something obvious, but now I guess that isn't the case. Probably not worth spending too much time on this - if it starts becoming a real problem for me, I will try to generate something self-contained that shows the problem. Until then, maybe it's best to assume that xarray/pandas are doing the best they can given the requirements, and close this for now.
One thing that might hurt is that xarray (lazily) decodes times from each file separately, rather than decoding times all at once. But this hasn't been much of an issue before even with hundreds of times, so I'm not sure what's going on here.
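If that per-file decoding does turn out to be the bottleneck, one thing worth experimenting with (untested here, and it assumes the files share the same time units/encoding; all names below are made up) is to skip time decoding when opening and decode once after combining:
import xarray as xr
files = ["site_a.nc", "site_b.nc"]          # hypothetical file names
datasets = [xr.open_dataset(f, decode_times=False) for f in files]
combined = xr.concat(datasets, dim="site")  # "site" is a made-up new dimension
combined = xr.decode_cf(combined)           # decode CF metadata (including times) once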
Yeah, I'm loading each file separately with xr.open_dataset(), since it's not really a multi-file dataset (it's a lot of single-site datasets, some of which have different variables, and overlapping time dimensions). I don't think I can avoid loading them separately...
Under the covers open_mfdataset just uses open_dataset and merge/concat. So this would be similar either way.
So it would be more efficient to concat all of the datasets (subset for the relevant variables), and then just use a single .to_dataframe() call on the entire dataset? If so, that would require quite a bit of refactoring on my part, but it could be worth it.
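i.e., something like this (a rough sketch with hypothetical names; it assumes the per-site datasets can be concatenated along a new dimension):
import xarray as xr
from pathlib import Path
data_vars = ["Qh", "Qle"]                             # hypothetical variable names
files = sorted(Path(".").glob("*_flux.nc"))           # hypothetical directory layout
datasets = [xr.open_dataset(f)[data_vars].squeeze() for f in files]
combined = xr.concat(datasets, dim="site")            # one dataset instead of ~200 small ones
df = combined.to_dataframe()                          # a single conversion call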
Maybe? I'm not confident enough to advise you to go to that trouble.
Ok, no worries. I'll try it if it gets desperate :)
Thanks for your help, shoyer!
I am running into the same problem. This might be a long shot, but @naught101, do you remember if you managed to convert to a dataframe in a more efficient way? Thanks,
Can you make a reproducible example @andreall?
Hi,
import xarray as xr
from pathlib import Path
dir_input = Path('.')
data_ww3 = xr.open_mfdataset(dir_input.glob('**/' + 'WW3_EUR-11_CCCma-CanESM2_r1i1p1_CLMcom-CCLM4-8-17_v1_6hr_*.nc'))
data_ww3 = data_ww3.isel(latitude=74, longitude=18)
df_ww3 = data_ww3[['hs', 't02', 't0m1', 't01', 'fp', 'dir', 'spr', 'dp']].to_dataframe()
You can download one file here: https://nasgdfa.ugr.es:5001/d/f/566168344466602780 (3.5 GB). I ran a profiler when opening 2 .nc files, and it said the to_dataframe() call was the one taking most of the time.
I'm just wondering if there's a way to reduce computing time. I need to open 95 files and it takes about 1.5 hours.
Thanks,
Hi @andreall, I'll leave @dcherian or another maintainer to comment on internals of xarray that might be pertinent for optimization here. However, just to throw it out there, for workflows like this it can sometimes be a bit easier to process each NetCDF file (subsetting your locations and whatnot) and convert it to CSV individually, then merge/concatenate those CSV files together at the end. This sort of workflow can be parallelized a few different ways, but is nice because you can parallelize across the number of files you need to process. A simple example based on your MRE:
import xarray as xr
import pandas as pd
from pathlib import Path
from joblib import delayed, Parallel
dir_input = Path('.')
fns = list(sorted(dir_input.glob('**/' + 'WW3_EUR-11_CCCma-CanESM2_r1i1p1_CLMcom-CCLM4-8-17_v1_6hr_*.nc')))
# Helper function to convert NetCDF to CSV with our processing
def _nc_to_csv(fn):
data_ww3 = xr.open_dataset(fn)
data_ww3 = data_ww3.isel(latitude=74, longitude=18)
df_ww3 = data_ww3[['hs', 't02', 't0m1', 't01', 'fp', 'dir', 'spr', 'dp']].to_dataframe()
out_fn = fn.with_suffix(".csv")  # fn is a pathlib.Path, so swap the suffix rather than using str.replace
df_ww3.to_csv(out_fn)
return out_fn
# Using joblib.Parallel to distribute the work across whatever resources I have
out_fns = Parallel(n_jobs=-1)(  # n_jobs=-1 uses all available cores
    delayed(_nc_to_csv)(fn) for fn in fns
)
# Read the CSV files and merge them
dfs = [
pd.read_csv(fn) for fn in out_fns
]
df_ww3_all = pd.concat(dfs, ignore_index=True)
YMMV but this pattern often works for many types of processing applications.
Hi @darothen , Thanks a lot..I hadn't thought of processing each file and then merging. Will give it a try, Thanks,
In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.
Is there anything actionable that's outstanding here? Otherwise we can close...
I have the following chunk of code that gets used a lot in my scripts:
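(Roughly, the pattern is as follows; this is a reconstruction from the description below, with the variable names purely illustrative and x, y, z as the length-1 dimensions:)
import xarray as xr
ds = xr.open_dataset("MitraEFluxnet.1.4_flux.nc")     # the sample file linked earlier in the thread
data_vars = ["Qh", "Qle"]                             # illustrative variable names
# The first [data_vars] drops unneeded variables before the conversion;
# the trailing [data_vars] drops the leftover x/y/z columns afterwards.
df = ds[data_vars].isel(x=0, y=0, z=0).to_dataframe()[data_vars]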
It basically extracts a few data_vars from a dataset, and converts it to a dataframe, limiting the axis to a single grid-cell (this particular data only has one location anyway). The first [data_vars] call massively improves the efficiency (by dropping most variables before converting to a dataframe); the second one is to get rid of the x, y, and z in the dataframe (side-issue: it would be nice to have a drop_dims= option in .to_dataframe that dropped all dimensions of length 1). Here's an example of it in use:
This particular line of code eventually calls pandas.tslib.array_to_timedelta64, which takes up a significant chunk of my script's run time. My line of code doesn't look like it's the best way to do things, and I'm wondering if there's any way to get the same resulting data that's more efficient. Any help would be greatly appreciated.