pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

slow performance with open_mfdataset #1385


rabernat commented 7 years ago

We have a dataset stored across multiple netCDF files. We are getting very slow performance with open_mfdataset, and I would like to improve this.

Each individual netCDF file looks like this:

%time ds_single = xr.open_dataset('float_trajectories.0000000000.nc')
ds_single
CPU times: user 14.9 ms, sys: 48.4 ms, total: 63.4 ms
Wall time: 60.8 ms

<xarray.Dataset>
Dimensions:  (npart: 8192000, time: 1)
Coordinates:
  * time     (time) datetime64[ns] 1993-01-01
  * npart    (npart) int32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
    z        (time, npart) float32 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
    vort     (time, npart) float32 -9.71733e-10 -9.72858e-10 -9.73001e-10 ...
    u        (time, npart) float32 0.000545563 0.000544884 0.000544204 ...
    v        (time, npart) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    x        (time, npart) float32 180.016 180.047 180.078 180.109 180.141 ...
    y        (time, npart) float32 -79.9844 -79.9844 -79.9844 -79.9844 ...

As shown above, a single data file opens in ~60 ms.

When I call open_mfdataset on 49 files (each with a different time value but the same npart), here is what happens:

%time ds = xr.open_mfdataset('*.nc')
ds
CPU times: user 1min 31s, sys: 25.4 s, total: 1min 57s
Wall time: 2min 4s

<xarray.Dataset>
Dimensions:  (npart: 8192000, time: 49)
Coordinates:
  * npart    (npart) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * time     (time) datetime64[ns] 1993-01-01 1993-01-02 1993-01-03 ...
Data variables:
    z        (time, npart) float64 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
    vort     (time, npart) float64 -9.717e-10 -9.729e-10 -9.73e-10 -9.73e-10 ...
    u        (time, npart) float64 0.0005456 0.0005449 0.0005442 0.0005437 ...
    v        (time, npart) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    x        (time, npart) float64 180.0 180.0 180.1 180.1 180.1 180.2 180.2 ...
    y        (time, npart) float64 -79.98 -79.98 -79.98 -79.98 -79.98 -79.98 ...

It takes over 2 minutes to open the dataset. Specifying concat_dim='time' does not improve performance.

Here is the %prun output of the open_mfdataset command.

         748994 function calls (724222 primitive calls) in 142.160 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       49   62.455    1.275   62.458    1.275 {method 'get_indexer' of 'pandas.index.IndexEngine' objects}
       49   47.207    0.963   47.209    0.963 base.py:1067(is_unique)
      196    7.198    0.037    7.267    0.037 {operator.getitem}
       49    4.632    0.095    4.687    0.096 netCDF4_.py:182(_open_netcdf4_group)
      240    3.189    0.013    3.426    0.014 numeric.py:2476(array_equal)
       98    1.937    0.020    1.937    0.020 {numpy.core.multiarray.arange}
4175/3146    1.867    0.000    9.296    0.003 {numpy.core.multiarray.array}
       49    1.525    0.031  119.144    2.432 alignment.py:251(reindex_variables)
       24    1.065    0.044    1.065    0.044 {method 'cumsum' of 'numpy.ndarray' objects}
       12    1.010    0.084    1.010    0.084 {method 'sort' of 'numpy.ndarray' objects}
5227/4035    0.660    0.000    1.688    0.000 collections.py:50(__init__)
       12    0.600    0.050    3.238    0.270 core.py:2761(insert)
12691/7497    0.473    0.000    0.875    0.000 indexing.py:363(shape)
   110728    0.425    0.000    0.663    0.000 {isinstance}
       12    0.413    0.034    0.413    0.034 {method 'flatten' of 'numpy.ndarray' objects}
       12    0.341    0.028    0.341    0.028 {numpy.core.multiarray.where}
        2    0.333    0.166    0.333    0.166 {pandas._join.outer_join_indexer_int64}
        1    0.331    0.331  142.164  142.164 <string>:1(<module>)

It looks like most of the time is being spent in reindex_variables. I understand why this happens: xarray needs to verify that the indexes are the same across files before it can concatenate them.
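The per-call times are roughly what raw pandas operations cost at this size. A rough illustration (timings are machine-dependent; this just mimics the two hottest calls from the profile on an 8,192,000-element index):

import numpy as np
import pandas as pd

idx = pd.Index(np.arange(1, 8192001))  # stand-in for the npart index
%time idx.is_unique         # cf. the 49 is_unique calls, ~1.0 s each
%time idx.get_indexer(idx)  # cf. the 49 get_indexer calls, ~1.3 s each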

Is there any obvious way I could improve the load time? For example, can I give xarray a hint that this reindex_variables step is unnecessary, since I know that the npart index is identical in each file?
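Something like the following is the kind of "hint" I have in mind. This is only a sketch, and it is only correct under the assumption that the npart values really are identical in every file: drop the npart index before concatenating, so there is nothing to align, then re-attach it afterwards.

import glob
import xarray as xr

files = sorted(glob.glob('float_trajectories.*.nc'))

# drop the npart coordinate so concat has no index to align
datasets = [xr.open_dataset(f).drop('npart') for f in files]
combined = xr.concat(datasets, dim='time')

# re-attach npart from the first file (assumed identical everywhere)
combined['npart'] = xr.open_dataset(files[0])['npart']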

Possibly related to #1301 and #1340.

rabernat commented 7 months ago

This issue is almost seven years old! It has been "fixed" many times since my original post, but people keep finding new ways to make it reappear. 😆

It seems like having better diagnostics / logging of what is happening under the hood with open_mfdataset is what people really need. Maybe even some sort of utility to pre-scan the files and figure out if they are easily openable or not.
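Something in the spirit of this sketch is what I mean by a pre-scan utility (a hypothetical helper, not part of xarray; it just times each individual open and compares the non-concatenated indexes against the first file):

import glob
import time
import xarray as xr

def prescan(pattern, concat_dim='time'):
    files = sorted(glob.glob(pattern))
    ref_indexes = None
    for f in files:
        t0 = time.perf_counter()
        with xr.open_dataset(f) as ds:
            print(f'{f}: opened in {time.perf_counter() - t0:.3f} s, '
                  f'dims={dict(ds.sizes)}')
            # dimension-coordinate indexes are loaded eagerly, so they
            # remain usable after the file is closed
            indexes = {k: v for k, v in ds.indexes.items() if k != concat_dim}
        if ref_indexes is None:
            ref_indexes = indexes
        else:
            for name, idx in indexes.items():
                if name not in ref_indexes or not idx.equals(ref_indexes[name]):
                    print(f'  !! index {name!r} differs from the first file')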

jtomfarrar commented 7 months ago

> This issue is almost seven years old! It has been "fixed" many times since my original post, but people keep finding new ways to make it reappear. 😆
>
> It seems like having better diagnostics / logging of what is happening under the hood with open_mfdataset is what people really need. Maybe even some sort of utility to pre-scan the files and figure out if they are easily openable or not.

Both of those seem like great ideas. Maybe there could be a verbose or logging mode to help users identify what is wrong with their files (e.g., where the time is being spent and whether that seems suspicious). It is probably true that people (like me) will keep finding new ways to generate problematic netCDF files. (I'm sure we can think of something even worse than 20 Hz data referenced to a time origin 75 years ago.)
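For that particular failure mode, plain xarray can already show whether time decoding is the expensive part. A sketch (the filename is just a placeholder): open with decode_times=False, inspect the time encoding, then time the decode step separately.

import xarray as xr

%time ds_raw = xr.open_dataset('problem_file.nc', decode_times=False)
print(ds_raw.time.attrs.get('units'))  # e.g. 'seconds since 1950-01-01'
%time ds = xr.decode_cf(ds_raw)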