pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

open_mfdataset: skip loading for indexes and coordinates from all but the first file #2039


crusaderky commented 6 years ago

This is a follow-up from #1521.

When invoking open_mfdataset, the user very frequently knows in advance that all of the coords that aren't on the concat_dim are already aligned, and may be willing to blindly trust that assumption in exchange for a huge performance boost.

My production data: 200 NetCDF files on a slow NFS file system, concatenated on the "scenario" dimension:

xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario')

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * attribute    (attribute) object 'THEO/Value'
    currency     (instr_id) object 'ZAR' 'EUR' 'EUR' 'EUR' 'EUR' 'EUR' 'GBP' ...
  * fx_id        (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ...
  * instr_id     (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ...
  * timestep     (timestep) datetime64[ns] 2016-12-31
    type         (instr_id) object 'American' 'Bond Future' 'Bond Future' ...
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 19.6 s, sys: 981 ms, total: 20.6 s
Wall time: 24.4 s

If I skip loading and comparing the non-index coords from all 200 files:

xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario', coords='all')

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * attribute    (attribute) object 'THEO/Value'
  * fx_id        (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ...
  * instr_id     (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ...
  * timestep     (timestep) datetime64[ns] 2016-12-31
    currency     (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)>
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
    type         (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)>
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 12.7 s, sys: 305 ms, total: 13 s
Wall time: 14.8 s

If I also skip loading and comparing the index coords from all 200 files:

cube = xarray.open_mfdataset('cube.*.nc', engine='h5netcdf',
                             concat_dim='scenario',
                             drop_variables=['attribute', 'fx_id', 'instr_id', 'timestep', 'currency', 'type'])

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
Dimensions without coordinates: attribute, fx_id, instr_id, timestep
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 7.31 s, sys: 61 ms, total: 7.37 s
Wall time: 9.05 s
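
If the dropped coords are needed afterwards, they can be copied back from any single file. A sketch, assuming the first file matched by the glob holds the canonical values:

import glob
import xarray

first = xarray.open_dataset(sorted(glob.glob('cube.*.nc'))[0], engine='h5netcdf')
dropped = ['attribute', 'fx_id', 'instr_id', 'timestep', 'currency', 'type']
# Re-attach the coords that drop_variables removed from the combined dataset
cube = cube.assign_coords({name: first[name] for name in dropped})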

Proposed design

Add a new optional parameter to open_mfdataset, assume_aligned=None. It can be set to a list of variable names or to "all", and requires concat_dim to be explicitly set. It causes open_mfdataset to use the first occurrence of every such variable and blindly skip loading the subsequent ones.
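
For the dataset above, the call could then look like this (a sketch of the proposed API only; assume_aligned does not exist in xarray today):

cube = xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario',
                             assume_aligned=['currency', 'type'])  # proposed parameter; or assume_aligned='all'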

Algorithm

  1. Perform the first invocation of the underlying open_dataset exactly as it happens now.
  2. If assume_aligned is not None: for each subsequent NetCDF file, figure out which variables would need to be aligned and compared (as opposed to concatenated along concat_dim), and add them to a drop_variables list.
  3. If assume_aligned != "all": intersect the list with the user's selection, i.e. drop_variables &= assume_aligned.
  4. Pass the increasingly long drop_variables list to the underlying open_dataset (see the sketch below).
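
A minimal sketch of that algorithm outside of xarray internals, simplified in that the skip set is computed once from the first file rather than grown incrementally (open_mfdataset_assume_aligned is a hypothetical helper, not part of xarray):

import xarray

def open_mfdataset_assume_aligned(paths, concat_dim, assume_aligned, **kwargs):
    # Hypothetical helper sketching the proposed behaviour; not xarray internals.
    # Step 1: open the first file exactly as open_mfdataset does today.
    first = xarray.open_dataset(paths[0], **kwargs)
    # Step 2: variables that lack concat_dim would normally be loaded from
    # every file and compared; they are the candidates for skipping.
    skip = {name for name, var in first.variables.items()
            if concat_dim not in var.dims}
    # Step 3: unless "all", only skip the variables the user vouched for.
    if assume_aligned != 'all':
        skip &= set(assume_aligned)
    # Step 4: drop the skipped variables from every other file, concatenate,
    # then restore the coords from the first file.
    rest = [xarray.open_dataset(p, drop_variables=list(skip), **kwargs)
            for p in paths[1:]]
    combined = xarray.concat([first.drop_vars(list(skip))] + rest, dim=concat_dim)
    return combined.assign_coords(
        {name: first[name] for name in skip if name in first.coords})

With the dataset above, open_mfdataset_assume_aligned(sorted(glob.glob('cube.*.nc')), 'scenario', 'all', engine='h5netcdf') would read the non-scenario coords from the first file only.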
rabernat commented 6 years ago

I agree it would be great to have this feature.

There has already been lots of discussion of this in #1385 and #1823. I tried and failed to implement something similar in #1413. I recommend reviewing those threads before jumping into this.