pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.57k stars 1.07k forks source link

Stricter defaults for concat, combine, open_mfdataset #8778

Open dcherian opened 7 months ago

dcherian commented 7 months ago

Is your feature request related to a problem?

The defaults for concat are excessively permissive: data_vars="all", coords="different", compat="no_conflicts", join="outer". This comment illustrates why this can be hard to predict or understand: a seemingly unrelated option decode_cf controls whether a variable is in data_vars or coords, and can result in wildly different concatenation behaviour.

  1. This always concatenates data_vars along concat_dim even if they did not have that dimension to begin with.
  2. If the same coordinate var exists in different datasets/files, they will be sequentially compared for equality to decide whether they get concatenated.
  3. The outer join (applied along all dimensions that are not concat_dim) can result in very large datasets due to small floating points differences in the indexes, and also questionable behaviour with staggered grid datasets.
  4. "no_conflicts" basically picks the first not-NaN value after aligning all datasets, but is quite slow (we should be using duck_array_ops.nanfirst here I think).

While "convenient" this really just makes the default experience quite bad with hard-to-understand slowdowns.

Describe the solution you'd like

I propose we migrate to data_vars="minimal", coords="minimal", join="exact", compat="override". This should

  1. only concatenate data_vars and coords variables when they already have concat_dim.
  2. For any variables that do not have concat_dim, it will blindly pick them from the first file.
  3. join="exact" will prevent ballooning of dimension sizes due to floating point inequalities.
  4. These options will totally avoid any data reads unless explicitly requested by the user.

Unfortunately, this has a pretty big blast radius so we'd need a long deprecation cycle.

Describe alternatives you've considered

No response

Additional context

xref https://github.com/pydata/xarray/issues/4824 xref https://github.com/pydata/xarray/issues/1385 xref https://github.com/pydata/xarray/issues/8231 xref https://github.com/pydata/xarray/issues/5381 xref https://github.com/pydata/xarray/issues/2064 xref https://github.com/pydata/xarray/issues/2217

TomNicholas commented 7 months ago

Thanks for raising this @dcherian. I support this, along with some effort to improve the docstrings/documentation/faq to make the default behaviour and potential pitfalls clearer.

What would a deprecation cycle for this look like? We're talking about changing the default values of 4 different keyword arguments, that get used in open_mfdataset, concat, combine_nested and combine_by_coords. Should we rip the bandaid off and change them all at once?

dcherian commented 7 months ago

Should we rip the bandaid off and change them all at once?

+1. it would be massively confusing otherwise.

jsignell commented 4 months ago

Regarding deprecation cycle. I think something like:

max-sixty commented 4 months ago

Broadly agree with @jsignell, one dissent:

If user does not specify each of those 4 kwargs raise a warning indicating the behavior will change in the future and suggest they use the setting option to opt in.

...iff the behavior from xarray would be different.

Otherwise this will be really noisy. Requires some work to assess whether the behavior will change in the concat call, but I think worthwhile / the cost of making the change...