Open dcherian opened 7 months ago
Thanks for raising this @dcherian. I support this, along with some effort to improve the docstrings/documentation/faq to make the default behaviour and potential pitfalls clearer.
What would a deprecation cycle for this look like? We're talking about changing the default values of 4 different keyword arguments, that get used in open_mfdataset
, concat
, combine_nested
and combine_by_coords
. Should we rip the bandaid off and change them all at once?
Should we rip the bandaid off and change them all at once?
+1. it would be massively confusing otherwise.
Regarding deprecation cycle. I think something like:
Broadly agree with @jsignell, one dissent:
If user does not specify each of those 4 kwargs raise a warning indicating the behavior will change in the future and suggest they use the setting option to opt in.
...iff the behavior from xarray would be different.
Otherwise this will be really noisy. Requires some work to assess whether the behavior will change in the concat
call, but I think worthwhile / the cost of making the change...
Is your feature request related to a problem?
The defaults for
concat
are excessively permissive:data_vars="all", coords="different", compat="no_conflicts", join="outer"
. This comment illustrates why this can be hard to predict or understand: a seemingly unrelated optiondecode_cf
controls whether a variable is indata_vars
orcoords
, and can result in wildly different concatenation behaviour.concat_dim
even if they did not have that dimension to begin with.concat_dim
) can result in very large datasets due to small floating points differences in the indexes, and also questionable behaviour with staggered grid datasets.duck_array_ops.nanfirst
here I think).While "convenient" this really just makes the default experience quite bad with hard-to-understand slowdowns.
Describe the solution you'd like
I propose we migrate to
data_vars="minimal", coords="minimal", join="exact", compat="override"
. This shoulddata_vars
andcoords
variables when they already haveconcat_dim
.concat_dim
, it will blindly pick them from the first file.join="exact"
will prevent ballooning of dimension sizes due to floating point inequalities.Unfortunately, this has a pretty big blast radius so we'd need a long deprecation cycle.
Describe alternatives you've considered
No response
Additional context
xref https://github.com/pydata/xarray/issues/4824 xref https://github.com/pydata/xarray/issues/1385 xref https://github.com/pydata/xarray/issues/8231 xref https://github.com/pydata/xarray/issues/5381 xref https://github.com/pydata/xarray/issues/2064 xref https://github.com/pydata/xarray/issues/2217