pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Rules for propagating attrs and encoding #1614

Open jhamman opened 6 years ago

jhamman commented 6 years ago

We need to come up with some clear rules for when and how xarray should propagate metadata (attrs/encoding). This has come up routinely (e.g. #25, #138, #442, #688, #828, #988, #1009, #1271, #1297, #1586) and we don't have a clear direction as to when to keep/drop metadata.

I'll take a first cut:

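| operation  | attrs      | encoding   | status      |
|------------|------------|------------|-------------|
| reduce     | drop       | drop       |             |
| arithmetic | drop       | drop       | implemented |
| copy       | keep       | keep       |             |
| concat     | keep first | keep first | implemented |
| slice      | keep       | drop       |             |
| where      | keep       | keep       |             |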

cc @shoyer (following up on https://github.com/pydata/xarray/issues/1586#issuecomment-334954046)
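To make the proposed rules concrete, here is a minimal sketch that encodes the table as a plain lookup and applies it to toy metadata dictionaries. It is not xarray code: `RULES` and `propagate` are hypothetical names used only for illustration, and "keep first" is interpreted as "copy from the first input".

```python
# Hypothetical sketch of the proposed rules as a lookup table (not xarray code).
RULES = {
    # operation: (attrs rule, encoding rule)
    "reduce": ("drop", "drop"),
    "arithmetic": ("drop", "drop"),
    "copy": ("keep", "keep"),
    "concat": ("keep first", "keep first"),
    "slice": ("keep", "drop"),
    "where": ("keep", "keep"),
}


def propagate(operation, inputs):
    """Return (attrs, encoding) for the result of `operation`.

    `inputs` is a list of (attrs, encoding) dict pairs, one per input object.
    """
    attrs_rule, encoding_rule = RULES[operation]
    first_attrs, first_encoding = inputs[0]

    def apply(rule, first):
        # "keep" and "keep first" both copy metadata from the first input;
        # "drop" yields empty metadata.
        return dict(first) if rule in ("keep", "keep first") else {}

    return apply(attrs_rule, first_attrs), apply(encoding_rule, first_encoding)


meta = ({"units": "K"}, {"dtype": "int16"})
sliced = propagate("slice", [meta])    # attrs kept, encoding dropped
reduced = propagate("reduce", [meta])  # both dropped
```

Keeping the policy in one table like this would make it easy to document and to review row by row, which is the main point of the proposal.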

ethan-campbell commented 6 years ago

I'd also suggest that a global option of always_keep_attrs=True would be useful. While I understand the logic of dropping units during certain operations, it makes attributes unusable for storing other miscellaneous metadata, e.g. on data provenance. As a recent xarray convert, this behavior has been frustrating.

mraspaud commented 6 years ago

This issue is very relevant for me too. I would like to also propose that a user could provide a function that would know how to combine the attrs of different DataArrays.
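As a rough illustration of what such a user-provided function could look like, the sketch below keeps only the attributes on which every input agrees. The name `combine_attrs` and its signature are assumptions made for this example, not an existing hook in xarray at the time of this comment, and it assumes attribute values are plain Python objects that compare with `!=`.

```python
def combine_attrs(attrs_list):
    """Keep only the keys whose values agree across every input (hypothetical helper)."""
    if not attrs_list:
        return {}
    combined = dict(attrs_list[0])
    for attrs in attrs_list[1:]:
        # Iterate over a copy of the keys because entries are deleted in the loop.
        for key in list(combined):
            if key not in attrs or attrs[key] != combined[key]:
                del combined[key]
    return combined


a = {"units": "K", "source": "model A"}
b = {"units": "K", "source": "model B"}
result = combine_attrs([a, b])  # the conflicting "source" entry is dropped
```

Other policies (keep the first value, raise on conflict, merge into a list) would fit the same signature, which is why a pluggable function is attractive here.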

brey commented 6 years ago

I am also interested. I am in principle OK with the table from @jhamman. However, there could be an option to refer to the original attrs in order to provide provenance even on operations like reduce and arithmetic. The idea here is reproducibility and traceability. Maybe an 'origin' attribute?

shoyer commented 6 years ago

The challenge with a user-specified function is that there can potentially be weird conflicts if multiple libraries try to override it. Possibly it's worth it for the convenience, but subclasses allowing for explicit hooks (like numpy) is probably the cleanest solution.

SeanDS commented 6 years ago

Hi, this feature would be very relevant to the intended use case of a project I'd like to use xarray for. Is the behaviour discussed in the first post implemented anywhere, e.g. in the trunk, for me to play with?

SeanDS commented 6 years ago

Also - might I suggest you consider some kind of history tracker as part of the metadata propagation? Perhaps metadata could be saved from each step of a set of operations, so that there is a full paper trail of the operations that have been applied to the data. It could however get complicated to merge together objects with their own separate histories, especially if they ultimately descend from the same original data set.

This would be very relevant for scientific analyses.

shoyer commented 6 years ago

> Hi, this feature would be very relevant to the intended use case of a project I'd like to use xarray for. Is the behaviour discussed in the first post implemented anywhere, e.g. in the trunk, for me to play with?

are you referring to a different issue? the first post only summarizes some simple proposed rules.

shoyer commented 6 years ago

> Also - might I suggest you consider some kind of history tracker as part of the metadata propagation?

Certainly this would be out of scope for xarray itself, but this could perhaps be done with a library that wraps xarray's API. If I recall correctly, @pwolfram was also interested in this.

We did discuss customizable hooks for attribute handling in #988 but I'm no longer sure that is a good idea. These sorts of overloads are really hard to get right, as we've seen with NumPy's long history of different override protocols (the most recent example being `__array_ufunc__`).

max-sixty commented 6 years ago

> consider some kind of history tracker as part of the metadata propagation?

Data lineage is a big, hard, unsolved problem (~for us~ internally, above both naming things and cache invalidation :) )

To second @shoyer, I think it's big and difficult enough to be a separate library

SeanDS commented 6 years ago

> are you referring to a different issue? the first post only summarizes some simple proposed rules.

No, just the proposed feature to keep or delete metadata based on the various operations. Is this behaviour already part of the library, and this issue is just to clarify the intended behaviour, or is this a feature proposal?

shoyer commented 6 years ago

> No, just the proposed feature to keep or delete metadata based on the various operations. Is this behaviour already part of the library, and this issue is just to clarify the intended behaviour, or is this a feature proposal?

We already have most of this behavior (matching what @jhamman lists in the first comment), though it isn't clearly documented. It should just work if you use xarray methods/functions.

ethan-campbell commented 6 years ago

@shoyer, I assume you are referring to the keep_attrs option. Is there a way to persist attrs during arithmetic operations? I find myself writing a bunch of boilerplate to transfer the wealth of metadata included with most netCDF files.

I realize that adding a module-level or DataArray instance-specific maintain_attrs configuration flag (as discussed in #131, #988, #1271) could be problematic, but this strikes me as complexity worth adding. The current approach of dropping all metadata (not just units) seems heavy-handed and unintuitive for new/casual users. As you mentioned in #1271, better to have stale metadata than no metadata at all.

shoyer commented 6 years ago

I would be happy to add a global keep_attrs option to xarray.set_options(), which we could use for controlling arithmetic. I'm not planning on working on it personally, but I would be happy to review a PR.

gerritholl commented 5 years ago

Another one to decide is xarray.zeros_like(...) and friends.

shoyer commented 5 years ago

> I would be happy to add a global keep_attrs option to xarray.set_options(), which we could use for controlling arithmetic. I'm not planning on working on it personally, but I would be happy to review a PR.

Note that this was implemented by @TomNicholas in https://github.com/pydata/xarray/pull/2482
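For readers landing here, a short sketch of how the two controls mentioned in this thread can be used: the per-call `keep_attrs` argument on a reduction and the global `keep_attrs` option in `xarray.set_options()`. The array contents and the `units` attribute are made up for illustration, and the example sets the option explicitly in both directions because the default behaviour may differ between xarray versions.

```python
import numpy as np
import xarray as xr

# Toy data with one attribute; the values and "units" are illustrative only.
da = xr.DataArray(np.arange(4.0), dims="x", attrs={"units": "K"})

# Per-call control on a reduction.
mean_kept = da.mean(keep_attrs=True)
mean_dropped = da.mean(keep_attrs=False)

# Global control via set_options, which also covers arithmetic.
with xr.set_options(keep_attrs=True):
    shifted_kept = da + 1
with xr.set_options(keep_attrs=False):
    shifted_dropped = da + 1

print(mean_kept.attrs, mean_dropped.attrs)
print(shifted_kept.attrs, shifted_dropped.attrs)
```

Using `set_options` as a context manager keeps the change local to one block of code, which avoids surprising other libraries that share the same process.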