Track merging datatree into xarray

TomNicholas commented 9 months ago

What is your issue?

Master issue to track progress of merging xarray-datatree into xarray main. Would close https://github.com/pydata/xarray/issues/4118 (and many similar issues), as well as one of the goals of our development roadmap.

Also see the project board for DataTree integration.

On calls in the last few dev meetings, we decided to forget about a temporary cross-repo from xarray import datatree (so this issue supercedes #7418), and just begin merging datatree into xarray main directly.

Weekly meeting

See https://github.com/pydata/xarray/issues/8747

Task list:

To happen in order:

[x] open_datatree in xarray. This doesn't need to be performant initially, and ~~it would initially return a datatree.DataTree object.~~ EDIT: We decided it should return an xarray.DataTree object, or even xarray.core.datatree.DataTree object. So we can start by just copying the basic version in datatree/io.py right now which just calls open_dataset many times. #8697
[x] Triage and fix issues: figure out which of the issues on xarray-contrib/datatree need to be fixed before the merge (if any).
[x] Merge in code for DataTree class. I suggest we do this by making one PR for each module, and ideally discussing and merging each before opening a PR for the next module. (Open to other workflow suggestions though.) The main aim here being lowering the bus factor on the code, confirming high-level design decisions, and improving details of the implementation as it goes in.

Suggested order of modules to merge:
- [x] datatree/treenode.py - defines the tree structure, without any dimensions/data attached, #8757
- [x] datatree/datatree.py - adds data to the tree structure, #8789
- [x] datatree/iterators.py - iterates over a single tree in various ways, currently copied from anytree, #8879
- [x] datatree/mapping.py - implements map_over_subtree by iterating over N trees at once https://github.com/pydata/xarray/pull/8948,
- [x] datatree/ops.py - uses map_over_subtree to map methods like .mean over whole trees (https://github.com/pydata/xarray/pull/8976),
- [x] datatree/formatting_html.py - HTML repr, works but could do with some optimization https://github.com/pydata/xarray/pull/8930,
- [x] datatree/{extensions/common}.py - miscellaneous other features e.g. attribute-like access (#8967).
[ ] Expose datatree API publicly. Actually expose open_datatree and DataTree in xarray's public API as top-level imports. The full list of things to expose is:
- [ ] open_datatree
- [ ] DataTree
- [ ] map_over_subtree
- [ ] assert_isomorphic
- [ ] register_datatree_accessor
[ ] Refactor class inheritance - Dataset/DataArray share some mixin classes (e.g. DataWithCoords), and we could probably refactor DataTree to use these too. This is low-priority but would reduce code duplication.

Can happen basically at any time or maybe in parallel with other efforts:

[x] Generalize backends to support groups. Once a basic version of xr.open_datatree exists, we can start refactoring xarray's backend classes to support a general Backend.open_datatree method for any backend that can open multiple groups. Then we can make sure this is more performant than the naive implementation, i.e. only opening the file once. See also #8994.
[ ] Support backends other than netCDF and Zarr. - e.g. grib, see https://github.com/pydata/xarray/pull/7437,
[ ] Support dask properly - Issue https://github.com/xarray-contrib/datatree/pull/97 and the (stale) PR https://github.com/xarray-contrib/datatree/pull/196 are about dask parallelization over separate nodes in the tree.
[ ] Add other new high-level API methods - Things like .reorder_nodes and ideas we've only discussed like https://github.com/xarray-contrib/datatree/issues/79 and https://github.com/xarray-contrib/datatree/issues/254 (cc @dcherian who has had useful ideas here)
[ ] Copy xarray-contrib/datatree issues over to xarray's main repository. I think this is quite important and worth doing as a record of why decisions were made. (@jhamman and @TomNicholas)
[ ] Copy over any recent bug fixes from original datatree repository
[x] Look into merging commit history of xarray-contrib/datatree. I think this would be cool but is less important than keeping the issues. (@jhamman suggested we could do this using some git wizardry that I hadn't heard of before)
[ ] xarray.tutorial.open_datatree - I've been meaning to make a tutorial datatree object for ages. There's an issue about it, but actually now I think something close to the CMIP6 ensemble data that @jbusecke and I used in our pangeo blog post would already be pretty good. Once we have this it becomes much easier to write docs about some advanced features.
[ ] Merge Docs - I've tried to write these pages so that they should slot neatly into xarray's existing docs structure. Careful reading, additions and improvements would be great though. Summary of what docs exist on this issue https://github.com/xarray-contrib/datatree/issues/61
[ ] Write a blog post on the xarray blog highlighting xarray's new functionality, and explicitly thanking the NASA team for their work. Doesn't have to be long, it can just point to the documentation. https://github.com/xarray-contrib/xarray.dev/issues/708

Anyone is welcome to help with any of this, including but not limited to @owenlittlejohns , @eni-awowale, @flamingbear (@etienneschalk maybe?).

cc also @shoyer @keewis for any thoughts as to the process.

jhamman commented 9 months ago

Copy xarray-contrib/datatree issues over to xarray's main repository. I think this is quite important and worth doing as a record of why decisions were made. (@jhamman and @TomNicholas)

I think this will require temporarily moving the datatree repository to the pydata org then transferring issues one at a time to the xarray repo. I can help with the repo move when the time comes.

Look into merging commit history of xarray-contrib/datatree. I think this would be cool but is less important than keeping the issues. (@jhamman suggested we could do this using some git wizardry that I hadn't heard of before)

There are various ways to do this and I think it would be worth attempting. It would help preserve some of the iteration that datatree went through and make sure the attribution is carried through. This blog post explains one way to do this: https://gfscott.com/blog/merge-git-repos-and-keep-commit-history/

TomNicholas commented 9 months ago

It would help preserve some of the iteration that datatree went through and make sure the attribution is carried through.

That would be nice! But at least this method merges the entire history in one go it seems. What would our process of feedback be in that case? I'm worried about just merging the whole thing in and everyone just being like "yeah looks good 👍" without anyone else actually understanding how the code works...

keewis commented 9 months ago

you could create a new feature branch on the xarray repo (just to be safe) and put the datatree code in a "staging" area. Then copying over the modules one by one might work? Not sure if that breaks git blame, though.

Edit: the merge of that feature branch into main should not be a squash-merge though

lsterzinger commented 9 months ago

Thanks for putting this together @TomNicholas

Happy to help out with this however I can. Like I mentioned in the meeting last week, I'm not super familiar with the xarray backend but definitely willing to learn.

owenlittlejohns commented 9 months ago

open_datatree in xarray. This doesn't need to be performant initially, and it would initially return a datatree.DataTree object. So we can start by just copying the basic version in datatree/io.py right now which just calls open_dataset many times.

I was taking a quick look at this. Are you essentially saying we just need to copy the contents of datatree/io.py into xarray/backends/api.py (plus necessary tests to equivalent places)? Or do some of the things in io.py need to be migrated somewhere lower level than api.py?

TomNicholas commented 9 months ago

Are you essentially saying we just need to copy the contents of datatree/io.py into xarray/backends/api.py (plus necessary tests to equivalent places)? Or do some of the things in io.py need to be migrated somewhere lower level than api.py?

There are 3 levels of integration at which we could do this:

Add a standalone open_datatree function to xarray/backends/api.py that directly copies code from datatree/io.py.
Add an open_datatree method to xarray/backends/common.py::BackendEntrypoint and implement it for Zarr and netCDF backends using the same approach as in datatree/io.py (i.e. calling open_dataset once for each group).
Same as (2) but refactor the method to only open the file once.

I think you should try (2), falling back to (1) if that's too tricky, but deliberately leave (3) for a later PR.

eschalkargans commented 9 months ago

Hello,

If it can help, I found myself in a situation, quite similar however slightly different, where I had to merge two repos A and B into one (keeping A and archiving B), moving contents of A and B into new subfolders of the A repo, eg A/a, A/b. This differs in that we don't want to put the xarray code into a new xarray subdirectory. But maybe the procedure would be hopefully similar.

Here is a gist summarizing my procedure to do so: https://gist.github.com/eschalkargans/318d83e58d63d83454d1f8a497786a8d

keewis commented 8 months ago

I tried my hand at doing the merge, here's the result: datatree on keewis/xarray. This required two extra commits: one for moving the whole repository to a subdirectory (xarray/datatree_, note the underscore to mark it as temporary), and one merge commit.

If anyone wants to try, after adding xarray-contrib/datatree as a remote named datatree and switching to a feature branch I called:

git merge datatree/prepare-for-migration --no-commit --allow-unrelated-histories

where datatree/prepare-for-migration contains the commit moving the repository to the subdirectory.

flamingbear commented 8 months ago

So I wasn't able to come up with anything more clever¹ than what is above. If bringing the history over in one go to a temporary location is fine, I presume renaming the files into locations as we migrate will preserve the history as we move forward. I guess the next question is whether to feature branch after the import and merge to main or to try to do the migration steps into main proper? Pros and cons to both, but would lean towards directly into main. Thoughts/feelings?

I wanted to bring in the history with each PR to main, but after actually thinking about how that would work, it doesn't.

TomNicholas commented 8 months ago

Thanks all three of you for trying this!

I tried my hand at doing the merge, here's the result: datatree on keewis/xarray. This required two extra commits: one for moving the whole repository to a subdirectory (xarray/datatree_, note the underscore to mark it as temporary), and one merge commit.

preserve the history

Is it just me or has this approach not actually preserved the history at all? All the datatree code seems to be squashed into one massive commit: https://github.com/keewis/xarray/commit/4227b38bdf387864a1397fb3f7325d5b8dc02b64

I guess the next question is whether to feature branch after the import and merge to main or to try to do the migration steps into main proper?

I think we don't really need a feature branch as there are no backwards compatibility issues? (@shoyer unless you have a preference?) Also the datatree code can be merged into main without actually exposing DataTree as public API until we're ready.

@eschalkargans we appreciate your interest in this! FYI the rest of us on this thread met yesterday in xarray's bi-weekly community dev call, which you would be more than welcome to join for. 😄 Regardless we will try to write out any decisions we make in that meeting also on github for reference.

keewis commented 8 months ago

Is it just me or has this approach not actually preserved the history at all?

no, that's just the commit for importing the datatree repository. Try looking here: https://github.com/keewis/xarray/blame/datatree/xarray/datatree_/datatree/formatting.py

Edit: I don't have any preferences about feature branch vs. merging directly into main. However, the downside of the separate feature branch is that we need to make sure we do a normal merge instead of a squash merge, plus we'd need a lot of merge commits to keep up with main.

kmuehlbauer commented 8 months ago

Is it just me or has this approach not actually preserved the history at all?

no, that's just the commit for importing the datatree repository. Try looking here: https://github.com/keewis/xarray/blame/datatree/xarray/datatree_/datatree/formatting.py

That might be just minor, but the PR links are pointing now to the xarray repo unrelated PR's. Not sure we can do anything about. Could these links (eg. #50) be stripped or relocated to datatree repo during the process to avoid any issues?

slevang commented 8 months ago

That might be just minor, but the PR links are pointing now to the xarray repo unrelated PR's. Not sure we can do anything about. Could these links (eg. #50) be stripped or relocated to datatree repo during the process to avoid any issues?

We could rewrite the commit message history, using a tool like git-filter-repo, to explicitly repoint e.g. #300 to https://github.com/xarray-contrib/datatree/pull/300. These look exactly the same in Github due to the automatic markdown parsing. I tried a simple #==>https://github.com/xarray-contrib/datatree/pull/ replacement pattern which seems to work but could do something smarter with regex to make sure we don't mess with isolated uses of # in commit messages.

kmuehlbauer commented 8 months ago

@slevang Great this works. Could we also change comments in code (where maybe issues) are linked the same way?

slevang commented 8 months ago

Yep, there is a --replace-text option which handles file contents in addition to --replace-message.

https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html

flamingbear commented 8 months ago

steps to prepare datatree for xarray import

clone datatree repository locally.
use git-filter-repo to change tag names to `legacy-datatree-` (suggestions if not this name?) e.g. v0.0.13 => legacy-datatree-v0.0.13
```
git filter-repo --tag-rename '':'legacy-datatree-'
```

use git-filter-repo to rewrite PR titles that use github automatic markdown to full paths in datatree.

The pull request changes are replaced with this --replace-message file:

# standard github style pull request
regex:\(#(\d+)\)==>https://github.com/xarray-contrib/datatree/pull/\1

# I found a few like "Merge pull request #11 from TomNicholas/single_datanode_class"
regex:pull request #(\d+)==>https://github.com/xarray-contrib/datatree/pull/\1

Comments in code should point to issues --replace-text file:

# standard comment change "  # see issue #38"
regex:(\s*?#.*[ ])#(\d+)==>\1https://github.com/xarray-contrib/datatree/issues/\2

# some like "    @pytest.mark.xfail(reason="Indexing needs to return whole tree (GH #77)")"
regex:(\(GH #(\d+)\))==>(GH https://github.com/xarray-contrib/datatree/issues/\2)

Run the git-filter-repo command (--dry-run is be added to examine the changes that are made) To do the rewriting I did it from my cloned repo directly.
```
git-filter-repo --replace-text ../replace-text-issue --replace-message ../replace-message-pr
```
You can see the results here https://github.com/flamingbear/rewritten-datatree commits on main all have links to xarray-contrib/datatree#PR now

I think if you take these steps before running the @keewis import steps, you might be ready to go.

flamingbear commented 8 months ago

Also, I added steps 6. and 7. to make sure they worked as I expected

push rewritten-datatree code down into temp directory. As an alternate to a new commit, you can use git-filter-repo. From the rewritten-datatree directory run.
```
  git-filter-repo --to-subdirectory xarray/datatree_
```
Merge the prepared datatree into xarray.
- add remote named prepared-datatree to local xarray repo
- check out branch datatree-import
- git merge prepared-datatree/main --no-commit --allow-unrelated-histories
- push up to fork https://github.com/flamingbear/original-fork-xarray

So my fork is (close to) what I would expect xarray to look like after we merge the repositories. And I can start work on the first code task open_datatree (that I will copy over when we do this for real)

A set of questions: How long will datatree be alive during this process?
How do we keep these repositories in sync?
Can we pause development on datatree? How do we move the github history over? Or do we just move open issues and PRs?

slevang commented 8 months ago

I can't think of a sensible way to keep these in sync. Archiving datatree as soon as the merge happens would force future development into xarray. The datatree repo would remain readable for referencing old issues.

Looks like we could move open issues, of which there are currently 57, with something like this.

Moving PRs seems much harder because every PR would have to be rebased onto the xarray repo somehow. There are 21 open PRs, and only 7 opened in the last 6 months. Maybe this is better handled manually by individual contributors?

TomNicholas commented 8 months ago

@flamingbear @slevang this is excellent!

A set of questions:

How long will datatree be alive during this process?

To download? The whole time, so that people can use it. Under active development? I don't think it really needs to be.

How do we keep these repositories in sync?

I'm not sure it's hugely important to. The rate of development on datatree has slowed to a crawl (which is why I think it's mature enough to start moving it - people seem pretty happy to use it as is).

Can we pause development on datatree?

I would be happy to put some kind of disclaimer on the datatree readme saying that whilst we accept new PRs the repo will be deprecated and any PRs since [date we begin merging into xarray] might be later copied across to xarray without full git attribution.

How do we move the github history over? Or do we just move open issues and PRs?

Maybe [open PRs are] better handled manually by individual contributors?

Open issues are important to retain ideally, as are closed issues (they document the design process, and I tried hard to write that down in public). Open PRs I'm not so bothered about. Once you rule out the open PRs that are either (a) me, (b) stale, (c) trivial, there are only like 2 or 3 left, which can be handled manually.

Other opinions welcome, but I would suggest that we just release a new version of datatree (as it's overdue), and once we've done that we can put a disclaimer on the repo?

TomNicholas commented 8 months ago

we just release a new version of datatree (as it's overdue), and once we've done that we can put a disclaimer on the repo?

I've merged a couple of things, released v0.0.14, and added a migration notice to datatree's readme.

flamingbear commented 8 months ago

I've merged a couple of things, released v0.0.14, and added a migration notice to datatree's readme.

Excellent. I will start steps 1-7 again with the latest xarray and datatree.

flamingbear commented 8 months ago

There's the PR for the import https://github.com/pydata/xarray/pull/8656 [edit: looks like I will need to give it some attention for CI]

flamingbear commented 8 months ago

Is there a tool you all use to test CI locally? I force pushed myself into a hole and would like to avoid that moving forward.

TomNicholas commented 8 months ago

Is there a tool you all use to test CI locally?

I just run pytest locally, then push to a PR and hope for the best 😅 If anyone has a nifty tool it would be @max-sixty or @keewis though

flamingbear commented 8 months ago

I'm specifically thinking of the "build xarray from source" to make sure I haven't introduced circular dependencies xarray -> datatree -> xarray in the open_datatree step I'm working on.

keewis commented 8 months ago

I don't, I usually just follow the setup instructions from the document you posted: create conda env, activate it, install xarray in editable mode, run pytest. And really, this is what CI is doing, as well, just with a matrix of OSs / python versions.

TomNicholas commented 7 months ago

We're going to have a weekly meeting about this integration - see https://github.com/pydata/xarray/issues/8747

TomNicholas commented 2 months ago

So in order to make a first "beta" release of datatree functionality in xarray main, we need to:

Coordinate inheritance:
- [x] Decide on and implement coordinate inheritance (#9063 was a blocker so thank you @shoyer)
- [x] Clarify which coordinates are inherited in the repr (#9463, https://github.com/pydata/xarray/pull/9470)
- [ ] ...and in assertions https://github.com/pydata/xarray/issues/9215, https://github.com/pydata/xarray/pull/9473
- [ ] Prevent duplicate coordinates https://github.com/pydata/xarray/issues/9475
- [ ] Minor alterations to docs to reflect above fixes
[x] Fix some issues with mutation + other miscellaneous but important bugs:
[ ] Merge the docs PR that actually publicly exposes datatree https://github.com/pydata/xarray/pull/9033
- [x] Some wording changes are needed first to reflect the fact coordinates are now inherited
[x] Add the open_groups function https://github.com/pydata/xarray/issues/9137 (@eni-awowale has started #9243
- [x] open_groups should also work for open zarr stores https://github.com/pydata/xarray/pull/9469
[x] Fix the ability to add backend keyword args https://github.com/pydata/xarray/issues/9135
[ ] Move/close old issues from xarray-datatree repo, making sure each one is represented by a new issue on xarray main repo
- [ ] Archive the original repository
[x] Docs explanation of coordinate inheritance (initial explanation in https://github.com/pydata/xarray/pull/9033)

There's a long tail of work after this but IMO this is what's required for the MVP.

cc @shoyer @flamingbear @owenlittlejohns @eni-awowale @keewis

eni-awowale commented 2 months ago

@TomNicholas One more thing to add to our MVP list is to migrate all of the issues in DataTree over to xarray.

pydata / xarray