sunpy / ndcube

A base package for multi-dimensional contiguous and non-contiguous coordinate-aware arrays. "Maintainers": @danryanirish & @Cadair
http://docs.sunpy.org/projects/ndcube/
BSD 2-Clause "Simplified" License
44 stars 47 forks source link

Create a to/from xarray method/function #222

Open DanRyanIrish opened 4 years ago

DanRyanIrish commented 4 years ago

Description

Support of gWCS in ndcube enables the possibility of converting between an ndcube and xarray objects. Although ndcube and xarray will still be preferred in different situations, the ability oto convert between them may still be highly useful. Is this something ndcube should support?

Cadair commented 2 years ago

We had a very productive discussion with Martin Durant @martindurant and Jim Bednar at the sunpy coordination meeting today. I think that as far as WCS / xarray integration goals go from an ndcube perspective this issue is probably the best place to start.

As I mentioned in the discussion today I am very happy to bring the WCS expertise to any effort to build out xarray support for WCSes. I suggest we use this issue as a good place to coordinate this work and discussion with other interested parties.

Also relevant to this is @eteq's issue https://github.com/pydata/xarray/issues/3620

martindurant commented 2 years ago

cc @jbednar , who will get in touch when his team is ready

ianthomas23 commented 2 years ago

Hi all! I am the third member of the @jbednar and @martindurant trio. Sorry I couldn't attend the discussion a couple of weeks ago, but I have watched the video of the conversation.

I am going to be at the NumFOCUS Project Summit in 2 weeks and hope to have a face-to-face chat with @wtbarnes. Hopefully we can involve the xarray folks too. I am happy to chat remotely before that if it helps!

Cadair commented 2 years ago

hey @ianthomas23 :wave:

Unfortunately I wont be at the NumFOCUS summit, but I am happy to chat either before or afterwards. I am super keen to see how far we can get with WCS + xarray and how to integrate it into ndcube.

Do you have a plan for what you are going to be working on yet?

ianthomas23 commented 2 years ago

Do you have a plan for what you are going to be working on yet?

I think we have to have a pragmatic 2-stage approach. Firstly write some experimental code using WCS (astropy.wcs?) that is reasonably correct and reasonably performant so that we can progress with the rest of our project. This gives us a bit more time and space for stage 2 which is to find the correct long-term solution. Exactly what stage 2 looks like has yet to be determined, as does where the solution lies and who works on it!

We are hoping to talk to some xarray devs next week sometime to get their latest updates. Maybe we should have a chat after that?

benbovy commented 2 years ago

We are hoping to talk to some xarray devs next week sometime to get their latest updates. Maybe we should have a chat after that?

I'd be interested to join the discussion. We've made progress on Xarray custom indexes recently and hopefully it should be ready for "public" use (although still very experimental) with the next release. I'm happy to present about the latest updates on this front. I'd also like to allocate some time on working on or helping domain experts experimenting with custom indexes. From what I understand in https://github.com/pydata/xarray/issues/3620, WCS + xarray looks like a nice fit for testing this feature!

wtbarnes commented 2 years ago

Hi @ianthomas23. Great to hear you'll be at the NumFocus Summit next week. Though I'll be the only SunPy maintainer present, I believe there will also be several astropy maintainers there in person as well who may be interested in this discussion or who can at least provide some WCS expertise.

ianthomas23 commented 1 year ago

Following the in-person meeting last week at the NumFOCUS Project Summit, I will progress with some standalone code to go from FITS file to Xarray using flexible indices to generate WCS coordinates using the AstroPy WCS module. This will probably only support the subset of WCS that our current project requires. Other meeting attendees to provide guidance: @eteq (AstroPy), @wtbarnes and @Cadair (SunPy), @andersy005 (Xarray).

The subsequent longer-term goal is that this code will become part of wider ndcube/Xarray interoperability.

ianthomas23 commented 1 year ago

Here is some progress to report: https://nbviewer.org/gist/ianthomas23/bbf85d1a38f8f161a2c1f641d9465096.

This is a Jupyter Notebook that uses an xarray.Accessor to convert from pixel to world coordinates using astropy transforms. It follows the approach of https://github.com/pydata/xarray/issues/3620#issuecomment-855710036 in using a pixel_to_world accessor function that returns a new xarray.Dataset with WCS coords. The next comment in that issue describes an alternative approach that combines both pixel and world coordinates in the same Dataset.

This is pretty much everything that I need in the short term. It is quite specific to my test data and will need generalising to other use cases. I have played around a bit with a functional index (xarray.core.indexes.Index) a little as well, to allow e.g. slicing of pixel indexes based on WCS values, but I don't have anything concrete from that as I don't think I need it.

Probably not strictly relevant to target audience here, but if you download and run the notebook locally you can interactively zoom and pan the image using the mouse and controls on the right.

ianthomas23 commented 1 year ago

There is also this recent discussion on the Pangeo discourse https://discourse.pangeo.io/t/vector-data-cubes/2904 about vector data cubes in xarray that might be relevant to the long-term goal of greater ndcube/xarray interoperability.

Cadair commented 1 year ago

Hi @ianthomas23 great to see some progress :rocket:

Is it possible with a functional index to not have to compute the grid of all the coordinates (in a way more nuanced than making a delayed dask call)? For some of my datasets the coordinate grid is prohibitively large, and one of the key things I think that the new functionality in xarray provides would be the ability to only call the WCS when computing the world coordinates is required?

I am curious about why you chose to use WCSPixel2WorldTransform rather than using the WCS classes methods? Specifically I expect pixel_to_world_values would be useful for you.

benbovy commented 1 year ago

Thanks for sharing @ianthomas23.

I wanted to give a try at creating an Xarray WCSIndex, so I started from your notebook and uploaded a new version here.

In this version, I updated the computation of the world coordinates so that these are fully lazy (i.e., x_wcs and y_wcs dask arrays are created from scratch). Hopefully it should scale to large grids reasonably well. I'm sure it can be further optimized and improved, though, there's not much functionality. I only implemented point-wise selection of world coordinates (not sure how to implement selection of slices from WCS coordinates).

Cadair commented 1 year ago

hey @benbovy thanks for the example with the lazy loading! I assume that all the xarray machinery requires an array of coordinates to be generated, lazy or not? There's no functionality for computing them on the fly?

Do you happen to have a link to the best place for me to learn more about what's shown in your example, i.e how Index classes work and what custom ones should support? (Basically asking so I can attempt to answer how to do other selections).

ianthomas23 commented 1 year ago

I am curious about why you chose to use WCSPixel2WorldTransform rather than using the WCS classes methods? Specifically I expect pixel_to_world_values would be useful for you.

@Cadair I just used whatever I found that worked for me. I am happy to take advice on a better approach.

benbovy commented 1 year ago

I assume that all the xarray machinery requires an array of coordinates to be generated, lazy or not? There's no functionality for computing them on the fly?

Yes, an Xarray index must always be attached to one or more coordinates in a Dataset or DataArray. That said, the model is quite flexible regarding how to arrange coordinates and indexes and what to do in indexes.

Actually, it could be possible to create a WCSIndex from the x and y pixel coordinates and skip creating the x_wcs and y_wcs coordinates entirely. You can see in the notebook that the WCSIndex does not make any use of the Xarray world coordinates.

WCSIndex might encapsulate two xarray.indexes.PandasIndex instances for the two x and y pixel coordinates so that they can be used like "classic" dimension coordinates, i.e., for selection, alignment, etc. In addition, WCSIndex.wcs could be used for selection based on world coordinates, like in the notebook but maybe using an API like ds.sel(x=x_world_labels, y=y_world_labels, method="world_nearest"). The approach is somewhat similar than the one used the geo (vector) index here: https://github.com/martinfleis/xvec/pull/7.

There's an open PR in Xarray to make it easier creating custom indexes that encapsulate multiple PandasIndex instances: https://github.com/pydata/xarray/pull/7182

Although it might be possible to reuse the method parameter of .sel() for interpreting the given labels either as pixel or world values (cf. example above), a cleaner solution may be to provide custom options. But it is not supported yet (https://github.com/pydata/xarray/issues/7099).

Do you happen to have a link to the best place for me to learn more about what's shown in your example, i.e how Index classes work and what custom ones should support?

Not yet, unfortunately. This is still very much work in progress. Some useful links:

martindurant commented 1 year ago

I am fascinated by this ongoing discussion, thank you everyone for taking part!

I have a side-question on WCS, and thought I could add it here. Suppose I have a 3D cube of data, where each 2D slice is a simple image with a separate analytical WCS, all with the same projection but different parameters. Is it possible to construct a single WCS for the whole cube, with the WCS parameters being a lookup table dependent on the z location in the cube?

This is the kind of situation that kerchunk may be faced with.

benbovy commented 1 year ago

Is it possible to construct a single WCS for the whole cube, with the WCS parameters being a lookup table dependent on the z location in the cube?

I guess it would be possible to store a list of astropy.wcs.WCS objects for the z dimension in the wcs Dataset accessor and/or provide a StackedWCSIndex that encapsulates (1) a list of WCSIndex instances and (2) an additional PandasIndex for the z dimension? Unless there's a better way of using astropy.wcs.WCS?

(Ideally all WCSIndex instances in a StackedWCSIndex would reuse the same two PandasIndex objects for the x/y pixel coordinates).

Cadair commented 1 year ago

Is it possible to construct a single WCS for the whole cube, with the WCS parameters being a lookup table dependent on the z location in the cube?

With astropy.wcs.WCS it is not. It's technically possible with gWCS which conforms to the same WCS API, but it's pretty experimental.

I think @benbovy's approach of a list of WCSIndexes could work, but I am not sure I have enough grounding in xarrray and indexes to know for sure.

As an aside, I have complex 5D WCSes where the pointing (wcs parameters) change along two of the axes (which are time and wavelength) as well as more intertwined ones where pointing changes along one of the spatial axes and the time axes (of which there are two) so these things can get really complex!!

ianthomas23 commented 1 year ago

Here is a new notebook demonstrating multidimensional lazy-evaluated WCS coordinates in an xarray.Dataset. The data array has shape (nwavelength, ntime, ny, nx) as there are nwavelength sensors taking images of shape (ny, nx) at ntime regular time intervals. Of the 11 WCS attributes required to construct a WCS transformation, 4 of them (all strings) are constant for this data and therefore stored as Dataset attributes. The other 7 (all floats) are different for each image so each is stored as a separate DataArray of shape (nwavelength, ntime). The x and y WCS coordinates are each a dask.array of shape (nwavelength, ntime, ny, nx) and are lazy-evaluated on a per-image basis given a wavelength index and time index. The astropy.WCS transformation objects are created on demand using the 4 constant plus 7 varying WCS attributes.

The example uses an xarray Accessor but not an Index. Example data is pretty small (2, 10, 4096, 4096). I have gone up to (6, 1246, 4096, 4096) with a preprocessed local zarr dataset of ~65 GB where each (ny, nx) image is a separate chunk and the performance scales well.

The notebook isn't live so nothing happens when you play with the widgets at the bottom. I can do a live demo of this, or the bigger one, if anyone is interested.