pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Solar astronomy data on GCS bucket #269

Closed mrocklin closed 5 years ago

mrocklin commented 6 years ago

@MSKirk was kind enough to upload a number of FITS files that correspond to a solar astronomy dataset.

I can load a little bit of the data with the following steps:

import gcsfs
fs = gcsfs.GCSFileSystem()
urls = ['https://storage.googleapis.com/' + fn for fn in fs.ls('pangeo-data/SDO_AIA_Images/094')]

from astropy.io import fits
[img] = fits.open(urls[0])
x = img.data

cc @Cadair, @DanRyanIrish, @martindurant, @mmccarty

martindurant commented 6 years ago

Some FITS I/O functions work with arbitrary file-like objects, some do not. This means that we could relatively easily construct dask arrays from sets of FITS files:

# gcs here is a gcsfs.GCSFileSystem instance
with gcs.open('pangeo-data/SDO_AIA_Images/094/aia_lev1_94a_2012_09_23t05_38_37_12z_image_lev1_fits.fits') as f:
    img = fits.getdata(f)

edit: let me flesh that out a bit: we could use

import dask.bytes
import dask.array as da
import numpy as np
from astropy.io import fits
from dask import delayed

files = dask.bytes.open_files('gcs://pangeo-data/SDO_AIA_Images/094/*.fits')
with files[0] as f:
    # grab one header; assume we know which extension we need
    head = fits.getheader(f)
dtype = np.float32  # not sure how to derive this from the header
shape = head['NAXIS2'], head['NAXIS1']  # 2D

def file_to_arr(of):
    # `of` is an OpenFile, used as a context manager
    with of as f:
        return fits.getdata(f)

arrs = [da.from_delayed(delayed(file_to_arr)(of), shape=shape, dtype=dtype)
        for of in files]
arr = da.stack(arrs)

Cadair commented 6 years ago

This is awesome! I will have a go at utilizing some of my DKIST code for this tomorrow.

rabernat commented 6 years ago

This is great! I welcome the involvement of astronomers here.

As the number of datasets continues to grow, it would be good to establish an authoritative catalog. Hopefully this can be intake (#39), though I doubt it works with the FITS format.

MSKirk commented 6 years ago

I have about 460 GB of data (12,600 images) queued up for upload to the pangeo-data bucket. These images haven't been thoughtfully chosen; they were just sitting on an external hard disk of mine. They represent 7 different EUV wavelengths in which the SDO AIA instrument observes the Sun. Let me know if you have any data-specific questions.

martindurant commented 6 years ago

There is no plugin for FITS in Intake yet; this is a question of demand :)

The outstanding problem is actually the coordinate systems (WCS) that FITS files typically bundle, which in general are not aligned with the axes, and may not even be linear. It would be OK to have an array-and-WCS pair, but that model doesn't fit easily into the xarray coordinate-axes model.

Cadair commented 6 years ago

@MSKirk are they level 1.5 images? So effectively all share the same WCS? (Along the time axis)

mrocklin commented 6 years ago

I suggest that we don't bother with intake for the moment. I don't think that it's on the critical path to exploring things here.

I think that the next step is for people to write a couple of simple analyses using this data so that others can see what is possible, and then hopefully build off of that original analysis.

MSKirk commented 6 years ago

@Cadair, these are level 1 images, but they are from a time of stability for the spacecraft, so any corrections that the prep routines make are going to be relatively small. When I am back at my desk, I can work on a level 1.5 dataset.

martindurant commented 6 years ago

Actually, straightforward fits.open also seems to work on a gcsfs or dask file, so we can open the files directly.
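
For instance (a minimal sketch; the anonymous-access token and the HDU index are assumptions on my part):

import gcsfs
from astropy.io import fits

fs = gcsfs.GCSFileSystem(token='anon')  # anonymous read access (assumption)
path = 'pangeo-data/SDO_AIA_Images/094/aia_lev1_94a_2012_09_23t05_38_37_12z_image_lev1_fits.fits'
with fs.open(path) as f:
    hdul = fits.open(f)
    data = hdul[0].data  # the image may live in a different HDU for some files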

Cadair commented 6 years ago

So I have been playing for the last hour or so, and I have got as far as this: http://pangeo.pydata.org/user/cadair/lab/tree/examples%2FAIA1.ipynb (don't know if you can actually follow that link?)

I made a conda env to run the thing, but it seems to have issues with that. What's the best way to handle deps and such?

martindurant commented 6 years ago

We cannot follow that link - perhaps make it a gist?

mrocklin commented 6 years ago

You should be able to conda/pip install into your notebook directory. However, in order to use distributed workers, you will also need to tell them which libraries to install when they start up. You can do this easily by modifying the worker-template.yaml file in your home directory. You'll want to extend the env: values as follows:

    env:
      - name: GCSFUSE_BUCKET  # this was here before
        value: pangeo-data
      - name: EXTRA_PIP_PACKAGES
        value: foo bar git+https://github.com/user/package.git@branch
      - name: EXTRA_CONDA_PACKAGES
        value: baz quuz -c conda-forge

I generally prefer pip packages over conda in this case, because the installation happens for every worker every time it starts up, and conda takes a while.

Cadair commented 6 years ago

I made it work :tada: https://gist.github.com/Cadair/f65fbcb593c6d70eb8f3f3e48324f7c3

I used a cluster to read all the headers of all the files and then used this information to construct a WCS object and fed it all into NDCube, so we now have a coordinate aware Dask array.

I did briefly try to construct a Dask DataFrame out of all the headers in all the FITS files, but I didn't get it to work and I have a limited amount of time for this today!

Cadair commented 6 years ago

(oh the pip requirement is just ndcube as it pulls in everything else)

martindurant commented 6 years ago

@Cadair, in this particular example the coordinates are well aligned with the axes, so this would be a good case for xarray as the container, especially if you were to have multiple cubes, one per pass-band. Not that I have anything against ndcube; I just don't know (yet) what advantages it has.

Does ndcube reference the dask-array lazily, or does the constructor ndcube.NDCube(arr, wcs=wcs) embed an asarray that would force the array to be materialized?

Cadair commented 6 years ago

While the coordinates are in practice well aligned, the spatial coordinates are subject to a TAN map projection, so they are not really appropriate for xarray, as we discussed on the call. NDCube has just been built to be coordinate aware through astropy.wcs so that it can do the coordinate transforms in the "native" way.
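
To make the coupling concrete, here is a hypothetical TAN-projection WCS with roughly AIA-like values (the plate scale and reference pixel are assumptions):

from astropy.wcs import WCS

wcs = WCS(naxis=2)
wcs.wcs.ctype = ['HPLN-TAN', 'HPLT-TAN']  # helioprojective lon/lat with a TAN projection
wcs.wcs.cunit = ['arcsec', 'arcsec']
wcs.wcs.cdelt = [0.6, 0.6]                # ~0.6 arcsec/pixel plate scale (assumption)
wcs.wcs.crpix = [2048.5, 2048.5]          # reference pixel at the center of a 4096x4096 image
wcs.wcs.crval = [0.0, 0.0]

# both world coordinates depend on both pixel coordinates, which is
# what makes plain per-axis xarray coordinates a poor fit here
lon, lat = wcs.wcs_pix2world(0.0, 0.0, 0)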

NDCube should only reference the array lazily; I haven't encountered anywhere where it doesn't yet.

rabernat commented 6 years ago

fed it all into NDCube, so we now have a coordinate aware Dask array.

As a physical oceanographer / climate modeler, it is very interesting to me to learn how other disciplines handle their datasets. In this case, ndcube sounds quite similar to xarray, so there is possibly some opportunity for synergy / collaboration here.

the spatial coordinates are subject to a TAN map projection, so they are not really appropriate for xarray

This is not fundamentally different from the situation with geographic data. Earth observations and models use different projections and coordinate systems, and we often have to transform from one coordinate system to another. Xarray is not limited to Cartesian geometry; it supports auxiliary non-dimension coordinates for the purpose of storing such information. However, since these projection / regridding operations are often domain-specific, they are not included in xarray. We use packages like cartopy or xesmf to handle the projection problem; these packages can work directly with xarray data structures, making them more flexible and less error-prone.

People often assume xarray's data model is more limited than it really is. I think this is a problem with our documentation. It's simply a general purpose container for storing labeled, multi-dimensional arrays. Being built closely around dask from the beginning, however, gives it some real advantages.

rabernat commented 6 years ago

To clarify, I am not suggesting anyone abandon widely established and successful packages such as ndcube. Just commenting out of interest for the sake of discussion. I'm sure you're very happy with your software stack. 😉

Cadair commented 6 years ago

I am very interested in potential uses of xarray for this type of data. I am also very interested to hear how you might handle the problem of two of the axes having their coordinates coupled (in this case via a map projection), because last time I looked into xarray I couldn't find a way of reconciling this with the way xarray works.

ndcube is still quite new (although built on top of older astropy tech), and we did evaluate xarray for the problem when we started working on it a little over a year ago. I think one of the major limitations could be the practical requirement to use astropy WCS functionality for the computation of some or all of the coordinate axes.

Cadair commented 6 years ago

I have now updated the gist with what I think is a good implementation of "don't load the whole FITS array into memory" when doing a slicing operation. I would love to hear what @martindurant or anyone else thinks about this way of doing it.

rabernat commented 6 years ago

I am also very interested to hear how you might handle the problem of two of the axes having their coordinates coupled (in this case via map projection) because last time I looked into xarray I couldn't find a way of reconciling this with the way xarray works.

So in this case, we would distinguish between the logical coordinates of the array (must be 1D; one per axis) and the physical coordinate in space (possibly 2D or higher dimensionality). There is an extensive example of this in the xarray docs: http://xarray.pydata.org/en/latest/examples/multidimensional-coords.html

A typical climate dataset (used in the example above) would have the following structure:

<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) datetime64[ns] 1980-09-16T12:00:00 1980-10-17 ...
    xc       (y, x) float64 189.2 189.4 189.6 189.7 189.9 190.1 190.2 190.4 ...
    yc       (y, x) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 18.25 ...
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...

Tair is the air temperature, and its dimensions are (time, y, x). There aren't any actual coordinates for x and y; they are just logical array indices. However, there are additional "non-dimension coordinates" xc and yc, both with dimensions (y, x), which hold the longitudes and latitudes of the points. Working with cartopy, xarray can plot the data in geographical coordinates like this:

[image: Tair plotted on a map in geographic coordinates]
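
In code, that plot is produced roughly like this (a sketch using the RASM tutorial dataset from the xarray docs, which matches the repr above; the styling details are illustrative):

import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import xarray as xr

ds = xr.tutorial.open_dataset('rasm')  # the dataset shown above
ax = plt.axes(projection=ccrs.PlateCarree())
# use the 2D non-dimension coordinates xc/yc as the plotting coordinates
ds.Tair.isel(time=0).plot.pcolormesh(ax=ax, x='xc', y='yc',
                                     transform=ccrs.PlateCarree())
ax.coastlines()
plt.show()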

We are still a long way from full "multidimensional coordinate" support. These open issues describe some improvements we would like to make: pydata/xarray#2028, pydata/xarray#1325, pydata/xarray#1553. Progress on these issues is limited by developer time; that's why I am eager to reach out to scientific communities with similar needs to see if there are ways we can work together. Perhaps the transformation routines you have in ndcube could help resolve some xarray issues!

Will any of you folks be at SciPy? I personally won't, but @jhamman and lots of other pangeo / xarray contributors will. Perhaps we should consider a Birds of a Feather proposal on labeled nd-arrays.

Cadair commented 6 years ago

That's really cool! I will have to take a look at some point. I think what astro folks would need is for xc and yc in your example to be calculated on demand from a function rather than having to store them as arrays.

I will be at SciPy this year, a BoF sounds like a pretty good idea!

jhamman commented 6 years ago

+1 for a BoF event at SciPy. Proposals are due June 27.

rabernat commented 6 years ago

@Cadair, I just checked the cluster status and I noticed something odd

$ kubectl get pods --namespace=pangeo
NAME                                           READY     STATUS      RESTARTS   AGE
dask-cadair-5014ba80-f2r4lw                    0/1       OutOfcpu    0          1h
dask-cadair-5014ba80-fbczvs                    0/1       OutOfcpu    0          1h
dask-cadair-5014ba80-fd5d8j                    0/1       OutOfcpu    0          1h
dask-cadair-5014ba80-fgsczn                    0/1       OutOfcpu    0          1h
dask-cadair-5014ba80-fgvblt                    0/1       OutOfcpu    0          1h
dask-cadair-5014ba80-fxlrvj                    0/1       OutOfcpu    0          1h

Did you notice anything strange? I've not encountered that status before.

Cadair commented 6 years ago

Yeah, I have been having a few problems with things behaving strangely on me. I am not really sure what to report, as I don't fully understand what's going on :laughing:

rabernat commented 6 years ago

My understanding of kubernetes is also low. But my impression is that your dask worker pods are exceeding their CPU quota and being limited or stopped. I'm not sure how this is possible. Does your code use multithreading?

Cadair commented 6 years ago

No, I am just using plain dask array and a map with a plain python function (doing bytes io with dask) and gather on the distributed client.

martindurant commented 6 years ago

On OutOfCpu: perhaps https://github.com/kubernetes/kubernetes/issues/42433 is relevant?

rabernat commented 6 years ago

I'm currently at the NSF EarthCube meeting, listening to a talk by Dean Pesnell, the project scientist for the Solar Dynamics Observatory. Can someone explain to me whether and how the dataset that has been placed in pangeo relates to the SDO?

MSKirk commented 6 years ago

They are SDO images. I am also at EarthCube. We should chat.



MSKirk commented 6 years ago

Just FYI, the upload of SDO AIA images is complete. There are 12,600 images in total representing 7 different ultraviolet light wavelength filters (broken out by folder) and 460 GB of data. This set is about 0.004% of the total data archive. Let me know if you have any questions about them.

martindurant commented 6 years ago

Pretty! [image: figure_1]

pbranson commented 6 years ago

Nice!


shoyer commented 6 years ago

That's really cool! I will have to take a look at some point. I think what astro folks would need is for xc and yc in your example to be calculated on demand from a function rather than having to store them as arrays.

Lazy coordinates are also quite relevant for climate science. For example, I'm pretty sure that Iris has support for this sort of coordinate.

The logical way to do this in xarray would be to write xc and yc as lazy computations with dask arrays. However, to make this convenient, dask really needs to rewrite expressions like ufunc(data)[key] into ufunc(data[key]) (https://github.com/dask/dask/issues/746), so you don't need to compute a complete chunk when you only index part of it.
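
A toy sketch of that idea (the "projection" function here is entirely made up):

import dask.array as da

ny, nx = 205, 275
j = da.arange(ny, chunks=64)
i = da.arange(nx, chunks=64)

# made-up stand-in for a real projection: both world coordinates
# depend on both logical axes
xc = 189.2 + 0.02 * i[None, :] + 0.005 * j[:, None]
yc = 16.53 + 0.25 * j[:, None] - 0.01 * i[None, :]

# nothing is evaluated until a piece is requested...
corner = xc[:2, :2].compute()
# ...though dask currently materializes whole chunks for such a slice,
# which is exactly the ufunc(data)[key] -> ufunc(data[key]) issue linked above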

mrocklin commented 6 years ago

It seems like the next thing to do here is for someone (preferably one of the professional astronomers present in this issue) to have a go at using http://pangeo.pydata.org to run a simple computation and report back any challenges encountered (like missing software packages). Ideally we would iterate a bit here and then produce a small notebook to be placed in the official examples directory. We would then point other astronomers to that directory.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

MSKirk commented 6 years ago

Sorry I have been distracted by other projects in the last couple of months. @Cadair have you been continuing to work with solar images in dask?

martindurant commented 6 years ago

I hope you are following https://github.com/ContinuumIO/intake-astro, which now has a partitioned FITS array and table reader that may already be generally useful.

rabernat commented 6 years ago

Hi everyone. I was invited to give a presentation at an EarthCube workshop organized by Gelu Nita, "Towards Integration of Heliophysics, Data, Models, and Analysis Tools," at the New Jersey Institute of Technology on Nov. 14. I would really like to be able to show off some simple example of pangeo integrating with solar astronomy. @MSKirk, can we revive this issue and continue to pursue this? It sounds like there has been some progress on the dask side.

I recommend we coordinate with #255 and spin up astro.pangeo.io. Is anyone interested in taking the lead on this?

martindurant commented 6 years ago

I assume @SimonKrughoff would also be interested in an astro.pangeo.io.

MSKirk commented 6 years ago

I like your idea, @rabernat, to shoot for a Nov. 14 demo. What do you need from me to help you move forward?

rabernat commented 6 years ago

There are two ways we could pursue this:

1. Spin up a dedicated astro.pangeo.io cluster (coordinating with #255) and run the example there.
2. Put the example in a GitHub repo and run it on binder.

The first one is definitely more work.

In either case, you will probably want to use intake-astro to read the SDO_AIA_Images data you placed in the pangeo-data GCS bucket as dask arrays. From that point you could simply visualize the data interactively (perhaps with holoviews / datashader) or you could actually do some scientific analysis. Do you use sunpy? Do we know whether it plays well with dask?

If you place your example in a github repo, together with an appropriate environment.yaml, it should be easy to run with binder.

mrocklin commented 6 years ago

In either case, you will probably want to use intake-astro to read the SDO_AIA_Images data you placed in the pangeo-data GCS bucket as dask arrays. From that point you could simply visualize the data interactively (perhaps with holoviews / datashader) or you could actually do some scientific analysis. Do you use sunpy? Do we know whether it plays well with dask?

To be clear, tools like intake and holoviews/datashader are totally optional. In your position I would stick to whatever technologies you're comfortable with, probably adding in as few as possible to start. I suspect that adding Dask and Kubernetes is enough novelty when getting started.

rabernat commented 6 years ago

Fair point about holoviews/datashader. But intake solves a key problem: how to load the data lazily from GCS. I just tried it, and it works beautifully:

source = intake.open_fits_array('gcs://pangeo-data/SDO_AIA_Images/094/*.fits')
data = source.to_dask()

(You have to conda install -c intake intake-astro first.)

mrocklin commented 6 years ago

As you like

martindurant commented 6 years ago

Obviously I will also try to help where I can - but time is always limited. The code in intake-astro was partly derived from work at SciPy with some of the people on this thread, so they may have equivalent functionality elsewhere that I am not aware of.

rabernat commented 6 years ago

All of the following works beautifully for me. I am currently browsing the sun. ;)

import intake
import xarray as xr
import numpy as np
import holoviews as hv
from holoviews.operation.datashader import regrid
hv.extension('bokeh')

source = intake.open_fits_array('gcs://pangeo-data/SDO_AIA_Images/094/*.fits')
data = source.to_dask()
# the only trick was first wrapping in xarray...otherwise holoviews doesn't seem to work
xrds = xr.DataArray(data, dims=['time', 'y', 'x'],
                    coords={'time': ('time', np.arange(1800)),
                     'x': ('x', np.arange(4096)),
                     'y': ('y', np.arange(4096))},
                    name='fits_data')
hvds = hv.Dataset(xrds)
im = hvds.to(hv.Image, kdims=['x', 'y'], dynamic=True)

# new cell
%opts Image [width=800, height=800 logz=True] (cmap='magma')
regrid(im, dynamic=True)

[image: screenshot of the interactive plot]

wtbarnes commented 6 years ago

See discussion on sunpy/sunpy#2715 re: lazily loading FITS. SunPy Map objects can also be constructed from a Dask array and an appropriate metadata object. intake-astro might be a drop-in replacement for all of this though!

At the SciPy meeting back in July, @Cadair and I (with the help of @martindurant and @mrocklin) used pangeo and some of the AIA data on GCS to do some pretty cool data pipelining, though we never got around to writing it up. The basic workflow was:

  1. Lazily load a sequence (in time) of AIA images for multiple instrument channels
  2. "Prep" the images (i.e. scale to common resolution)
  3. "Derotate" (account for the differential rotation of the Sun so that a single pixel corresponded to the same physical location at each timestep)

After doing all of that, we did a correlation analysis between multiple AIA channels using dask.array.fft. All of the above are steps that solar physicists go through every day in their analysis of this kind of data. Very often, these steps are done serially, on desktops and even laptops.
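
As a rough illustration of steps 1-3 plus the FFT analysis (synthetic data; the real prep/derotate steps use solar-physics routines not shown here):

import dask.array as da

def prep(img):
    # placeholder for rescaling each image to a common resolution
    return img / img.mean()

def derotate(img):
    # placeholder for undoing differential solar rotation
    return img

# two synthetic "channels" with dimensions (time, y, x), one chunk per image
ch_a = da.random.random((32, 512, 512), chunks=(1, 512, 512))
ch_b = da.random.random((32, 512, 512), chunks=(1, 512, 512))
ch_a = ch_a.map_blocks(prep).map_blocks(derotate)
ch_b = ch_b.map_blocks(prep).map_blocks(derotate)

# rechunk so each task sees a full time series per pixel block,
# then cross-correlate the channels along the time axis via FFT
fa = da.fft.fft(ch_a.rechunk((32, 64, 64)), axis=0)
fb = da.fft.fft(ch_b.rechunk((32, 64, 64)), axis=0)
xcorr = da.fft.ifft(fa * fb.conj(), axis=0).real
peak_lag = xcorr.argmax(axis=0).compute()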

A demo that showed this could all be done in parallel and at scale would have a huge "wow" factor I think.

If this example sounds interesting, I'd be happy to help put it together.

mrocklin commented 6 years ago

:)

The next step here might be to make a binder?

Scientifically I suspect that it would be interesting to rechunk into blocks and then do some time-series analysis on each pixel. I'll let the actual scientists say what they think though :)


rabernat commented 6 years ago

Scientifically I suspect that it would be interesting to rechunk into blocks and then do some time-series analysis on each pixel.

FWIW, this is structurally very similar to how we often analyze Earth imagery data. It is really fun to rechunk, persist into a dask cluster, and then do the timeseries analysis in parallel.
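
For concreteness, that pattern looks something like this (array sizes and chunking are illustrative; Client() with no arguments starts a local cluster, while on pangeo you would connect to the running one):

import dask.array as da
from dask.distributed import Client

client = Client()
cube = da.random.random((1800, 512, 512), chunks=(1, 512, 512))

# rechunk from one-image-per-task to full-time-series-per-pixel-block,
# and keep the result in distributed memory
ts = cube.rechunk((1800, 64, 64)).persist()

# per-pixel time-series operations are now cheap, e.g. a temporal mean
mean_img = ts.mean(axis=0).compute()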

A demo that showed this could all be done in parallel and at scale would have a huge "wow" factor I think.

I agree. For my part, I am obviously not qualified to develop the heliophysics-specific analysis, but I can advise on how we approach similar tasks from the earth-science perspective.

I support @mrocklin's suggestion to make a binder with the necessary dependencies (e.g. intake-astro, sunpy, etc.) and then develop progressively more complex workflows from there. I have created https://github.com/pangeo-data/pangeo-astro-examples as a place for this. I also created the @pangeo-data/pangeo-astro team and invited everyone on this thread (and more) to it.