Some fitsio functions work with any file-like objects, some do not. This means that we could relatively easily construct dask-arrays from sets of fits files:
# assuming `gcs` is a gcsfs.GCSFileSystem instance
with gcs.open('pangeo-data/SDO_AIA_Images/094/aia_lev1_94a_2012_09_23t05_38_37_12z_image_lev1_fits.fits') as f:
    img = fits.getdata(f)
edit: let me flesh that out a bit: we could use
import dask.bytes
import dask.array as da
import numpy as np
from dask import delayed
from astropy.io import fits

files = dask.bytes.open_files('gcs://pangeo-data/SDO_AIA_Images/094/*.fits')

with files[0] as f:
    # grab one header; assume we know which extension we need
    head = fits.getheader(f)

naxes = head['NAXIS']
dtype = np.float32  # not sure how to derive from header
shape = head['NAXIS2'], head['NAXIS1']  # 2D

def file_to_arr(fn):
    with fn as f:
        return fits.getdata(f)

arrs = [da.from_delayed(delayed(file_to_arr)(fn), shape=shape, dtype=dtype)
        for fn in files]
arr = da.stack(arrs)
This is awesome! I will have a go at utilizing some of my DKIST code for this tomorrow.
This is great! I welcome the involvement of astronomers here.
As the number of datasets continues to grow, it would be good to establish an authoritative catalog. Hopefully this can be intake (#39), but I doubt it works with the FITS format.
I have about 460 GB of data (12,600 images) queued up to get uploaded to the pangeo-data drive. These images haven't been thoughtfully chosen, but were sitting on an external hard disk of mine. They represent 7 different EUV wavelengths in which the SDO AIA instrument observes the Sun. Let me know if you have any data-specific questions.
There is no plugin for FITS in Intake yet, this is a question of demand :)
The outstanding problem is actually the coordinate systems (WCS) that FITS typically bundles, which in general are not aligned with the axes, or may not even be linear. It would be OK to have an array-and-WCS pair, but that model doesn't easily fit into the xarray coordinate-axes model.
@MSKirk are they level 1.5 images? So effectively all share the same WCS? (Along the time axis)
I suggest that we don't bother with intake for the moment. I don't think that's on the critical path to exploring things here.
I think that the next step is for people to write a couple of simple analyses using this data so that others see what is possible, and then hopefully build off of that original analysis.
@Cadair, these are level 1 images but they are during a time of stability for the spacecraft. So any corrections that the prep routines make are going to be relatively small. When I am back to my desk, I can work on a level 1.5 dataset.
Actually, straightforward fits.open also seems to work on a gcsfs or dask file, so we can use hdu.section[], allowing for chunking within files, not just between them.
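For illustration, a minimal sketch of that idea (assuming the image lives in the primary HDU; not code from this thread):
# Sketch: read only a sub-region of a remote FITS image via hdu.section,
# so the full data array never has to be downloaded.
import gcsfs
from astropy.io import fits

gcs = gcsfs.GCSFileSystem()
path = 'pangeo-data/SDO_AIA_Images/094/aia_lev1_94a_2012_09_23t05_38_37_12z_image_lev1_fits.fits'
with gcs.open(path) as f:
    with fits.open(f) as hdul:
        hdu = hdul[0]                        # assumes the image is in the primary HDU
        cutout = hdu.section[0:512, 0:512]   # only this slice is read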
So I have been playing for the last hour or so and I have got as far as this: http://pangeo.pydata.org/user/cadair/lab/tree/examples%2FAIA1.ipynb (I don't know if you can actually follow that link?)
I made a conda env to run the thing, it seems to have issues with that though, what's the best way to handle deps and such?
We cannot follow that link - perhaps make it a gist?
You should be able to conda/pip install into your notebook directory. However, in order to use distributed workers you will also need to tell them which libraries to install when they start up. You can do this easily by modifying the worker-template.yaml file in your home directory. You'll want to extend the env: values as follows:
env:
  - name: GCSFUSE_BUCKET  # this was here before
    value: pangeo-data
  - name: EXTRA_PIP_PACKAGES
    value: foo bar git+https://github.com/user/package.git@branch
  - name: EXTRA_CONDA_PACKAGES
    value: baz quuz -c conda-forge
I generally prefer to use pip packages over conda in this case, because the installation happens for every worker each time it starts up, and conda takes a while.
I made it work :tada: https://gist.github.com/Cadair/f65fbcb593c6d70eb8f3f3e48324f7c3
I used a cluster to read all the headers of all the files and then used this information to construct a WCS object and fed it all into NDCube, so we now have a coordinate aware Dask array.
I did briefly try to construct a Dask DataFrame out of all the headers in all the FITS files, but I didn't get it to work and I have a limited amount of time for this today!
(Oh, the pip requirement is just ndcube, as it pulls in everything else.)
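For reference, a rough sketch of that workflow (assumptions only; the gist linked above has the actual code):
# Sketch: read FITS headers in parallel on the cluster, build a WCS from one of
# them, and wrap an image in a coordinate-aware NDCube.
import dask.bytes
from dask.distributed import Client
from astropy.io import fits
from astropy.wcs import WCS
from ndcube import NDCube

client = Client()  # in practice, connect to the existing Dask cluster
files = dask.bytes.open_files('gcs://pangeo-data/SDO_AIA_Images/094/*.fits')

def read_header(of):
    with of as f:
        return fits.getheader(f)

headers = client.gather(client.map(read_header, files))

# For a single 2-D image the header WCS can be used directly; the gist builds a
# 3-D (time + space) WCS to wrap the whole stacked Dask array.
with files[0] as f:
    img = fits.getdata(f)
cube = NDCube(img, wcs=WCS(headers[0]))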
@Cadair, in this particular example the coordinates are well aligned with the axes, so this would be a good case for xarray as the container, especially if you were to have multiple cubes for each pass-band. Not that I have anything against ndcube; I just don't know (yet) what advantages it has.
Does ndcube reference the dask array lazily, or does the constructor ndcube.NDCube(arr, wcs=wcs) embed an asarray that would force the array to be materialized?
While the coordinates are in practice well aligned, the spatial coordinates are subject to a TAN map projection, so they are not really appropriate for xarray, like we discussed on the call. NDCube has just been built to be coordinate aware through astropy.wcs so that it can do the coordinate transforms in the "native" way.
NDCube should only reference the array lazily; I haven't encountered anywhere where it doesn't yet.
fed it all into NDCube, so we now have a coordinate aware Dask array.
As a physical oceanographer / climate modeler, it is very interesting to me to learn how other disciplines handle their datasets. In this case, ndcube sounds quite similar to xarray, so there is possibly some opportunity for synergy / collaboration here.
the spatial coordinates are subject to a TAN map projection, so they are not really appropriate for xarray
This is not fundamentally different from the situation with geographic data. Earth observations and models use different projections and coordinate systems. We often have to transform from one coordinate system to another. Xarray is not limited to cartesian geometry--it supports auxiliary non-dimension coordinates for the purpose of storing such information. However, since these projection / regridding operations are often domain specific, they are not included in xarray. We use packages like cartopy or xesmf to handle the projection problem; these packages can work directly with xarray data structures, making them more flexible and less error prone.
People often assume xarray's data model is more limited than it really is. I think this is a problem with our documentation. It's simply a general purpose container for storing labeled, multi-dimensional arrays. Being built closely around dask from the beginning, however, gives it some real advantages.
To clarify, I am not suggesting anyone abandon widely established and successful packages such as ndcube. Just commenting out of interest for the sake of discussion. I'm sure you're very happy with your software stack. 😉
I am very interested in potential uses of xarray for this type of data. I am also very interested to hear how you might handle the problem of two of the axes having their coordinates coupled (in this case via a map projection), because the last time I looked into xarray I couldn't find a way to reconcile this with how xarray works.
ndcube is still quite new (although built on top of older astropy tech), and we did evaluate xarray for the problem when we started working on it a little over a year ago. I think one of the major limitations could be the practical requirement to use astropy WCS functionality for the computation of some or all of the coordinate axes.
I have now updated the gist with what I think is a good implementation of "don't load the whole FITS array into memory" when doing a slicing operation. I would love to hear what @martindurant or anyone else thinks about this way of doing it.
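One way this could look (a speculative sketch, not the gist's actual implementation) is to route each chunk read through hdu.section inside a delayed function:
# Sketch: wrap per-region reads in dask.delayed so slicing only transfers the
# requested portion of each FITS file.
import dask
import dask.array as da
import gcsfs
from astropy.io import fits

gcs = gcsfs.GCSFileSystem()

@dask.delayed
def read_section(path, yslice, xslice):
    with gcs.open(path) as f:
        with fits.open(f) as hdul:
            return hdul[0].section[yslice, xslice]  # assumes the image is in the primary HDU

path = 'pangeo-data/SDO_AIA_Images/094/aia_lev1_94a_2012_09_23t05_38_37_12z_image_lev1_fits.fits'
quadrant = da.from_delayed(read_section(path, slice(0, 2048), slice(0, 2048)),
                           shape=(2048, 2048), dtype='float32')  # dtype assumed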
I am also very interested to hear how you might handle the problem of two of the axes having their coordinates coupled (in this case via a map projection), because the last time I looked into xarray I couldn't find a way to reconcile this with how xarray works.
So in this case, we would distinguish between the logical coordinates of the array (which must be 1D; one per axis) and the physical coordinates in space (possibly 2D or higher dimensionality). There is an extensive example of this in the xarray docs: http://xarray.pydata.org/en/latest/examples/multidimensional-coords.html
A typical climate dataset (used in the example above) would have the following structure:
<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) datetime64[ns] 1980-09-16T12:00:00 1980-10-17 ...
    xc       (y, x) float64 189.2 189.4 189.6 189.7 189.9 190.1 190.2 190.4 ...
    yc       (y, x) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 18.25 ...
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...
Tair is the air temperature, and its dimensions are (time, y, x). There aren't any actual coordinates for x and y, as they are just logical array indices. However, there are additional "non-dimension coordinates" xc and yc, both with dimensions (y, x), which represent the latitudes and longitudes of the points. Working with cartopy, xarray can plot the data in geographical coordinates like this:
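Something like the following (a sketch using the same xarray tutorial dataset; the exact plotting calls here are assumed, not quoted from the thread):
# Sketch: plot a variable with 2-D lat/lon coordinates on a map projection.
import xarray as xr
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

ds = xr.tutorial.open_dataset('rasm')  # the dataset whose repr is shown above
ax = plt.axes(projection=ccrs.Orthographic(-80, 35))
ds.Tair.isel(time=0).plot.pcolormesh(ax=ax, x='xc', y='yc',
                                     transform=ccrs.PlateCarree())
ax.coastlines()
plt.show()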
We are still a long way from full "multidimensional coordinate" support. These open issues describe some improvements we would like to make: pydata/xarray#2028, pydata/xarray#1325, pydata/xarray#1553. Progress on these issues is limited by developer time; that's why I am eager to reach out to scientific communities with similar needs to see if there are ways we can work together. Perhaps the transformation routines you have in ndcube could help resolve some xarray issues!
Will any of you folks be at SciPy? I personally won't, but @jhamman and lots of other pangeo / xarray contributors will. Perhaps we should consider a Birds of a Feather proposal on labeled nd-arrays.
That's really cool! I will have to take a look at some point. I think what astro folks would need is for xc and yc in your example to be calculated on demand from a function, rather than having to store them as arrays.
I will be at SciPy this year, a BoF sounds like a pretty good idea!
+1 for a BoF event at Scipy. Proposals are due June 27.
@Cadair, I just checked the cluster status and I noticed something odd
$ kubectl get pods --namespace=pangeo
NAME READY STATUS RESTARTS AGE
dask-cadair-5014ba80-f2r4lw 0/1 OutOfcpu 0 1h
dask-cadair-5014ba80-fbczvs 0/1 OutOfcpu 0 1h
dask-cadair-5014ba80-fd5d8j 0/1 OutOfcpu 0 1h
dask-cadair-5014ba80-fgsczn 0/1 OutOfcpu 0 1h
dask-cadair-5014ba80-fgvblt 0/1 OutOfcpu 0 1h
dask-cadair-5014ba80-fxlrvj 0/1 OutOfcpu 0 1h
Did you notice anything strange? I've not encountered that status before.
Yeah I have been having a few problems with things behaving strangely on me, I am not really sure what to report as I don't fully understand what's going on :laughing:
My understanding of kubernetes is also low. But my impression is that your dask worker pods are exceeding their cpu quota and being limited or stopped. I'm not sure how this is possible. Does your code use multithreading?
No, I am just using plain dask array and a map with a plain python function (doing bytes io with dask) and gather on the distributed client.
On OutOfcpu: is https://github.com/kubernetes/kubernetes/issues/42433 relevant?
I'm currently at the NSF EarthCube meeting, listening to a talk by Dean Pesnell, the project scientist for the Solar Dynamics Observatory. Can someone explain to me whether and how the dataset that has been placed in pangeo relates to the SDO?
They are SDO images. I am also at EarthCube. We should chat.
Just FYI, the upload of SDO AIA images is complete. There are 12,600 images in total representing 7 different ultraviolet light wavelength filters (broken out by folder) and 460 GB of data. This set is about 0.004% of the total data archive. Let me know if you have any questions about them.
Pretty!
Nice!
That's really cool! I will have to take a look at some point. I think what astro folks would need is for xc and yc in your example to be calculated on demand from a function rather than having to store them as arrays.
Lazy coordinates are also quite relevant for climate science. For example, I'm pretty sure that Iris has support for this sort of coordinate.
The logical way to do this in xarray would be to write xc and yc as lazy computations with dask arrays. However, to make this convenient, dask really needs to rewrite expressions like ufunc(data)[key] into ufunc(data[key]) (https://github.com/dask/dask/issues/746), so you don't need to compute a complete chunk when you only index part of it.
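A rough sketch of what that could look like (names and the pixel-to-world function below are hypothetical, and note the indexing caveat above):
# Sketch: attach 2-D coordinates computed lazily with dask, so the values are
# only evaluated when actually accessed.
import dask.array as da
import xarray as xr

ny, nx = 4096, 4096
data = da.zeros((ny, nx), chunks=1024)  # stand-in for the image data

jj, ii = da.meshgrid(da.arange(ny, chunks=1024),
                     da.arange(nx, chunks=1024), indexing='ij')

def pixel_to_world(j, i):
    # hypothetical placeholder for the real pixel -> world transform (e.g. a WCS)
    return 0.6 * (i - nx / 2), 0.6 * (j - ny / 2)

xc, yc = pixel_to_world(jj, ii)  # still lazy dask arrays
arr = xr.DataArray(data, dims=['y', 'x'],
                   coords={'xc': (('y', 'x'), xc), 'yc': (('y', 'x'), yc)})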
It seems like the next thing to do here is for someone (preferably one of the professional astronomers present in this issue) to have a go at using http://pangeo.pydata.org to run a simple computation and report back any challenges met (like missing software packages). Ideally we would iterate a bit here and then produce a small notebook to be placed into the official examples directory. We would then point other astronomers to that directory.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Sorry I have been distracted by other projects in the last couple of months. @Cadair have you been continuing to work with solar images in dask?
I hope you are following https://github.com/ContinuumIO/intake-astro , which now has a partitioned FITS array and table reader that may already be generally useful.
Hi everyone. I was invited to give a presentation at an EarthCube workshop by Gelu Nita entitled "Towards Integration of Heliophysics, Data, Models, and Analysis Tools" at the New Jersey Institute of Technology on Nov. 14. I would really like to be able to show off a simple example of pangeo integrating with solar astronomy. @MSKirk, can we revive this issue and continue to pursue this? It sounds like there has been some progress on the dask side.
I recommend we coordinate with #255 and spin up astro.pangeo.io. Is anyone interested in taking the lead on this?
I assume @SimonKrughoff would also be interested in an astro.pangeo.io.
I like your idea, @rabernat, to shoot for a Nov. 14 demo. What do you need from me to help you move forward?
There are two ways we could pursue this:
The first one is definitely more work.
In either case, you will probably want to use intake-astro to read the SDO_AIA_Images data you placed in the pangeo-data GCS bucket as dask arrays. From that point you could simply visualize the data interactively (perhaps with holoviews / datashader) or you could actually do some scientific analysis. Do you use sunpy? Do we know whether it plays well with dask?
If you place your example in a github repo, together with an appropriate environment.yaml, it should be easy to run with binder.
In either case, you will probably want to use intake-astro to read the SDO_AIA_Images data you placed in the pangeo-data GCS bucket as dask arrays. From that point you could simply visualize the data interactively (perhaps with holoviews / datashader) or you could actually do some scientific analysis. Do you use sunpy? Do we know whether it plays well with dask?
To be clear, tools like intake and holoviews/datashader are totally optional. In your position I would stick to whatever technologies you're comfortable with, probably adding in as few as possible to start. I suspect that adding Dask and Kubernetes is enough novelty when getting started.
Fair point about holoviews/datashader. But intake solves a key problem of how to load the data lazily from GCS. I just tried it and it works beautifully:
import intake

source = intake.open_fits_array('gcs://pangeo-data/SDO_AIA_Images/094/*.fits')
data = source.to_dask()
(Have to first conda install -c intake intake-astro.)
As you like
Obviously I will also try to help where I can - but time is always limited. The code in intake-astro was partly derived from the SciPy work with some of the people on this thread, so they may have equivalent functionality elsewhere that I am not aware of.
All of the following works beautifully for me. I am currently browsing the sun. ;)
import intake
import xarray as xr
import numpy as np
import holoviews as hv
from holoviews.operation.datashader import regrid
hv.extension('bokeh')

source = intake.open_fits_array('gcs://pangeo-data/SDO_AIA_Images/094/*.fits')
data = source.to_dask()

# the only trick was first wrapping in xarray...otherwise holoviews doesn't seem to work
xrds = xr.DataArray(data, dims=['time', 'y', 'x'],
                    coords={'time': ('time', np.arange(1800)),
                            'x': ('x', np.arange(4096)),
                            'y': ('y', np.arange(4096))},
                    name='fits_data')
hvds = hv.Dataset(xrds)
im = hvds.to(hv.Image, kdims=['x', 'y'], dynamic=True)

# new cell
%opts Image [width=800, height=800 logz=True] (cmap='magma')
regrid(im, dynamic=True)
See discussion on sunpy/sunpy#2715 re: lazily loading FITS. SunPy Map objects can also be constructed from a Dask array and an appropriate metadata object. intake-astro might be a drop-in replacement for all of this though!
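A minimal sketch of that construction (assumptions only; whether downstream operations stay lazy depends on how sunpy handles the array):
# Sketch: build a SunPy Map from a Dask array plus a FITS header read remotely.
import gcsfs
import dask.array as da
import sunpy.map
from astropy.io import fits

gcs = gcsfs.GCSFileSystem()
path = 'pangeo-data/SDO_AIA_Images/094/aia_lev1_94a_2012_09_23t05_38_37_12z_image_lev1_fits.fits'
with gcs.open(path) as f:
    header = fits.getheader(f)  # assumes the needed metadata is in the primary header

lazy_data = da.zeros((4096, 4096), chunks=1024)  # stand-in for the lazily loaded image
smap = sunpy.map.Map(lazy_data, header)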
At the SciPy meeting back in July, @Cadair and I (with the help of @martindurant and @mrocklin) used pangeo and some of the AIA data on GCS to do some pretty cool data pipelining, though we never got around to writing it up. The basic workflow was:
After doing all of that, we did a correlation analysis between multiple AIA channels using dask.array.fft. All of the above are steps that solar physicists go through every day in their analysis of this kind of data. Very often, these steps are done in serial and on desktops or even laptops.
A demo that showed this could all be done in parallel and at scale would have a huge "wow" factor I think.
If this example sounds interesting, I'd be happy to help put it together.
:)
The next step here might be to make a binder?
Scientifically I suspect that it would be interesting to rechunk into blocks and then do some time-series analysis on each pixel. I'll let the actual scientists say what they think though :)
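For instance, a rough sketch of that workflow (assumed details; the loading lines mirror the intake example above):
# Sketch: rechunk so each chunk holds the full time series for a block of pixels,
# persist on the cluster, then do a per-pixel spectral analysis.
import intake
import dask.array as da

source = intake.open_fits_array('gcs://pangeo-data/SDO_AIA_Images/094/*.fits')
data = source.to_dask()  # (time, y, x)

rechunked = data.rechunk({0: -1, 1: 256, 2: 256}).persist()

spectrum = da.fft.rfft(rechunked, axis=0)          # FFT along the time axis
peak_freq = abs(spectrum)[1:].argmax(axis=0) + 1   # dominant non-zero frequency bin per pixel
result = peak_freq.compute()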
Scientifically I suspect that it would be interesting to rechunk into blocks and then do some time-series analysis on each pixel.
FWIW, this is structurally very similar to how we often analyze Earth imagery data. It is really fun to rechunk and persist into a dask cluster and then do the timeseries analysis in parallel.
A demo that showed this could all be done in parallel and at scale would have a huge "wow" factor I think.
I agree. For my part, I am obviously not qualified to develop the heliophysics-specific analysis, but I can advise on how we approach similar tasks from the earth-science perspective.
I support @mrocklin's suggestion to make a binder with the necessary dependencies (e.g. intake-astro, sunpy, etc.) and then develop progressively more complex workflows from there. I have created https://github.com/pangeo-data/pangeo-astro-examples as a place for this. I also created the @pangeo-data/pangeo-astro team and invited everyone on this thread (and more) to it.
@MSKirk was kind enough to upload a number of FITS files that correspond to a solar astronomy dataset.
I can load a little bit of the data with the following steps:
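(The exact steps are not preserved here; a plausible sketch, based on the intake-astro example earlier in the thread, would be:)
# Sketch (assumed): lazily open one wavelength channel from the GCS bucket and
# pull a small piece into memory.
import intake

source = intake.open_fits_array('gcs://pangeo-data/SDO_AIA_Images/094/*.fits')
data = source.to_dask()           # dask array of shape (n_images, 4096, 4096)
first_image = data[0].compute()   # load just the first image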
cc @Cadair, @DanRyanIrish, @martindurant, @mmccarty