jbcurtin / astro-cloud

MIT License

collaborate with Pangeo? #1

Open rabernat opened 3 years ago

rabernat commented 3 years ago

I just discovered this project and the cloud-fits repo. It looks like a great contribution!

I wanted to invite you to collaborate with the Pangeo community around cloud-native patterns for scientific data analysis. We mostly come from the geospatial / weather / climate world. A lot of our effort has gone into developing cloud-native workflows that scale with Dask for distributed processing, and the file format is an important part of this.

We have written a bit about our approach to cloud-based data here: http://pangeo.io/data.html#data-in-the-cloud. We use the Zarr format heavily.

We had some explorations with astro data early on in our project here: https://github.com/pangeo-data/pangeo-astro-examples

I'm tagging @martindurant, an ex-astronomer who now maintains filesystem-spec, s3fs, gcsfs, and other such tools which are crucial for getting good performance with data in object stores. Martin has often speculated about how FITS could be adapted to work better with object storage. I imagine he might want to say hello.

Thanks again for your work on open source! If you think a conversation with Pangeo folks would be helpful to your goals, we'd be happy to set something up.

martindurant commented 3 years ago

Hello!

Indeed, you might find some inspiration from intake-astro which contains some code for parallel loading of FITS with Dask from any fsspec URLs, including cloud object stores and more. This was to a great extent based on conversations with sunpy people such as @Cadair (who I believe have also developed the idea further).

jbcurtin commented 3 years ago

hey @rabernat and @martindurant. Thank you for the pointers. I've been focused on writing a CI/CD solution these past few weeks and am just now getting around to reading my GitHub notifications. I'll be free to follow up on these leads a few weeks from now. Thank you again!

rabernat commented 3 years ago

No worries @jbcurtin! Just about everyone is snowed under right now! 🙃

We recently published a preprint you might be interested in: Cloud Native Repositories for Big Scientific Data.

jbcurtin commented 3 years ago

hey @rabernat, thank you for sending over "Cloud Native Repositories for Big Scientific Data" (CNRBSD). My background is in MLOps, DevOps, & Data Engineering (ETL). My views align with a lot of the points the CNRBSD paper makes. 👍

I've looked into Zarr, cloud object storage, and a few other solutions for the project that initiated the creation of this GitHub repository, Cloud Optimized FITS. The need arose when we started looking at serving TESS data cubes over AWS Lambda.

The mindset I've adopted for this problem area is that management of ARD (analysis-ready data) can be organized around whatever is specifically required by the tools a scientist uses: Dask(-ML), Kubeflow, Jupyter Notebooks, Astropy, Python, etc. Depending on the tools in play, an appropriate implementation can be written in a small number of sprints to accommodate the scientist and optimize loading data into the algorithms they create.

The most basic blueprint of this resource-serving structure is the HTTP(S) protocol as implemented in Nginx, Caddy, and other comparable services capable of serving static files, including cloud object storage. By focusing on the HTTP(S) headers provided, we gain the basic functions required to perform more complex actions when serving data over larger networks (the Internet). While working with this idea, we looked at cloud providers (AWS, GCP, DigitalOcean) and did not consider the specialized requirements of HPC environments. Additionally, the abstraction layer we're looking to leverage has been implemented in most, if not all, modern HTTP(S) servers such as Nginx, Apache2, Jetty, JBoss, and Caddy.
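To make the HTTP(S)-header idea concrete: FITS files are laid out in fixed 2880-byte blocks, so a client can ask any static-file server for exactly the blocks it needs via a Range header. A minimal stdlib-only sketch (the URL below is a placeholder, not a real endpoint):

```python
# Sketch: compute block-aligned HTTP Range headers for FITS files.
# FITS files are laid out in fixed 2880-byte blocks, so a client can
# request exactly the blocks it needs from any static-file server.
import urllib.request

FITS_BLOCK = 2880

def block_range(first_block, n_blocks=1):
    """Return an HTTP Range header value covering n_blocks FITS blocks."""
    start = first_block * FITS_BLOCK
    end = start + n_blocks * FITS_BLOCK - 1  # Range end is inclusive
    return f"bytes={start}-{end}"

# A request for the primary header (first block) of a remote file might
# look like this; the URL is hypothetical and the request is never sent.
req = urllib.request.Request(
    "https://example.com/data/cube.fits",
    headers={"Range": block_range(0)},
)
print(req.get_header("Range"))  # bytes=0-2879
```

Any server that honors Range requests (which includes all the object stores and HTTP servers mentioned above) can then return 206 Partial Content for just those bytes.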

astro-cloud is meant to encapsulate the Python client logic and the DevOps server logic, providing a complete solution when deploying into the cloud or onto a local network of machines (Kubernetes). It could even be deployed into Lambda-like environments, reducing the infrastructure cost of serving data to near $0.
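The Lambda-like deployment can be sketched as a handler that serves byte ranges out of object storage. Everything below is hypothetical: the event shape, the object key, and the in-memory dict standing in for a real object-storage client:

```python
# Sketch of a Lambda-style handler that serves byte ranges of a FITS
# file out of object storage. The event shape, key layout, and the
# in-memory "store" below are all invented for illustration.
STORE = {"tess/cube.fits": bytes(range(256)) * 100}  # fake object store

def handler(event, context=None):
    """Return the requested byte range of the named object."""
    data = STORE[event["key"]]
    start, end = event["start"], event["end"]  # end is exclusive here
    return {
        "statusCode": 206,  # Partial Content
        "body": data[start:end],
    }

resp = handler({"key": "tess/cube.fits", "start": 0, "end": 16})
print(resp["statusCode"], len(resp["body"]))  # 206 16
```

Because the handler only touches the requested slice, a real implementation could serve pieces of a multi-GB data cube without ever holding the whole file in memory.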

I can't say officially, but I'd be willing to wager that everyone supporting the research that has gone into this GitHub repository would welcome the idea of collaborating with Pangeo. Is there someone I can reach out to via email? Where should I look on the Internet to learn more about collaborating with Pangeo?

martindurant commented 3 years ago

(your cloud-optimised-fits link points to a non-public google doc)

Note that if your data is stored on a server supporting range requests or cloud store, you can already load FITS files without any special handling, e.g.,

import fsspec
from astropy.io import fits

with fsspec.open("gcs://pangeo-data/SDO_AIA_Images/094/aia_lev1_94a_2012_09_23t05_38_37_12z_image_lev1_fits.fits") as f:
    print(fits.getheader(f))

(Notice that this data is in a Pangeo bucket.) The efficiency of doing this will depend on the data layout in the file versus the caching and access pattern of whatever the data is used for.
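The layout point is easy to see from the FITS format itself: headers are 80-character ASCII cards packed into 2880-byte blocks, so a reader only ever needs the first few kilobytes of a file to get at the primary header. A minimal stdlib-only sketch (the header content below is fabricated, not from a real file):

```python
# Parse 80-character header cards out of a FITS-style 2880-byte block.
# The header content here is fabricated for illustration.
cards = [
    "SIMPLE  =                    T / conforms to FITS standard",
    "BITPIX  =                    8 / array data type",
    "NAXIS   =                    0 / number of array dimensions",
    "END",
]
# Pack the cards into one 2880-byte block, space-padded as FITS requires.
block = "".join(c.ljust(80) for c in cards).ljust(2880).encode("ascii")

def read_cards(block):
    """Yield header cards from a FITS header block, stopping at END."""
    for i in range(0, len(block), 80):
        card = block[i:i + 80].decode("ascii").rstrip()
        if card == "END":
            return
        yield card

for card in read_cards(block):
    print(card)
```

A remote reader that fetches this one block gets the full primary header; whether the *data* reads are efficient then depends on how the HDUs are arranged relative to the access pattern.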

If you read the thread https://github.com/pangeo-data/pangeo/issues/269, you will see that many cloud big-data operations have been demonstrated with FITS data, and there is no need to invent new server technology or formats.

As well as the code I linked above to fetch extension data from multiple FITS files in parallel, I would also like to draw your attention to https://github.com/intake/fsspec-reference-maker/, which is an effort to extract metadata and/or offsets to binary blocks within cloud-accessible files. The process was conceived with HDF5 files in mind, as a way to make them directly readable with zarr.

Something very similar could work for FITS files (so long as there is no whole-file compression). The interesting difference would be
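For flavor, the version-1 reference sets this tooling emits are just JSON mappings from keys to either inline data or [url, offset, length] pointers into the original file; a FITS variant could point keys at HDU blocks. A hand-built sketch (the bucket, paths, offsets, and lengths are all invented):

```python
import json

# Hand-built sketch of a version-1 reference set, in the style of
# fsspec-reference-maker / ReferenceFileSystem. The bucket, keys,
# offsets, and lengths below are invented for illustration.
refs = {
    "version": 1,
    "refs": {
        # Small metadata can be stored inline as a string.
        ".zattrs": "{}",
        # Larger blocks are [url, offset, length] pointers into the
        # original file -- here, a hypothetical FITS file's HDUs.
        "primary/0": ["s3://some-bucket/cube.fits", 0, 2880],
        "image/0": ["s3://some-bucket/cube.fits", 2880, 5760],
    },
}

# The reference set round-trips through JSON, which is what makes it
# cheap to publish alongside the original data.
loaded = json.loads(json.dumps(refs, indent=2))
print(loaded["refs"]["primary/0"])
```

A zarr-aware reader handed such a mapping can then issue exactly the range requests described earlier, without the original file ever being converted.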

@DPeterK : this all sounds like the many-small-files problem you are facing, and it occurs to me that, after all, I can think of a way to write your method in terms of an fsspec implementation (or modification to ReferenceFileSystem), where each block returns bytes, as expected by a filesystem, but the bytes are generated by reading the original data files.

martindurant commented 3 years ago

and here is the example from your README, again with no special code, and without downloading the 44 GB file

In [1]: import fsspec

In [3]: import astropy.io.fits as fits

In [4]: f = fsspec.open("s3://stpubdata/tess/public/mast/tess-s0022-4-4-cube.fits", requester_pays=True).open()  # or s3fs.S3FileSystem(...).open(path)

In [5]: hdulist = fits.open(f)

In [6]: list(hdulist)
Out[6]:
[<astropy.io.fits.hdu.image.PrimaryHDU at 0x119f78910>,
 <astropy.io.fits.hdu.image.ImageHDU at 0x119fa5490>,
 <astropy.io.fits.hdu.table.BinTableHDU at 0x11b702250>]

In [7]: heads = [h.header for h in hdulist]

In [9]: heads[0]
Out[9]:
SIMPLE  =                    T / conforms to FITS standard
BITPIX  =                    8 / array data type
NAXIS   =                    0 / number of array dimensions
EXTEND  =                    T
NEXTEND =                    2 / number of standard extensions
EXTNAME = 'PRIMARY '           / name of extension
EXTVER  =                    1 / extension version number (not format version)
SIMDATA =                    F / file is based on simulated data
ORIGIN  = 'STScI/MAST'         / institution responsible for creating this file
DATE    = '2020-04-08'         / file creation date.
TSTART  =    1899.312389541680 / observation start time in BTJD
TSTOP   =    1926.478743362784 / observation stop time in BTJD
DATE-OBS= '2020-02-19T19:28:41.272' / TSTART as UTC calendar date
DATE-END= '2020-03-17T23:28:14.243' / TSTOP as UTC calendar date
CREATOR = '13374 FfiExporter'  / pipeline job and program used to produce this f
PROCVER = 'spoc-4.0.27-20200326' / SW version
FILEVER = '1.0     '           / file format version
TIMVERSN= 'OGIP/93-003'        / OGIP memo number for file format
TELESCOP= 'TESS    '           / telescope
INSTRUME= 'TESS Photometer'    / detector type
DATA_REL=                   31 / data release version number
ASTATE  =                    T / archive state F indicates single orbit processi
SCCONFIG=                  174 / spacecraft configuration ID
RADESYS = 'ICRS    '           / reference frame of celestial coordinates
FFIINDEX=                32249 / number of FFI cadence interval
EQUINOX =               2000.0 / equinox of celestial coordinate system
CRMITEN =                    T / spacecraft cosmic ray mitigation enabled
CRBLKSZ =                   10 / [exposures] s/c cosmic ray mitigation block siz
CRSPOC  =                    F / SPOC cosmic ray cleaning enabled
SECTOR  =                   22 / Observing sector
CAMERA  =                    4 / Camera number
CCD     =                    4 / CCD chip number
jbcurtin commented 3 years ago

hey @martindurant, thank you for sending this overview of fsspec. fsspec provides a lot of functionality very quickly; I like that it installs easily and gives appropriate error messages for optional functionality. We could utilize parts of this technology, and I'll bring it up with the team when we next talk.

( Please request access to the Doc. I'll run it past managers to make sure the info can be shared publicly before I open it up. )

jbcurtin commented 3 years ago

hey @martindurant, managers have approved opening up the documents. I've updated the comment above to point to documents meant for the public.