intake / intake-astro

Astronomical data sources for Intake
BSD 2-Clause "Simplified" License

Adding fsspec support to `astropy.io.fits` #8

Open barentsen opened 2 years ago

barentsen commented 2 years ago

I'm opening this issue to draw attention to the fact that I opened an Astropy PR today (https://github.com/astropy/astropy/pull/13238) which would add explicit support for opening FITS files with fsspec.

This PR may not necessarily benefit intake-astro, because I believe this package already makes clever use of Astropy's lazy data loading features (i.e., ImageHDU.section and lazy_load_hdus=True).

Perhaps the most important contribution of the Astropy PR, however, is that it adds documentation on the use of fsspec with FITS files, aimed at astronomers. For example, the Astropy PR would add a chapter on the use of fsspec to the Astropy docs which can be previewed here:

https://astropy--13238.org.readthedocs.build/en/13238/io/fits/usage/cloud.html
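A minimal sketch of the kind of usage that chapter covers, assuming the `use_fsspec` and `fsspec_kwargs` arguments to `fits.open` land as proposed in the PR. The helper name `read_cutout` and its parameters are made up for illustration.

```python
def read_cutout(uri, ext, rows, cols, **fsspec_kwargs):
    """Fetch a rectangular cutout of one image extension from a remote FITS file.

    `rows` and `cols` are slice objects; `fsspec_kwargs` are passed through to
    fsspec (for example anon=True for a public S3 bucket).
    """
    # Imported inside the function so this sketch loads without astropy installed.
    from astropy.io import fits

    # Assumption: use_fsspec / fsspec_kwargs behave as described in the PR.
    with fits.open(uri, use_fsspec=True, fsspec_kwargs=fsspec_kwargs) as hdul:
        # .section reads only the bytes covering the slice, not the full array.
        return hdul[ext].section[rows, cols]
```

A call would look like `read_cutout(uri, 1, slice(10, 20), slice(30, 50), anon=True)`, where `uri` points at any FITS file reachable through fsspec.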

/ping @martindurant: I'd be interested to hear your thoughts on the PR. I'm happy to be told this is a bad idea, or have my attention drawn to any incorrect claims I may accidentally have made about fsspec in the Astropy docs.

martindurant commented 2 years ago

I think it's certainly a good idea!

Might I also take this opportunity to ask if you have any thoughts on zarr as a cloud-native storage format for astronomy data or kerchunk (sorry, docs site is temporarily down) of FITS files?

barentsen commented 2 years ago

> Might I also take this opportunity to ask if you have any thoughts on zarr as a cloud-native storage format for astronomy data or kerchunk (sorry, docs site is temporarily down) of FITS files?

I think zarr is brilliant.

Adopting zarr would bring chunked arrays to astronomy, which are sorely missing from the FITS standard. I suppose FITS does offer a tile-based compression scheme, but it is poorly supported, e.g., Astropy does not support tile decompression.

Perhaps more importantly, adopting zarr would make it easier for space and earth sciences to share tools, e.g., xarray and dask would become way more popular in astronomy. If changing data formats were free and easy, one would no doubt argue that NASA could get more science per dollar if it encouraged all its scientists to store N-dimensional data in the same cloud-native way.

For zarr to gain adoption in astronomy, I suppose we'd have to make it as easy as possible for existing tools to accept a zarr.Group as a drop-in replacement for an astropy.io.fits.HDUList object. It is not clear to me what this involves without sitting down and creating tutorials which demonstrate the use of zarr with astronomy tools. I suppose kerchunk would make it significantly easier to create those tutorials using real data. I think someone should fund this exploratory tutorial-writing effort!

martindurant commented 2 years ago

> For zarr to gain adoption in astronomy, I suppose we'd have to make it as easy as possible for existing tools to accept a zarr.Group as a drop-in replacement for an astropy.io.fits.HDUList object.

Yes, I agree, and I don't know either. I would have thought not too much, since mostly it's the downstream array and table classes that matter (tables can come from lots of places, and that's OK). If needed, it's probably not crazy to make a zarr group look like a list, each entry having a .header (the attributes) and .data (the lazy array/rec-array). Since astropy seems to support ASDF OK, I assume there's a template to go by.
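A rough sketch of that emulation, with hypothetical class names (`HDUView`, `GroupAsHDUList`). It only assumes the group offers `keys()` and item access and that each member has an `.attrs` mapping, as `zarr.Group` and `zarr.Array` do. Stand-in objects are used below so the sketch runs without zarr.

```python
from collections.abc import Sequence
from types import SimpleNamespace


class HDUView:
    """One list entry: .header from the node's attributes, .data is the node."""

    def __init__(self, name, node):
        self.name = name
        self._node = node

    @property
    def header(self):
        # Plain-dict copy of the attributes, standing in for a FITS header.
        return dict(self._node.attrs)

    @property
    def data(self):
        # Returned untouched; a zarr array would stay lazy until sliced.
        return self._node


class GroupAsHDUList(Sequence):
    """Hypothetical list-like view over a group of arrays."""

    def __init__(self, group):
        self._group = group
        self._names = list(group.keys())  # fix the ordering once

    def __len__(self):
        return len(self._names)

    def __getitem__(self, key):
        # Accept an integer position or a member name (slices not handled).
        name = key if isinstance(key, str) else self._names[key]
        return HDUView(name, self._group[name])


# Stand-ins for zarr arrays: anything with an .attrs mapping will do here.
primary = SimpleNamespace(attrs={"TELESCOP": "HST"})
sci = SimpleNamespace(attrs={"EXTNAME": "SCI"})
hdul = GroupAsHDUList({"PRIMARY": primary, "SCI": sci})
print([hdu.name for hdu in hdul], hdul["SCI"].header)
```

Whether existing tools would accept such an object depends on how much of the real HDUList interface they touch, which this sketch does not try to cover.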

Cadair commented 2 years ago

> Astropy does not support tile decompression

To be pedantic, it does, it just doesn't support decompressing a single tile. FTR, fitsio does support this.

I think there is a lot of promise in integrating zarr and ASDF to get the best of both worlds (chunked arrays and rich metadata with excellent support for astronomy). I know the asdf library people are looking into this as time allows, and it's something that's on my long-term list as well.

MSKirk commented 2 years ago

We were exploring some related development with NVIDIA to create a direct-to-GPU FITS reader. I don't remember the specific roadblock, but the engineers on the project were concerned that the non-uniform number of FITS HDUs across files would be an obstacle to creating a generic tool. I remember that handling nonstandard metadata in the header was also a concern on their part, but I doubt we will ever eliminate that.

martindurant commented 2 years ago

> fitsio does support this.

I seem to recall that FITS has tons of features in the spec, but no one uses them in practice. The same is true for HDF5, another long-term organically-grown monolithic implementation.

As for ASDF, it's a pity to design a new specialist data format for use by just one field. I know that, in theory, it's generic, but no one is looking at it from, e.g., HDF. It would still need whole-community buy-in to transition, and if it doesn't work well remotely, it fails. Interestingly though, it will be another kerchunk target, so it would be possible to view both FITS and ASDF as zarr: single files, thousands of files, or even a mix of formats as single datasets.
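To illustrate the "view FITS as zarr" idea: a kerchunk-style reference set maps zarr keys either to inline metadata or to a `[url, offset, length]` byte range in the original file. The sketch below is hand-rolled, not kerchunk's own FITS scanner, and assumes a single uncompressed image stored right after one header, with no BSCALE/BZERO scaling. The function name and the URL are hypothetical.

```python
import json
import math

BLOCK = 2880  # FITS headers and data units are padded to 2880-byte blocks


def fits_image_refs(url, header_bytes, shape, dtype, name="data"):
    """Describe one uncompressed FITS image as a single zarr v2 chunk.

    header_bytes: size of the header as stored (a multiple of 2880)
    shape:        array shape in C order, e.g. (NAXIS2, NAXIS1)
    dtype:        big-endian numpy-style string such as ">f4"
    """
    assert header_bytes % BLOCK == 0, "header must fill whole blocks"
    itemsize = int(dtype[2:])
    nbytes = math.prod(shape) * itemsize
    zarray = {
        "zarr_format": 2,
        "shape": list(shape),
        "chunks": list(shape),  # the whole image is one chunk
        "dtype": dtype,
        "compressor": None,
        "filters": None,
        "fill_value": None,
        "order": "C",
    }
    chunk_key = ".".join("0" for _ in shape)
    return {
        "version": 1,
        "refs": {
            ".zgroup": json.dumps({"zarr_format": 2}),
            f"{name}/.zarray": json.dumps(zarray),
            # The data bytes start right after the header blocks.
            f"{name}/{chunk_key}": [url, header_bytes, nbytes],
        },
    }


# Hypothetical 100 x 200 float32 image behind a two-block header.
refs = fits_image_refs("s3://example-bucket/image.fits", 2 * BLOCK, (100, 200), ">f4")
print(refs["refs"]["data/0.0"])
```

Extensions, tables, and tile-compressed images would each need their own offsets and decoding, which is where a real scanner earns its keep.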

> We were exploring some related development with NVIDIA to create a direct to GPU FITS reader.

Is this work preserved anywhere?

> handling nonstandard metadata in the header

Any idea what? I would have thought the job of the reader is to pull arrays into memory, independent of the metadata; but isn't all FITS metadata plain text anyway?
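On the plain-text point: a FITS header is a run of 80-character ASCII "cards" padded out to 2880-byte blocks and terminated by an END card. A simplified sketch of reading one follows; the function name is made up, values are kept as raw text, and the comment split on "/" is naive (it would break string values containing a slash, and CONTINUE cards are ignored).

```python
CARD = 80  # every header record is exactly 80 ASCII characters


def parse_header_cards(raw: bytes) -> dict:
    """Return {keyword: value-as-text} for the value cards in a FITS header."""
    header = {}
    for start in range(0, len(raw), CARD):
        card = raw[start:start + CARD].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            break  # everything after END is padding
        if card[8:10] != "= ":
            continue  # COMMENT, HISTORY and blank cards carry no value
        # Naive: drop anything after the first "/" as the inline comment.
        header[keyword] = card[10:].split("/", 1)[0].strip()
    return header


def make_card(text: str) -> bytes:
    """Pad one card to 80 characters (helper for the example only)."""
    return text.ljust(CARD).encode("ascii")


raw = (
    make_card("SIMPLE  =                    T / conforms to FITS standard")
    + make_card("BITPIX  =                  -32")
    + make_card("NAXIS   =                    2")
    + make_card("END")
).ljust(2880, b" ")
header = parse_header_cards(raw)
print(header)
```

The layout of the array itself follows from a handful of standard keywords (BITPIX, NAXIS, NAXISn), so nonstandard cards should only matter to a reader that tries to interpret them.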

MSKirk commented 2 years ago

> Is this work preserved anywhere?

Short answer is that I am not sure. We were working with Oded Green and Kristopher Keipert from NVIDIA back in November of last year. The email thread went stale in December. I can kick it again if you think it would be useful.

martindurant commented 2 years ago

@MSKirk, I don't know if it's useful; I don't have much insight into astronomy on the GPU - far too much of a generalist these days!