open-gamma-ray-astro / gamma-astro-data-formats

Data formats for gamma-ray astronomy
https://gamma-astro-data-formats.readthedocs.io
Creative Commons Attribution 4.0 International
29 stars 27 forks source link

HEALPix Formats #80

Open woodmd opened 7 years ago

woodmd commented 7 years ago

I have posted a first draft of a proposal for HEALPix FITS formats to this page. This thread is intended to collect comments and feedback on this proposal.

The current draft spec was based mostly on the conventions we currently use in the Fermi Science Tools. The HEALPix-specific ST functionality is not yet widely used so we are also open to updating the ST implementation to reflect whatever we agree on for this specification.

tburnett commented 7 years ago

I'd like to have a format that would support pointlike. Currently a pointlike binned data file has the following format with HDU tables for BANDS, PIXELS, and GTI BANDS

For 8 years of data, front/back 4 bins per decade from 10 MeV to 1 TeV, the file length is 120 MB; with the PSF event classes it is 165 MB.

cdeil commented 7 years ago

@woodmd - This is what I mentioned earlier:

I'm not sure if HIPS has anything concretely that affects our specs here. I'm just mentioning it as an example of a multi-resolution format built on top of HEALPIX. They do have "cubes" (see section 4.3), but there is not much to their cube format, and it's not directly applicable to what we have.

The NUNIQ pixel ID scheme might be something to adopt if we ever want to have a format that supports multi-resolution HEALPIX in a single table. It could be a change to the current sparse format: http://gamma-astro-data-formats.readthedocs.io/en/latest/skymaps/healpix/index.html#sparse-format currently to support multi-resolution pixels (within one band), or just a new format added later if deemed useful for some use cases.

eacharles commented 7 years ago

I discussed the proposal for quite some time with Matt yesterday. Although there are a lot of nice things there, there are two things that I'm not really in favor of.

1) On a sort of philosophical level I think that the proposal breaks what I had been thinking of as the modularity of the HDU. Up till now we had the model that 1 HDU more or less corresponded to a single image. (In some cases this was a multi-dimensional image that included energy, or different colors, or whatever but still, the general principle was 1HDU = 1IMAGE). This is nice because it means that you can always use ftools or astropy or whatever to manipulate files without too much complication.
While there are some advantages that might be gained by having, for example, a time series of images stored in a single HDU, I think that in practical terms the added complexity needed to move away from the model of 1 HDU <-> 1 Image will far outweigh the benefits.

2) I think that the way the BANDS hdu is being used for the SPARSE format will be very problematic down the road.

Specifically, for the SPARSE format, the proposal requires the user to access the BANDS to get the slicing information about the pixel index needed to unroll the index.

Although it is true that you already have to read extra HDUs to get things like energy bounds, a very important distinction is that these extra HDUs only define the binning of the data, not internals of the data structure. I.e., you can treat the data as an n-dimensional array, and then tack on the bin edges after the fact once you need them. On the other hand, with the proposal here, simply in order to read the data into an n-dimensional array requires you to also access the BANDS table. This means, that for example, to copy a single image from one file to another will require you also to copy the BANDS table. In practice that is often not the case now, as typically the other information in the BANDS table (or the ENERGIES or EBOUNDS) table remains the same for the entire analysis chain.

woodmd commented 7 years ago

I'd also prefer to keep all image data in a single HDU but the current design is partially constrained by the limitations of the FITS format. The solution for other binary formats is to have a hierarchical structure that groups associated data tables together. FITS does support HDU groups (see https://heasarc.gsfc.nasa.gov/fitsio/c/c_user/node60.html) but it's not clear to me how well this functionality is supported by different FITS implementations (for instance I'm pretty doubtful that the FITS I/O layer in the STs currently supports this) and it's not a feature of FITS that I have much familiarity with.

Note that I do think it will be valuable to have the flexibility to write maps of different geometries to the same file. This would allow FITS files to be used more as general purpose data containers and would reduce the amount of file bookkeeping required of the user. For instance currently for multi-component LAT analyses one has to keep track of separate counts, exposure, FT1, and source map files for each component. Being able to put counts maps for all components in a single file seems like it could be a nice convenience. Note that the spec as currently written should be entirely backwards-compatible with the current convention of one EBOUNDS table per file and in principle no changes would be required of existing software like the STs or ctools.

With regard to the SPARSE format Eric preferred to keep to the convention of having one column per energy slice (band) which for variable-sized images means transposing the table to a single row. Further one could use columns of the same table to define quantities that are currently written to the BANDS table (e.g. energy bounds, nside, npix etc.). This arrangement has the nice feature that it is entirely self-contained in a single HDU and doesn't require one to look in another HDU to figure out the band index mapping. The potential disadvantage is that when dealing with a large number of bands (e.g. as for a time series) one may run into the limit on the number of columns in a FITS BinTable (999). A workaround might be to have the option to put all bands into a single column. However with this one would lose the ability to perform random access into individual bands.

I did some performance comparisons between the current proposal and the alternatives I discussed with Eric:

I performed all tests with the astropy.io.fits reader with memmap=True. I found that getting consistent performance numbers is quite challenging and seems to be heavily dependent on the type of disk access (local or NFS-mounted). However I was able to conclude that the relative speed among these solutions differs by no more than a factor of 2-3. Given that disk I/O is a pretty small part of the runtime for photon analysis the guiding factor for choosing a format should probably be convenience and simplicity rather than raw performance. On this basis I'm now leaning more toward going with a column-wise SPARSE format but I'm curious to know if other people have thoughts on this.

cdeil commented 7 years ago

@woodmd did some HEALPIX spec updates in a65a70d05d06fdb45c85ac82a575217a4a364ea3 and updated the example files in a1a4e3fbcde01b864cea7baa3241b19691fdf433 . So everyone - please have a look at the latest spec here: http://gamma-astro-data-formats.readthedocs.io/en/latest/skymaps/index.html

@jknodlseder - Maybe you also want to have a look?

Personally, I'm not sure I have much to add to the discussion, but I'll make some comments:

@eacharles - I agree that having "one image = one HDU" is very important. But I think this was lost already with EBOUNDS and ENERGIES being in separate HDUs, and there's no simple way to join that into one HDU. (Did anyone talk to FITS WCS experts and try to embed it as a table WCS axis in the header or change to BINTABLE in order to join them?) Is it really important to be able to access pixel data in the SPARSE HEALPIX format without accessing the BANDS table? Are there important use cases to work with the pixels at all without knowing what energy band or event type they are? That said, I haven't thought about SPARSE HEALPIX much, and don't have a preference for how it's serialised.

Note that I do think it will be valuable to have the flexibility to write maps of different geometries to the same file. This would allow FITS files to be used more as general purpose data containers and would reduce the amount of file bookkeeping required of the user. For instance currently for multi-component LAT analyses one has to keep track of separate counts, exposure, FT1, and source map files for each component. Being able to put counts maps for all components in a single file seems like it could be a nice convenience.

For me, this would be more important to look at: what do the WCS-based maps formats look like? Can the ENERGIES / EBOUNDS / BANDS table be used for them 1:1 as they are now defined in the HEALPIX maps section? What would be your proposed serialisation format for the pixels from the WCS data (e.g. for different event types as extra axis, or separate IMAGE HDUs, or change to BINTABLE)? Do we want to support non-homogeneous geometry across energy bins or event types there as well? I assume yes for efficiency / flexibility, and then extra axis in an IMAGE HDU doesn't work and we need a BANDS table pointing to multiple 2-dim image HDUs with the pixels?

With whatever solution we end up with, I think it's important to clearly state if having multiple maps in a given FITS file is supported or not. And if yes, given an example file that does have multiple maps in a single file, and state how tools can figure out which HDUs (pixels, ENERGIES, EBOUNDS, BANDS) belong to which sky map.

woodmd commented 7 years ago

I think WCS-based maps could probably use most of the conventions of the HEALPix maps:

There's currently a placeholder for documenting the WCS-based map formats (see http://gamma-astro-data-formats.readthedocs.io/en/latest/skymaps/wcs/index.html) so it would probably be good to start working on that in parallel with the HEALPix formats.