open-gamma-ray-astro / gamma-astro-data-formats

Data formats for gamma-ray astronomy
https://gamma-astro-data-formats.readthedocs.io
Creative Commons Attribution 4.0 International
30 stars 27 forks source link

SED and Binned Likelihood Formats #45

Open woodmd opened 8 years ago

woodmd commented 8 years ago

This is an open thread to collect comments and feedback on the proposed formats for Binned Likelihood Profiles. Since these formats are effectively just SEDs with extra information in each bin it would probably be a good idea to try to share conventions with the Flux points format. @zblz Did you have any thoughts on this?

Here are a few open questions:

cdeil commented 8 years ago

Do we need a column to specify the bin center?

Yes, please.

E.g. most HESS spectra have been published with the points not at the log bin center, but using this weighting method: https://gammapy.readthedocs.io/en/latest/tutorials/flux_point/index.html

I'm not saying that particular method is useful, but I think there are use cases where people want to specify the center and it's easy to support.

In Gammapy we currently have this class for flux points and just call this column ENERGY: https://gammapy.readthedocs.io/en/latest/api/gammapy.spectrum.DifferentialFluxPoints.html

For me, E_CTR or whatever you propose is fine ... we'll just adapt Gammapy to whatever is put in this spec.


I do have some thoughts on the other points you raise, but don't feel strongly either way about any of them, so won't comment for now.

My main whish (see #6) would be that we create a flux point spec. Flux points are even more common than what you've put here, so having a serialisation format specified for interchange is important. And it's not clear to me if that flux point spec would be separate from this spec, or a subset? So the discussion e.g. which columns should be optional is related to the question whether flux points are a separate spec.

@woodmd - Do you have thoughts on #6 and how it relates to this spec, or even time to write that down also?

woodmd commented 8 years ago

Thanks for the feedback. I already added the E_CTR column since as you point out not everyone may want to use the geometric center. I also dropped E2DFDE but kept the other reference columns for now. One thing that's currently missing is some specification of the spatial/spectral model in the FITS header. The simplest approach would be to pack everything into a string which the user could parse to extract the model definition, e.g. something like the following:

MODEL='{SpectrumType:PowerLaw,Index:2.0,SpatialType:PointSource}'

However for this to work we need to agree on a convention for serializing the model definition. For now we could let everyone use their own convention and have this just serve as an annotation. One advantage of having REF columns in different units is that the SED can be interpreted without knowing the exact form of the spectral model that was used to extract it.

The flux point spec could easily be defined as a subset of the bin-by-bin likelihood spec. The Likelihood SED is basically just an SED with a likelihood profile attached to each bin so I think it would be logical to make the flux point spec a subset of this one. Basically you would only need to drop the NORM_SCAN and DLOGLIKE_SCAN columns although you may want to make some of the other columns optional. If you think it would be useful I'd be happy to upload a draft of the flux point spec that uses a subset of the columns from this spec.

cdeil commented 8 years ago

If you think it would be useful I'd be happy to upload a draft of the flux point spec that uses a subset of the columns from this spec.

Yes, would be very useful to have!

Fermipy and Gammapy and ctools can produce spectra in that format, and then people can put them in SED modeling e.g. with Naima. Of course that's possible now, but one always has to reformat a bit, i.e. adjust column names or units, which is annoying.

So technically the simplest solution is to duplicate the info on the columns on two separate pages, right? (As opposed to trying to make one page or having one as the diff wrt. the other.)

One thing that's currently missing is some specification of the spatial/spectral model in the FITS header.

That is a very difficult and potentially very complicated and large task. If you want to describe this a bit or already give suggestions how to put some simple info about spectral or spatial models (and maybe spectrum analysis method, like aperture photometry or spatial model fit) that covers many use cases, OK. But I also think it would be OK to punt on this for now (maybe with the exception of the assumed spectral index and assumed power-law model within the bin).

woodmd commented 8 years ago

I committed a few updates to both the Likelihood SED and Likelihood SED Cube formats after iterating with @eacharles. We made a few changes to the column names (e.g. changing E_CTR to E_REF) and also made a clearer delineation between required and optional columns (although this is still open for discussion). We also decided to split the SED Cube across two HDUs:

As you requested I also made a draft of a flux point format spec that uses a subset of the columns from the likelihood SED format spec.

Comments would still be welcome at this stage -- in particular it would be useful to know if people think this format contains everything that would be needed for sharing spectral analysis results. @cdeil - could you see using the flux point format in gammapy for instance?

cdeil commented 8 years ago

@woodmd - Thanks!

Everyone -- the flux point spec is now here, please comment!

Especially @zblz - could you have a look at that flux point spec, please? Is there any chance that you give up the flux point data format page in the Naima docs (see here) in favour of the one here (after it's been improved / works for all your use cases)?

Also @joleroi as the one working on spectra in Gammapy at the moment, could you please have a look?

cdeil commented 8 years ago

Here's some comments from me on the flux point spec:

zblz commented 8 years ago

Taking a look at the flux point format, it is my impression that the use case you are thinking for it is slightly different than the one I envisioned for the naima one. This spec is brilliant to save and share the results of an analysis that has just been done (in gammapy, 3ML, etc.), whereas the one I developed for naima had the main goal of being a format to input external SEDs into naima. Because of that, the naima one is much more lax, and for that use case I think it has to be like this. For example, when taking a multiwavelength SED from several different papers, it is highly unlikely that the user has the info for all the required columns in the spec (emin, emax, eref, differential and integral fluxes and model flux, error in the normalization rather than in flux, etc). As @cdeil already mentioned, if the required columns where reduced significantly I would definitely be able to use it in naima.

A couple additional comments:

cdeil commented 8 years ago

I think it would be useful to have a unit keyword for columns instead of hardcoded units (MeV, MeV cm-2s-1, etc) in the spec. For naima I have found it really useful when fitting data from radio to TeV, as all conversions are then done within naima instead of having to change the data tables.

There's pros and cons to fixing units. E.g. if a covariance matrix is stored (not the case here), it can be difficult to understand the entries if the units aren't fixed and tools can change the units on read/ write. But I agree, here and almost everywhere else the units shouldn't be fixed, but stored in the file.

The problem is what to write here in the spec though?

@zblz or @woodmd - Do you have a suggestion how to specify physical dimensions or units in the spec?

There should also be a plain-text spec, whether in ECSV, Ipac, or siimilar.

I agree that it should be possible to save flux points in e.g. ECSV.

But we should avoid having to write a separate spec for this. Instead, we should clarify that we are only specifying header key and column field names and semantics and it's possible to read / write this to FITS, but also other formats that support such fields and headers, e.g. ECSV.

joleroi commented 8 years ago

For example, when taking a multiwavelength SED from several different papers, it is highly unlikely that the user has the info for all the required columns in the spec

I completely agree. While the current spec is very useful to save results of a spectral analysis, IMO the use case of taking some flux points from a paper or catalog should also be covered.

woodmd commented 8 years ago

Let me try my best to address the comments thus far:

Currently there's a heading Flux points and right after SED. We should pick one heading and then explain in the main text that the other term is used equivalently instead of having two headings.

This was partly intentional. The SED format is explicitly defined as a binned format whereas it seems the flux point formats currently used by gammapy and naima are intended for unbinned differential flux points. Perhaps a solution for reconciling the two formats would be to define flux points as a subset of the SED format. e.g. the only required columns would be a sequence of energies and differential fluxes (e.g. E_REF, REF_DFDE, and NORM in the current version of the SED format).

I think that too many of the columns are required at the moment, more should be optional. The proposal here of having energy, flux and flux error as the required columns is better IMO, I would have even put flux error as optional.

See my previous comment about the difference between binned and unbinned flux formats. For a binned format I think it's hard to drop any additional columns. Regarding the normalization I think we should require at least one integral and one differential measure of the flux. If we only want to require one measure of integral flux then I would vote for using energy flux rather than flux.

Specifically from the flux spec alone it's a bit cryptic what the "reference model" is. "REF" is first used for the reference energy, which has nothing to do with the "reference model". And then "REF" is added everywhere to refer to the "reference model". Is that really needed?

We originally called this column E_CTR but since this energy can be anywhere in the bin we thought this might be misleading. I'm open to alternative suggestions for the name of this column if you have one. Maybe E_DFDE? I would prefer to keep "REF" in the column names for the reference model quantities in order to avoid any confusion with measured values.

I think it would be useful to have a unit keyword for columns instead of hardcoded units (MeV, MeV cm-2s-1, etc) in the spec.

I fully agree that the units should not be hardcoded. I only put the units in the spec to indicate the dimensionality of each column.

There should also be a plain-text spec, whether in ECSV, Ipac, or siimilar.

Since BINTABLE shows up quite frequently I think it would be easier just to define a convention for converting a BINTABLE to these other formats. I believe astropy already has its own convention for this, i.e. a BINTABLE loaded with astropy.table.Table can be written to a ecsv or ipac file.

@zblz or @woodmd - Do you have a suggestion how to specify physical dimensions or units in the spec?

My suggestion is that the spec should define the dimensionality and maybe a "recommended" physical unit. That said if the units are defined in the metadata it really shouldn't matter what units are used. I defined all the units in MeV but I doubt that the IACT folks will want to move away from using GeV/TeV units.

For me DIFF_FLUX and INT_FLUX would be more intuitive than REF_DFDE and REF_FLUX. From the spec it is not clear to me what the NORM column contains

The NORM columns contain the given quantity normalized to the amplitude of the reference model. The NORM vs REF columns are motivated by the fact that measured fluxes can be zero in which case one cannot use the measured values to extract the conversion between differential and integral flux.

kosack commented 8 years ago

Just a comment about the units/quantities: it may be a bit more "VO" than FITS, but UCDs (uniform content descriptors) are designed exactly for that purpose: they allow you to specify what physical quantity is associated with a column, without specifiying the exact unit (which may be up to the user). It allows a semantic description like "flux" or "energy". I think there is even a FITS header keyword for columns that lets you specify it (not all tools understand it of course).

The headers look like this:

TBUCD1  = 'meta.id;meta.main'      

see e.g. https://groups.google.com/forum/#!topic/astropy-dev/8TxV-CLnF1w and the IVOA documentation on UCDs

jknodlseder commented 8 years ago

Agree with Karl that we should look into the UCDs. However, they don't solve the units problem. Having played a bit with VO it looks like an unsolved problem on their side (UCDs help you to understand the physical quantity, but they don't tell you in what units the quantity is given).

jknodlseder commented 8 years ago

Here a link to a VO initiative about units: http://ivoa.net/documents/VOUnits/20140523/VOUnits-REC-1.0-20140523.pdf

woodmd commented 8 years ago

Having the UCDs sounds useful but I think I'd prefer to defer this to a future version of the spec. For the moment I only see this format being used for LAT and IACT analysis so I think we can probably safely rely on people knowing what dimensionality flux, energy, etc. should have. For the spec page itself I think having a "suggested" unit should be sufficient to indicate what dimensionality each column should have.

I'm still curious what people thought of the idea of creating an unbinned flux spec that uses a subset of the columns from the current binned flux (SED) spec. The unbinned flux point spec would only require three columns (E_REF, NORM, and REF_DFDE) with columns for errors, ULs, etc. designated as optional. If people are happy with this proposal I can go ahead and draft something for the flux points spec page. If people have strong feelings about the column names it would also be useful to get feedback on that soon since we are currently preparing a new ST release that will use the new column names.

cdeil commented 8 years ago

I'm currently in the process of collecting all published flux points from HESS into ECSV tables and for now I'm using this format: https://github.com/gammapy/gamma-cat/blob/master/input/schemas/sed-format.md#format

I'd like to use a format with a spec here.

http://gamma-astro-data-formats.readthedocs.io/en/latest/results/flux_points/index.html#required-columns currently doesn't work for this use case, because too many columns are required there.

Often, only energy and flux and flux_error are published, e.g. https://www.mpi-hd.mpg.de/hfm/HESS/pages/publications/auxiliary/VelaX_auxinfo.html

You can find many more cases of published HESS flux points here https://github.com/gammapy/gamma-cat/blob/master/todo/todo_hess_aux.md and see how I put that in more uniform, machine-readable tables here: https://github.com/gammapy/gamma-cat/tree/master/input/papers

I think my preference would be for http://gamma-astro-data-formats.readthedocs.io/en/latest/results/flux_points/index.html#required-columns to support this use case and make more (all?) of the columns optional there. I can use it now, but then I'd have to fill several columns with nan or some other magic number and have to invent my private convention or header keyword to declare that those columns really aren't available.

If people really prefer, we can make a new format for this. But I don't see why that would be better than one format that's more flexible with more optional things. Then the question becomes when to use which and when to switch from the "simple" to the "full-featured" format.

@woodmd @joleroi and all - Thoughts?

joleroi commented 8 years ago

I'm still curious what people thought of the idea of creating an unbinned flux spec that uses a subset of the columns from the current binned flux (SED) spec. The unbinned flux point spec would only require three columns (E_REF, NORM, and REF_DFDE) with columns for errors, ULs, etc. designated as optional.

Looks like we're basically on the same page

woodmd commented 8 years ago

I would be in favor of creating a reduced/simplified format that would be appropriate for the scenario that @cdeil describes. If you're happy with the unbinned format I described I could go ahead and commit what I had in mind and we can iterate on that further.

One question that still remains in my mind is connected to the discussion in #61. Should the format require a particular dimension for the normalization quantity -- for instance requiring it to be a differential flux? Are there other kinds of differential quantities that we would ever want to support with this format?

cdeil commented 8 years ago

@woodmd - Thanks for volunteering to do the work and updating / adding to the spec here!

Before you do: I'm still in favour of just making more columns of the existing format optional. To me, that seems like the simplest solution both for the spec. What we do here is to define field names and semantics, so that we have a format for data exchange. Tools can then easily check which fields are present for a given SED and process or plot it.

You wrote:

The SED format is explicitly defined as a binned format whereas it seems the flux point formats currently used by gammapy and naima are intended for unbinned differential flux points. Perhaps a solution for reconciling the two formats would be to define flux points as a subset of the SED format. e.g. the only required columns would be a sequence of energies and differential fluxes (e.g. E_REF, REF_DFDE, and NORM in the current version of the SED format).

E.g. Fermi catalogs only contain integral flux per bin. TeV instruments usually only publish differential flux measured in energy bins. In both cases we do (often) want to store e_min and e_max, i.e. the energy bin. I don't consider either of those "unbinned" and I think they fit well with the fields you have already defined.

+1 to use a subset of the SED format.

But apart from maybe E_REF you can't have any required columns, they must all be optional, to support storing the info given for flux points in the Fermi catalogs or as published by TeV instruments.

You also wrote:

See my previous comment about the difference between binned and unbinned flux formats. For a binned format I think it's hard to drop any additional columns. Regarding the normalization I think we should require at least one integral and one differential measure of the flux. If we only want to require one measure of integral flux then I would vote for using energy flux rather than flux.

I don't understan why it's hard to change the columns from required to optional. Technically, in the spec, it's super-easy.

For tools, well, they can just access the fields they need and give an error if one is missing. Yes, there is some benefit to having more required fields you can rely on there. But I don't think it's enough to warrant introducing a new "format" which is just the same, only with more columns optional.

Am I not getting something?

If there's no agreement on this for now, then I'm still +1 to go ahead and define the simple flux point spec now, and then discuss this question again at some future telcon or f2f meeting.

woodmd commented 8 years ago

So you would advocate having a single SED format with all columns except E_REF as optional? My worry is that by allowing too much freedom in which columns can be included you're going to lose a lot of the advantages of having common data formats in the first place. In my view the goal of these data format specs is not only to agree on column specifications but also to define the format in such a way that one doesn't need package-specific I/O methods. That is a file in format X should be interoperable with all packages that provide I/O methods for that format. Once you are checking for the existence of certain columns to decide whether you can parse a file you are effectively defining a new format since there's nothing in the specification itself that would tell you that those columns are required for your package.

For now I'd prefer to go with a separate flux point spec keeping in mind that we might revisit this in the future.

cdeil commented 8 years ago

So you would advocate having a single SED format with all columns except E_REF as optional?

There are publications with only E_MIN, E_MAX and REF_FLUX given. And there are publications with only E_REF and REF_DFDE given. Zero overlap.

I'd like to be able to store both of those flux points, I think these are valid use cases and we should support them via this format.

So I'm advocating to make them all optional.

In my view the goal of these data format specs is not only to agree on column specifications

I disagree. I think the main value this spec has is to define field names and semantics. And it should be flexible enough to support all common use cases of how people share SED data.

We already have optional columns, and tools will give an error message if the user requests some operation that requires that data and it isn't present. Yes, this is some extra burden on tool writers, but not much and I prefer it to a separate spec.

It's really easy to make all columns optional in the spec. Just change the "Required columns" to "Columns" and mention with one paragraph why for this format all are optional with the example I give here: support published integral and differential flux points.

For now I'd prefer to go with a separate flux point spec keeping in mind that we might revisit this in the future.

How would that look like? It would just be a copy with one extra comment that in this second spec all columns are optional, no?

cdeil commented 8 years ago

I have a comment about the field names for http://gamma-astro-data-formats.readthedocs.io/en/latest/results/flux_points/index.html

I find the use of REF confusing.

Currently there's E_REF for the reference energy used as a postfix, and then there's REF used for the "reference model" used as a prefix for many fields:

The following fields also depend on the "reference model", but don't have the REF prefix:

As far as I can see there's always just one "reference model", never several reference models. So my suggestion would be to drop the REF prefix everywhere because it's not needed. This would also avoid the possible confusion with the use of REF to refer to the "reference energy".


Another comment I have is why only NORM gets to have errors, but the fluxes don't. This is insufficient for the published flux point use cases I mentioned above.

My suggestion would be to always define _ERR, ERR_HI and ERR_LO and UL post-fixes to be valid for fluxes that have measured errors.

@woodmd - I guess your original use case for this was for spectral analysis results where the main quantity is NORM that's fitted, and then you only have the flux fields as convenience, but assume that error propagation is always such that those can be computed from the NORM errors?

So could you please make a PR with a new proposal (either modify extend the existing spec like I suggest or make a second one like you suggested) on this? Or should we do a quick Google hangout (including @joleroi and anyone else that's interested in this) to discuss and hopefully find a solution on this that sounds good to everyone interested in the coming days? )This is borderline complex and time consuming to discuss back and forth via Github.)

@woodmd and all - Thoughts?

woodmd commented 8 years ago

It seems like a dedicated call might be the most efficient way to sort this out. I'm available anytime this week after 8 AM PST.

cdeil commented 8 years ago

Here's a Dudle (German ad-free Doodle) to find a time: https://dudle.inf.tu-dresden.de/gamma-sed/

cdeil commented 8 years ago

Google hangout tomorrow, Thu, Sep 8 at 9 pm CEST

Everyone interested is welcome to join!

I hope I did the Google calendar event with added Google hangout scheduled correctly!?

https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=cTZucXU0bTRyaGFqMmFiam8waGgzN3FmNTggZGVpbC5jaHJpc3RvcGhAZ29vZ2xlbWFpbC5jb20&tmsrc=deil.christoph%40googlemail.com

cdeil commented 8 years ago

https://hangouts.google.com/call/p5ib2noogndwzcuwlebe7ujmsye

cdeil commented 8 years ago

Yesterday @woodmd , @cboisson and I had a Google hangout and discussed this. Here's the summary:

@woodmd @cboisson - Please add or clarify as needed. Everyone - comments welcome!

Jvinniec commented 4 years ago

Before I comment, here's my understanding of how the format has been setup:

Provided my understanding is correct, it seems that there is degeneracy between the two representations when you want to have both SED points with the likelihood profile in the same extension. You end up with duplicated columns for the flux (i.e. ref_dnde and dnde). On the other hand, if the goal is to require putting the SED points and the likelihood profile in different extensions then you are still duplicating information in the file, but now potentially also duplicating the energy binning information (e_ref, e_min, and e_max).

My suggestion would be to make the likelihood table format be an expanded version of the SED points format. Something like what's done for the SCANDATA and EBOUNDS tables:

So you basically require an EBOUNDS table and add the SCANDATA table if a likelihood profile is desired. In this way, the flux value given in the SED table serves as the reference flux for the values in norm_scan as well as containing the energy boundary information, no duplication necessary.

I would also prefer to settle on a convention of column names all being uppercase or lowercase regardless of where they appear in the file.

lmohrmann commented 4 years ago

@Jvinniec I don't have a strong opinion here, but this should obviously coordinated between tool developers.

@registerrier or @adonath for Gammapy: thoughts?

adonath commented 4 years ago

@Jvinniec I'm not sure I understand the motivation for your proposal.

The format for the SED likelihood profiles for flux points is fully documented here. For flux points data is always stored in a table. In the likelihood format the information is contained in the norm_xxx columns. Given a reference flux such as ref_dnde, ref_flux, ref_eflux etc. the corresponding quantities such as dnde, eflux and flux can be derived with a simple multiplication (dnde = ref_dnde * norm) and can be stored as optional columns in the table. In this sense the information is duplicated and makes the likelihood profile format a superset of all the other formats. But as the columns are optional and never take much storage, I think there is no problem in "duplicating" the information and pre-compute those columns for the user. Energy information such as e_ref, e_min and e_max is never duplicated and only contained once in a single flux points table. The norm_scan and dloglike_scan columns are array valued columns. So for flux points there is always just one BinTableHDU extension, independent of the format (dnde / flux / likelihood etc.)

The EBOUNDS and SCANDATA hdu names you are referring to belong to a different format: the likelihood sed cube. This format is an extension of the concept of likelihood profiles to TS images (used by fermipy), with the totally different use case of (I think) source detection. What you have is basically one energy dependent likelihood profile per spatial bin, stored in a flattened BinTableHDU, with one spatial position per row. The separate EBOUNDS table is introduced (I guess) to match the fermi format for data cubes.

Jvinniec commented 4 years ago

@adonath Thanks for the response and the clarification. My question is actually based on what I understood after reading the documentation, but it's clear that I had things a bit mixed up so forget my format suggestions above.

I just want to make sure I understand the format correctly. All of the following should be valid ways of representing the SED table according to the format (assume e_ref, e_min, and e_max are always present):

SED_TYPE Columns included Description
A dnde dnde, dnde_err, dnde_ul Simplest representation
B dnde ref_dnde, norm, norm_err, norm_ul Equivalent to A, but using the ref_ column format
C dnde, flux, or norm (?) ref_dnde, ref_flux, norm, norm_err, norm_ul Similar to B, but extra flux representations are included
D likelihood ref_dnde, ref_flux, ref_eflux, ref_npred, norm, norm_err, norm_ul, ts, loglike, norm_scan, dloglike_scan Similar to B and C, but with added likelihood columns

Provided that's correct, I have a couple of questions:

  1. Are all possible ref_x normalization representations required in the likelihood SED? The documentation makes it seem like you have to specify ref_dnde, ref_flux, ref_eflux, and ref_npred. If not, does the format specify how to identify what representations are included?
  2. Is there a way to differentiate between (A), (B), and (C)? For example, could we make the SED_TYPE keyword into a comma-separated list of included representations? Something like:
    • A: SED_TYPE=dnde
    • B: SED_TYPE=norm,dnde
    • C: SED_TYPE=norm,dnde,flux
    • D: SED_TYPE=likelihood,dnde,flux,eflux,npred (also solves question 1)

This way a tool reading the SED files knows exactly what columns to look for without having to scan the column names. For example the following can then be done:

norm_reps = SED_TYPE.split(',')
if norm_reps[0] == 'likelihood':
   # likelihood columns included
   # assume 'ref_' notation for all flux reps present
elif norm_reps[0] == 'norm':
   # no likelihood columns
   # assume 'ref_' notation for all flux reps present
else:
   # no likelihood columns
   # assume separate norm, error, and upper limit columns for all flux reps present

Thoughts?