Clarify what kinds of arrays are allowed for IRFs

maxnoe commented 4 years ago

Currently the standard asks for the storage of two arrays of lower and upper bin edges.

In python:

# 50 bins between 1 GeV and 1 PeV
bin_edges = np.logspace(-3, 3, 51)
energ_lo = bin_edges[:-1]
energ_hi = bin_edges[1:]

same goes for theta

I would propose to change this for the next iteration of the standard to change this to just storing a single array with bin edges.

tibaldo commented 4 years ago

This seems dangerous to me. Are we guaranteed that the bins will always be contiguous? Is there only one unambiguous way in your proposed format to determine the edges of a bin? Also, this is inconsistent with well-established conventions, e.g. the EBOUNDS extension in OGIP (link). What are the needs that motivate your request exactly?

TarekHC commented 4 years ago

Yes, overlapping bins require the two arrays. This should not be changed.

cboisson commented 4 years ago

Yes, this should not be changed.

Le 27/03/2020 à 17:22, Tarek Hassan a écrit :

Yes, overlapping bins require the two arrays. This should not be changed.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/open-gamma-ray-astro/gamma-astro-data-formats/issues/145#issuecomment-605092053, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADY2AKTFEWHTC3HTITVHSZ3RJTHE5ANCNFSM4LVDOZQQ.

-- Catherine Boisson UMR 8102, Laboratoire Univers et THeories (LUTH) Observatoire de Paris-Meudon, F-92195 Meudon Cedex Direct Line: Tel. +33 1 45 07 74 36

maxnoe commented 4 years ago

Ok, I see that it's probably better to be kept as is, however I think we should then add some clarifications to the standard on what is allowed or expected and what not.

Are we guaranteed that the bins will always be contiguous?

We want to enable non-contiguous e.g. effective areas? What are the use cases where this is not the case? How would the science tools deal with non-contiguous effective areas @adonath @jknodlseder ?

Yes, overlapping bins require the two arrays. This should not be changed.

Again, I'd like to know some use cases here. I think overlapping bins might be usefull for interpolation or getting closer to something like "differential effective area" than "total effective area in bins of energy".

It would be good to collect references for this and make clear what is allowed in the standard and what is not.

In my opinion, we should only add well defined things to these standards. So if we want to give the possibility for non-contiguous effective areas or overlapping bins, this should be mentioned in the respective documents.

A standard that basically allows storing anything as long as it has the correct names and shape is not very helpfull if the semantics of what is stored are not clear.

jknodlseder commented 4 years ago

Currently the ctools / GammaLib simply take the bin centres and interpolate between them bi-linearly (for 2D arrays).

What's wrong with storing both bin edges for bins (whatever the bins are)? Memory should not be an issue, and this allows to know exactly how the bins are defined. And then everything is allowed (non contiguous and overlapping bins).

Le 28 mars 2020 à 11:56, Maximilian Nöthe notifications@github.com a écrit :

Ok, I see that it's probably better to be kept as is, however I think we should then add some clarifications to the standard on what is allowed or expected and what not.

Are we guaranteed that the bins will always be contiguous?

We want to enable non-contiguous e.g. effective areas? What are the use cases where this is not the case? How would the science tools deal with non-contiguous effective areas @adonath https://github.com/adonath @jknodlseder https://github.com/jknodlseder ?

Yes, overlapping bins require the two arrays. This should not be changed.

Again, I'd like to know some use cases here. I think overlapping bins might be usefull for interpolation or getting closer to something like "differential effective area" than "total effective area in bins of energy".

It would be good to collect references for this and make clear what is allowed in the standard and what is not.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/open-gamma-ray-astro/gamma-astro-data-formats/issues/145#issuecomment-605430877, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAW2QV7SM4IJZKKAZGFYJL3RJXJU7ANCNFSM4LVDOZQQ.

maxnoe commented 4 years ago

Ok, I assume that also answers what my next question would have been: why it was decided to use binary table extensions with a single row instead of image hdus for the irfs.

jknodlseder commented 4 years ago

Binary tables have proven flexible, effective and extensible for Fermi-LAT (the format is identical to the Fermi-LAT format)

Le 30 mars 2020 à 09:30, Maximilian Nöthe notifications@github.com a écrit :

Ok, I assume that also answers what my next question would have been: why it was decided to use binary table extensions instead of image hdus for the irfs.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/open-gamma-ray-astro/gamma-astro-data-formats/issues/145#issuecomment-605831529, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAW2QV6WCPI4ZZTBTQKC5PLRKBDDBANCNFSM4LVDOZQQ.

kosack commented 4 years ago

why it was decided to use binary table extensions with a single row instead of image hdus for the irfs.

I think also it's a question of binning - often you might have non-uniform binning in an IRF (even if right now we do not do so I think), for example to deal with lack of statistics at high energies. The IMAGE extension can't deal with that.

registerrier commented 4 years ago

Indeed science tools tend to take bin center from the bin edges and perform some type of interpolation. Yet, if no assumption is to be made on the binning (e.g. varying bin sizes or missing bins), there is no unambiguous determination of what is the bin center (linear, log, sqrt, other?). Depending on bin sizes, this will lead to significant differences in estimated values.

No IRF is differential in true energy/position. Therefore, when storing an IRF, one could consider that specifying one energy/position and the corresponding area/energy dispersion is less ambiguous. In particular, if MC simulations made to extract the IRFs were to be done at a single energy and position.

Standards such as OGIP/Caldb, consider that energy bins have to be small and that linear interpolation is the default: https://heasarc.gsfc.nasa.gov/docs/heasarc/caldb/docs/memos/cal_gen_92_003/cal_gen_92_003.html#tth_sEc7

There is no unambiguous interpolation scheme. Should one assume the response is flat inside a bin, linearly varying between two bin centers, log-linearly varying, log-log etc. We might need some specifications here as well.

maxnoe commented 4 years ago

@registerrier yes, this is exactly what I feared. That currently this standard is too under-specified to be useful on its own.

So at least adding ENERGY and THETA for the central point that should be used for interpolation would make sense, right? bin edges (low, high) could be optional then.

lmohrmann commented 4 years ago

@MaxNoe I think most commenters preferred to keep the low and high bin edges. @registerrier, you would propose to add some information on the default interpolation scheme to be used for any of the axes?

tibaldo commented 4 years ago

I think Régis has a valid point here. I am under the impression that volume is not an issue here. Would it be sensible and useful to add an extra column with a (recommended) reference energy for each bin, but also leaving the information of the energy boundaries for completeness?

maxnoe commented 4 years ago

Yes, this is what I proposed in my last comment, an additional ENERGREF or ENERGY or how we want to call it.

Btw, FITS column names are not bound to the uppercase and eight character limitatio of header keywords.

Is the another technical reason we limit ourselves to eight characters and uppercase?

adonath commented 4 years ago

In addition to what @registerrier commented:

Given the way we compute the IRFs, we should definitely keep the bounds information of the IRF bins. For me the only "strictly correct" way to interpret the information stored in the MC IRFs, is that the value represents the expected counts in a given parameter interval, which is an integrated quantity (even if divided by the bin width...). To keep the full information we should provide the integration interval.

In Gammapy we have two ways of interpolating the information stored in the IRFs:

Bilinear (sometime log-log or semi-log) interpolation of the mean values of the IRFs in the parameter bin, assuming either the linear or log center of the bin.
Quadratic interpolation of the cumulative distribution. This is an alternative method I have tried, that is closer to the "strictly correct" interpretation of the information I mentioned above. This methods avoids the need to define a "reference" value (or bin center...). It assumes that the value stored in the IRF corresponds to the integral in the given interval and assigns this value to the upper edge of the bin (assuming a linear distribution within the bin). This method also avoids the problem of extrapolation of the IRFs, because the cumulative distribution is 0 if evaluated outside the lowest bin edge and equivalent the total number of counts if evaluated outside the uppermost bin edge. An illustration of the method is here: https://github.com/adonath/notebooks-public/blob/master/histogram_resampling.ipynb

I think defining a reference value is only meaningful, if we provide an assumption on how the events are distributed within the bin in addition (in fact once this is defined the reference value can be easily computed from this...). In Gammapy we have a meta interp keyword for our MapAxis object which defines the scaling ("log", "linear", "sqrt") of the axis. This is then used to compute bin centers, logarithmic bin centers etc. and used to choose the interpolation method (e.g. linear, log-log, or semi-log) as well. I think providing this information on the "scaling" of the IRF axis in the meta data might be useful. On the other hand, in most cases there is a very plausible choice anyway (but I guess)

In Gammapy we already make assumptions of this kind e.g. background is interpolated in log-log in energy and linearly in spatial axes. The main advantage is a better precision when working with coarser bins. As long as the IRFs are binned fine enough (whatever this means) it should not matter anyway.

@MaxNoe I think so far there is no use-case for non-contiguous effective areas (or IRFs in general) and I'm not sure either, whether it will ever arise. To me the argument for the separate min and max bounds columns is simply that if the data is stored in a table, the column length is the equivalent to the column length of the data. If there was a column with edges, it would be of length N + 1 and you would need a fill value for the data...

maxnoe commented 4 years ago

@adonath actually, the IRFs are stored in a single rows in the tables.

The columns don't need to have the same shape, e.g. it would be perfectly valid to store:

Energy (51, ) Theta (11, ) Aeff (50, 10)

adonath commented 4 years ago

@MaxNoe Yes, you are right. I forgot the data is stored row-wise. For some reason I had in mind we store it as a flattened ND array in a table column.

open-gamma-ray-astro / gamma-astro-data-formats

Clarify what kinds of arrays are allowed for IRFs #145