CALDB and IRF keywords in IRFs

fabiopintore commented 3 years ago

The GADF format of the EVENTS does not take into account any keyword (neither mandatory nor optional) that relates the events to a given IRF (and corresponding calibrations). Here, it is suggested to add the keywords IRF and CALDB: the former can provide the name of the IRF (or its MC production), while the second can give the configuration status (for example, time of the IRF optimisation).

maxnoe commented 3 years ago

@fabiopintore Hi, thank you for making this proposal.

Just to note: there is a mechanism in GADF to associate EVENTS to IRFs: the HDU index tables documented here: https://gamma-astro-data-formats.readthedocs.io/en/latest/data_storage/hdu_index/index.html

bkhelifi commented 3 years ago

Does it mean that for GADF the HDU tables are mandatory? If yes, one might need the change the description of the HDU table. On the link that you provide, it is written "The HDU index table can be used to locate HDU"... It is not 'can be used', but rather 'IS USED'... A consistency might be needed, and beyond that a decision about the mandatory aspect of this index tables...

An other question open question: the name 'CALDB' is not generally used as a directory where data are stored? It is not used to store metadata describing how/when/by who the IRFs are made...

maxnoe commented 3 years ago

Does it mean that for GADF the HDU tables are mandatory?

No they are not mandatory in the standard. They are only needed if it is not clear which IRFs are used for which observations. The simplest way to associate EVENTS with their respective IRFs is just to have them in the same file.

However, analysis software may have to make additional requirements to work. I think for doing a gammapy analysis, at the moment the HDU Table and OBS Index tables are required and even some of the columns that are optional in the standard (becuse they just duplicate information that is mandatory in the HDU header of the corresponding HDU) are required by gammapy. This is however specific to gammapy and there are several issues open to make gammapy require less of the optional columns (e.g. in the case above by just reading the values from the HDU header instead of the index / obs table).

An other question open question: the name 'CALDB' is not generally used as a directory where data are stored? It is not used to store metadata describing how/when/by who the IRFs are made...

Yes, the proposal here is very vague on what the new IRF and CALDB header keys should actually contain. It is not clear to me, how a single IRF header can specify the IRF, since we have multiple IRF component HDUs, that all need to be found somehow, if they are not just in the same file.

The proposal about CALDB is completely unclear to me.

TarekHC commented 3 years ago

Hi all,

Thank you @fabiopintore for opening the issue! Associating events with IRFs is a long standing issue that we have discussed a lot. My 2 cents on the topic!

I'm not sure we want to add keywords to the EVENTS to associate them to IRFs. The reason why I believe this is not a good approach is because there could perfectly be several IRF components (from the same kind!) describing the same list of events. Example: For a given run, I would expect we could store both full-enclosure and point-like IRFs in the same DL3 file. Point-like IRFs will be better suited for a point-source analysis, while any 3D/extended analysis would need to use the full-enclosure ones.

For this reason, I feel it would make more sense to associate each IRF components to the specific event list they were "generated for". Until now, we just assume IRF components within the same file as an EVENTS table will essentially be associated to it, but as we know, there are many cases in which we want to separate EVENTS and IRF components.

For this reason, even if I fully agree this is an issue that needs to be addressed, I disagree with the proposed solution.

Just to note: there is a mechanism in GADF to associate EVENTS to IRFs: the HDU index tables documented here: https://gamma-astro-data-formats.readthedocs.io/en/latest/data_storage/hdu_index/index.html

As @maxnoe correctly pointed out, for describing each IRF component we should use the HDU index definitions we already have. Although, unless I'm mistaken, HDU index describes "the format" of each IRF component within a file, but does not associate specific events to IRFs.

maxnoe commented 3 years ago

Although, unless I'm mistaken, HDU index describes "the format" of each IRF component within a file, but does not associate specific events to IRFs.

It associates an EVENTS table to its IRFs via the OBS_ID column.

maxnoe commented 3 years ago

E.g. for the FACT data in the open crab sample:

In [4]: t[t['OBS_ID'] == 20131105212]
Out[4]: 
<Table length=4>
   OBS_ID   HDU_TYPE HDU_CLASS FILE_DIR       FILE_NAME            HDU_NAME    
   int64     bytes6    bytes8   bytes2         bytes21             bytes17     
----------- -------- --------- -------- --------------------- -----------------
20131105212   events    events       ./ 20131105_212_dl3.fits            EVENTS
20131105212      gti       gti       ./ 20131105_212_dl3.fits               GTI
20131105212     aeff   aeff_2d       ./         fact_irf.fits    EFFECTIVE AREA
20131105212    edisp  edisp_2d       ./         fact_irf.fits ENERGY DISPERSION

TarekHC commented 3 years ago

It associates an EVENTS table to its IRFs via the OBS_ID column.

I see. I feel current OBS_ID specifications are not clear for the case of an IRF, specially if this is the event <-> IRF association we want.

For instance: It is expected many instruments will generate separated EVENTS and IRFs as "productions" (and not run-wise) as done by the current generation of IACTs, LAT, and for CTA at the moment (e.g. CTA's first data challenge). Those IRFs will not have a given observation run associated to them, and therefore won't be compliant with this format.

@maxnoe is the case I describe above already covered?

maxnoe commented 3 years ago

I see. I feel current OBS_ID specifications are not clear for the case of an IRF, specially if this is the event <-> IRF association we want.

The observation ID is a property of the EVENTS table in the current version of the standard. The HDU index table is generated for a specific analysis and an IRF row with an OBS_ID mean "use this IRF for this OBS_ID" not that the OBS_ID is a property of that specific IRF.

maxnoe commented 3 years ago

For instance: It is expected many instruments will generate separated EVENTS and IRFs as "productions" (and not run-wise) as done by the current generation of IACTs, LAT, and for CTA at the moment (e.g. CTA's first data challenge).

Sure, as does the FACT example, we actually use the same IRF for all runs in the open crab sample. So every OBS_ID is associated witht he same irf in the IRF file. I should have shown more than the one OBS_ID in the file:

   OBS_ID   HDU_TYPE HDU_CLASS FILE_DIR       FILE_NAME            HDU_NAME    
   int64     bytes6    bytes8   bytes2         bytes21             bytes17     
----------- -------- --------- -------- --------------------- -----------------
20131103093     aeff   aeff_2d       ./         fact_irf.fits    EFFECTIVE AREA
20131103093    edisp  edisp_2d       ./         fact_irf.fits ENERGY DISPERSION
20131103093   events    events       ./  20131103_93_dl3.fits            EVENTS
20131103093      gti       gti       ./  20131103_93_dl3.fits               GTI
20131103103     aeff   aeff_2d       ./         fact_irf.fits    EFFECTIVE AREA
20131103103    edisp  edisp_2d       ./         fact_irf.fits ENERGY DISPERSION
20131103103   events    events       ./ 20131103_103_dl3.fits            EVENTS
20131103103      gti       gti       ./ 20131103_103_dl3.fits               GTI
20131103104     aeff   aeff_2d       ./         fact_irf.fits    EFFECTIVE AREA
20131103104    edisp  edisp_2d       ./         fact_irf.fits ENERGY DISPERSION
20131103104   events    events       ./ 20131103_104_dl3.fits            EVENTS
20131103104      gti       gti       ./ 20131103_104_dl3.fits               GTI

maxnoe commented 3 years ago

I feel current OBS_ID specifications are not clear for the case of an IRF,

There is currently no OBS_ID field for IRFs, there is an issue about this: https://github.com/open-gamma-ray-astro/gamma-astro-data-formats/issues/132

As said, at the moment there are two ways to associate EVENT tables with their IRFs:

By being in the same file
By using the HDU-INDEX table to give each EVENT its IRFs by using the OBS_ID of the EVENTS table.

The second allows every use case I can imagine, since it allows the association of a completely arbitrary IRF HDU in a specific file to each individual EVENTS table. Maybe I'm missing something, but that should cover everything.

The HDU INDEX table can be generated for each analysis (e.g. specifying that point-like IRFs should be used if both are available from the production of the IRFs)

TarekHC commented 3 years ago

Argh, ok. I was mixing in my head HDUCLASn keywords with the HDU index files that you were referring to. Its been a while since I took a look to this, sorry...

@maxnoe HDU index tables, as you say, are generated per analysis (do all science tools use them?). What is/could be missing is metadata within the IRF components describing to which events they may be applied to (which is what I was referring to above).

If we don't provide such metadata, there will be a high risk associated to people creating their own HDU index tables potentially using wrong IRF <-> events associations.

maxnoe commented 3 years ago

What is/could be missing is metadata within the IRF components describing to which events they may be applied to

Yes, but this is a much harder problem that depends on pretty much every detail of how these IRF files were produced and what thre required systematic uncertainties are (E.g. I can use a barely matching IRF file to get a rough idea but need to use a file very specific to the run conditions etc. for best possible results).

So I think this is a problem that has to be solved by the analysis chains producing the IRF files and HDU index files for a given analysis goal. It can probably not really be addressed by this standard.

TarekHC commented 3 years ago

I agree with you Max: This is something that definitely needs the input from lower-level (provenance) experts. Certainly should not be discussed here.

@fabiopintore please close the issue if you were requesting a similar use case as the HDU tables described be @maxnoe

adonath commented 3 years ago

As said, at the moment there are two ways to associate EVENT tables with their IRFs:

By being in the same file

By using the HDU-INDEX table to give each EVENT its IRFs by using the OBS_ID of the EVENTS table.

I think this is not sufficient. The issue came up in the context of updating the Gammapy tutorials to the latest alpha configuration of CTA IRFs. Doing this we noticed there is currently no simple header keyword or combination of header keywords, that would show which IRFs have been used for the simulation and thus to distinguish the event files created from different iRFs. One would either rely on naming the file correctly or users adding additional comments to the event header. Otherwise the information is just lost on the event list level, which is an issue I think.

I think only relying on the index table is not a good idea. The index files rely on filenames, which are arbitrary. This means if e.g. the index file you download is corrupted, there is not way to reconstruct the information from the data files themselves. In case of the observation index file, there is currently only one reason it exists at all: performance. All the information is present in the event header, but it is just to costly to open all event files for data selection. So the information is duplicated in an independent observation index table. I think a similar structure should apply to the HDU index. However I think writing hard-coded filenames to IRF files in the event header (such as in the OGIP standard) is a bad idea, because once data is copied to a different files system they might not be valid anymore. However some unique identifier or combinations of keywords will work well.

Also creating an hdu index file for simulations, where sometimes only one set of IRFs is used is probably "overkill". There are other situations, where such a keyword would be useful. E.g. for HAWC analysis all the entries in the HDU index table basically link to the same IRFs, because there are only different passes / versions of all-sky IRFs productions and not IRFs associated to OBS_ID.

Yes, there is the OBS_ID , but for simulated event lists this is just an arbitrary identifier chosen by the user. Also for real data there can be multiple reconstruction versions of the same data for the same OBS_ID. In that sense it's not even unique, only with additional information. It would require at least some combination of OBS_ID and a new keyword such as RECOPASS=pass3 or similar.

adonath commented 3 years ago

Just to add: I think it should be possible that an event file lives independently and provides all the necessary info that is need to analysis it. However the information does not have to be provided in a convenient way or a way that allows for performant data access.

adonath commented 3 years ago

@fabiopintore I think the event files simulated by ctools include this information on the CALDB, but as a custom non-GADF keyword. Is this correct?

maxnoe commented 3 years ago

I think this is not sufficient. The issue came up in the context of updating the Gammapy tutorials to the latest alpha configuration of CTA IRFs. Doing this we noticed there is currently no simple header keyword or combination of header keywords, that would show which IRFs have been used for the simulation and thus to distinguish the event files created from different iRFs.

This is different, only tangentially related issue. This standard at the moment is mostly if not exclusively concerned with the storage and analysis of observed data, not artifical simulations. We have several issues about extending this (See true energy columns in EVENTS for example: https://github.com/open-gamma-ray-astro/gamma-astro-data-formats/issues/30).

adonath commented 3 years ago

@maxnoe Yes, I mostly agree.

But it also applies to observed data in a broader context: Typically different reconstruction of the same data (e.g "pass6/pass7/pass8" in the Fermi context or different cuts for IACT data) are just stored in different folders with different HDU and OBS index files. However I think the information should be present in the event file header as well. One should not rely on the data being stored in a correspondingly named folders. Some unique identifier is needed for this and it goes beyond the current event type keyword.

The second use case on observed data is e.g. HAWC analysis, where there is no huge IRF database, but just time-independent all-sky IRFs to be used for all events. So just storing the event type and pass version is completely sufficient to link the events to the IRFs, that users might have already copied anyway, without needing to create and index file. Maybe @LauraOlivera can comment on this as well.

maxnoe commented 3 years ago

@adonath I agree there is a large room for improvement here, but as @TarekHC mentioned, there is no one-to-one mapping for observed data, so having a header card in the EVENTS specifying a single IRF is not ok.

fabiopintore commented 3 years ago

I read only now all the discussion but I quite agree with @adonath . I understand that the standard format is intended for real data and not for simulated ones, but I think that event simulations still represent a not negligible part of the current gamma-ray activity. Since HDU index files are not provided by default if a user simulates them, at this moment these data would be completely unrecognizable if no info on the progenitor IRFs is given. My 2 cent on the issue is that we would need one (or maybe more than one) identificative keyword(s) in the header, although I agree that the one-to-one mapping cannot be always satisfied.

TarekHC commented 3 years ago

Hi all,

This is clearly a provenance issue, and unfortunately it will be strongly related to details of the lower-level analysis, which may be very different for different instruments. It won't just be the MC file used to simulate events... It will be the specific energy look-up table, the gamma/hadron separation matrix, etc...

I tend to agree with @maxnoe: having a header specifying a single IRF feels wrong to me.

We have the following keywords with the software version used:

Why not just doing exactly the same as done by LAT? Add to the list above an optional keyword PASS_VER, with a string being the code name of the "pass"? Would this be enough @adonath @fabiopintore ?

open-gamma-ray-astro / gamma-astro-data-formats

CALDB and IRF keywords in IRFs #183