Open fabiopintore opened 3 years ago
@fabiopintore Hi, thank you for making this proposal.
Just to note: there is a mechanism in GADF to associate EVENTS to IRFs: the HDU index tables documented here: https://gamma-astro-data-formats.readthedocs.io/en/latest/data_storage/hdu_index/index.html
Does it mean that for GADF the HDU tables are mandatory? If yes, one might need to change the description of the HDU table. On the link that you provide, it is written "The HDU index table can be used to locate HDU"... It is not 'can be used', but rather 'IS USED'... The wording should be made consistent, and beyond that a decision is needed on whether these index tables are mandatory...
Another open question: isn't the name 'CALDB' generally used for a directory where data are stored? It is not used to store metadata describing how/when/by whom the IRFs are made...
Does it mean that for GADF the HDU tables are mandatory?
No, they are not mandatory in the standard. They are only needed if it is not clear which IRFs are used for which observations. The simplest way to associate EVENTS with their respective IRFs is just to have them in the same file.
However, analysis software may impose additional requirements to work. I think for doing a gammapy analysis, at the moment the HDU index and OBS index tables are required, and even some of the columns that are optional in the standard (because they just duplicate information that is mandatory in the header of the corresponding HDU) are required by gammapy. This is however specific to gammapy, and there are several open issues to make gammapy require fewer of the optional columns (e.g. in the case above by just reading the values from the HDU header instead of the index / obs table).
Another open question: isn't the name 'CALDB' generally used for a directory where data are stored? It is not used to store metadata describing how/when/by whom the IRFs are made...
Yes, the proposal here is very vague on what the new IRF and CALDB header keys should actually contain. It is not clear to me how a single IRF header can specify the IRF, since we have multiple IRF component HDUs that all need to be found somehow if they are not just in the same file.
The proposal about CALDB is completely unclear to me.
Hi all,
Thank you @fabiopintore for opening the issue! Associating events with IRFs is a long-standing issue that we have discussed a lot. My 2 cents on the topic!
I'm not sure we want to add keywords to the EVENTS to associate them to IRFs. The reason why I believe this is not a good approach is that there could perfectly well be several IRF components (of the same kind!) describing the same list of events. Example: For a given run, I would expect we could store both full-enclosure and point-like IRFs in the same DL3 file. Point-like IRFs will be better suited for a point-source analysis, while any 3D/extended analysis would need to use the full-enclosure ones.
For this reason, I feel it would make more sense to associate each IRF component to the specific event list it was "generated for". Until now, we just assume that IRF components within the same file as an EVENTS table are associated to it, but as we know, there are many cases in which we want to separate EVENTS and IRF components.
For this reason, even if I fully agree this is an issue that needs to be addressed, I disagree with the proposed solution.
Just to note: there is a mechanism in GADF to associate EVENTS to IRFs: the HDU index tables documented here: https://gamma-astro-data-formats.readthedocs.io/en/latest/data_storage/hdu_index/index.html
As @maxnoe correctly pointed out, for describing each IRF component we should use the HDU index definitions we already have. Although, unless I'm mistaken, HDU index describes "the format" of each IRF component within a file, but does not associate specific events to IRFs.
Although, unless I'm mistaken, HDU index describes "the format" of each IRF component within a file, but does not associate specific events to IRFs.
It associates an EVENTS table to its IRFs via the OBS_ID column.
E.g. for the FACT data in the open crab sample:
In [4]: t[t['OBS_ID'] == 20131105212]
Out[4]:
<Table length=4>
OBS_ID      HDU_TYPE HDU_CLASS FILE_DIR FILE_NAME             HDU_NAME
int64       bytes6   bytes8    bytes2   bytes21               bytes17
----------- -------- --------- -------- --------------------- -----------------
20131105212 events   events    ./       20131105_212_dl3.fits EVENTS
20131105212 gti      gti       ./       20131105_212_dl3.fits GTI
20131105212 aeff     aeff_2d   ./       fact_irf.fits         EFFECTIVE AREA
20131105212 edisp    edisp_2d  ./       fact_irf.fits         ENERGY DISPERSION
It associates an EVENTS table to its IRFs via the OBS_ID column.
I see. I feel current OBS_ID specifications are not clear for the case of an IRF, especially if this is the event <-> IRF association we want.
For instance: It is expected many instruments will generate separated EVENTS and IRFs as "productions" (and not run-wise) as done by the current generation of IACTs, LAT, and for CTA at the moment (e.g. CTA's first data challenge). Those IRFs will not have a given observation run associated to them, and therefore won't be compliant with this format.
@maxnoe is the case I describe above already covered?
I see. I feel current OBS_ID specifications are not clear for the case of an IRF, especially if this is the event <-> IRF association we want.
The observation ID is a property of the EVENTS table in the current version of the standard. The HDU index table is generated for a specific analysis, and an IRF row with an OBS_ID means "use this IRF for this OBS_ID", not that the OBS_ID is a property of that specific IRF.
For instance: It is expected many instruments will generate separated EVENTS and IRFs as "productions" (and not run-wise) as done by the current generation of IACTs, LAT, and for CTA at the moment (e.g. CTA's first data challenge).
Sure, as does the FACT example: we actually use the same IRF for all runs in the open crab sample. So every OBS_ID is associated with the same IRF in the IRF file. I should have shown more than the one OBS_ID in the file:
OBS_ID      HDU_TYPE HDU_CLASS FILE_DIR FILE_NAME             HDU_NAME
int64       bytes6   bytes8    bytes2   bytes21               bytes17
----------- -------- --------- -------- --------------------- -----------------
20131103093 aeff     aeff_2d   ./       fact_irf.fits         EFFECTIVE AREA
20131103093 edisp    edisp_2d  ./       fact_irf.fits         ENERGY DISPERSION
20131103093 events   events    ./       20131103_93_dl3.fits  EVENTS
20131103093 gti      gti       ./       20131103_93_dl3.fits  GTI
20131103103 aeff     aeff_2d   ./       fact_irf.fits         EFFECTIVE AREA
20131103103 edisp    edisp_2d  ./       fact_irf.fits         ENERGY DISPERSION
20131103103 events   events    ./       20131103_103_dl3.fits EVENTS
20131103103 gti      gti       ./       20131103_103_dl3.fits GTI
20131103104 aeff     aeff_2d   ./       fact_irf.fits         EFFECTIVE AREA
20131103104 edisp    edisp_2d  ./       fact_irf.fits         ENERGY DISPERSION
20131103104 events   events    ./       20131103_104_dl3.fits EVENTS
20131103104 gti      gti       ./       20131103_104_dl3.fits GTI
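For illustration, the lookup that such an HDU index enables can be sketched in plain Python. The rows are copied from the excerpt above (first two observations only); the helper name `irfs_for_obs` is hypothetical, and a real analysis would read the index from its FITS file rather than hard-code it.

```python
# Rows copied from the HDU index excerpt:
# (OBS_ID, HDU_TYPE, HDU_CLASS, FILE_DIR, FILE_NAME, HDU_NAME)
HDU_INDEX = [
    (20131103093, "aeff", "aeff_2d", "./", "fact_irf.fits", "EFFECTIVE AREA"),
    (20131103093, "edisp", "edisp_2d", "./", "fact_irf.fits", "ENERGY DISPERSION"),
    (20131103093, "events", "events", "./", "20131103_93_dl3.fits", "EVENTS"),
    (20131103093, "gti", "gti", "./", "20131103_93_dl3.fits", "GTI"),
    (20131103103, "aeff", "aeff_2d", "./", "fact_irf.fits", "EFFECTIVE AREA"),
    (20131103103, "edisp", "edisp_2d", "./", "fact_irf.fits", "ENERGY DISPERSION"),
    (20131103103, "events", "events", "./", "20131103_103_dl3.fits", "EVENTS"),
    (20131103103, "gti", "gti", "./", "20131103_103_dl3.fits", "GTI"),
]


def irfs_for_obs(index, obs_id):
    """Return {HDU_TYPE: (path, HDU_NAME)} for the IRF rows of one OBS_ID.

    Hypothetical helper: every row that is neither 'events' nor 'gti'
    is treated as an IRF component.
    """
    non_irf = {"events", "gti"}
    return {
        hdu_type: (file_dir + file_name, hdu_name)
        for oid, hdu_type, _hdu_class, file_dir, file_name, hdu_name in index
        if oid == obs_id and hdu_type not in non_irf
    }


irfs_a = irfs_for_obs(HDU_INDEX, 20131103093)
irfs_b = irfs_for_obs(HDU_INDEX, 20131103103)
```

Both observations resolve to the same HDUs in `fact_irf.fits`, which is the "one IRF production for many runs" case expressed purely through the index.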
I feel current OBS_ID specifications are not clear for the case of an IRF,
There is currently no OBS_ID field for IRFs, there is an issue about this: https://github.com/open-gamma-ray-astro/gamma-astro-data-formats/issues/132
As said, at the moment there are two ways to associate EVENT tables with their IRFs:
- By being in the same file
- By using the HDU-INDEX table to give each EVENT its IRFs by using the OBS_ID of the EVENTS table.
The second allows every use case I can imagine, since it allows the association of a completely arbitrary IRF HDU in a specific file to each individual EVENTS table. Maybe I'm missing something, but that should cover everything.
The HDU INDEX table can be generated for each analysis (e.g. specifying that point-like IRFs should be used if both are available from the production of the IRFs)
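A minimal sketch of what "generated for each analysis" could look like, assuming the available IRF HDUs are already known as a list of dicts. The dict layout, the function name `build_hdu_index` and the HDU names are all hypothetical; the containment labels simply mirror the point-like / full-enclosure distinction discussed in this thread.

```python
def build_hdu_index(obs_ids, irf_hdus, prefer="POINT-LIKE"):
    """Assign one IRF HDU per (OBS_ID, HDU_TYPE), preferring one containment.

    irf_hdus is a list of dicts with the (assumed) keys
    'hdu_type', 'containment', 'file_name' and 'hdu_name'.
    """
    chosen = {}
    for hdu in irf_hdus:
        current = chosen.get(hdu["hdu_type"])
        # Keep the first candidate; replace it only if this HDU has the
        # preferred containment and the current choice does not.
        if current is None or (
            hdu["containment"] == prefer and current["containment"] != prefer
        ):
            chosen[hdu["hdu_type"]] = hdu
    return [
        {
            "OBS_ID": obs_id,
            "HDU_TYPE": hdu_type,
            "FILE_NAME": hdu["file_name"],
            "HDU_NAME": hdu["hdu_name"],
        }
        for obs_id in obs_ids
        for hdu_type, hdu in sorted(chosen.items())
    ]


# Made-up IRF production offering both kinds of effective area.
available = [
    {"hdu_type": "aeff", "containment": "FULL-ENCLOSURE",
     "file_name": "irf.fits", "hdu_name": "AEFF_FULL"},
    {"hdu_type": "aeff", "containment": "POINT-LIKE",
     "file_name": "irf.fits", "hdu_name": "AEFF_POINT"},
]

point_index = build_hdu_index([1, 2], available)
full_index = build_hdu_index([1, 2], available, prefer="FULL-ENCLOSURE")
```

The same event lists and the same IRF production yield two different index tables, one per analysis goal, without touching the data files.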
Argh, ok. I was mixing up in my head the HDUCLASn keywords with the HDU index files that you were referring to. It's been a while since I took a look at this, sorry...
@maxnoe HDU index tables, as you say, are generated per analysis (do all science tools use them?). What is/could be missing is metadata within the IRF components describing which events they may be applied to (which is what I was referring to above).
If we don't provide such metadata, there will be a high risk of people creating their own HDU index tables with wrong IRF <-> events associations.
What is/could be missing is metadata within the IRF components describing which events they may be applied to
Yes, but this is a much harder problem that depends on pretty much every detail of how these IRF files were produced and what the required systematic uncertainties are (e.g. I can use a barely matching IRF file to get a rough idea, but need to use a file very specific to the run conditions etc. for best possible results).
So I think this is a problem that has to be solved by the analysis chains producing the IRF files and HDU index files for a given analysis goal. It can probably not really be addressed by this standard.
I agree with you Max: This is something that definitely needs the input from lower-level (provenance) experts. Certainly should not be discussed here.
@fabiopintore please close the issue if you were requesting a similar use case as the HDU tables described by @maxnoe
As said, at the moment there are two ways to associate EVENT tables with their IRFs:
- By being in the same file
- By using the HDU-INDEX table to give each EVENT its IRFs by using the OBS_ID of the EVENTS table.
I think this is not sufficient. The issue came up in the context of updating the Gammapy tutorials to the latest alpha configuration of CTA IRFs. Doing this we noticed there is currently no simple header keyword or combination of header keywords that would show which IRFs have been used for the simulation and thus distinguish the event files created from different IRFs. One would either rely on naming the file correctly or on users adding additional comments to the event header. Otherwise the information is just lost on the event list level, which is an issue I think.
I think only relying on the index table is not a good idea. The index files rely on filenames, which are arbitrary. This means if e.g. the index file you download is corrupted, there is no way to reconstruct the information from the data files themselves. In case of the observation index file, there is currently only one reason it exists at all: performance. All the information is present in the event header, but it is just too costly to open all event files for data selection. So the information is duplicated in an independent observation index table. I think a similar structure should apply to the HDU index. However I think writing hard-coded filenames of IRF files into the event header (such as in the OGIP standard) is a bad idea, because once data is copied to a different file system they might not be valid anymore. Some unique identifier or combination of keywords, however, would work well.
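To illustrate the point about the observation index being a pure duplicate of header information, here is a rough sketch of rebuilding such an index from event headers. The headers are stand-in dicts with made-up values, the function name is hypothetical, and the column subset is chosen only for illustration; a real tool would read the EVENTS headers from the FITS files.

```python
# Subset of observation index columns that are also EVENTS header keywords.
INDEX_COLUMNS = ("OBS_ID", "RA_PNT", "DEC_PNT", "TSTART", "TSTOP")


def rebuild_obs_index(headers):
    """Rebuild observation index rows from {file_name: header} (dicts here)."""
    rows = []
    for file_name, header in headers.items():
        # Copy only the index columns; everything else stays in the file.
        row = {column: header[column] for column in INDEX_COLUMNS}
        row["FILE_NAME"] = file_name
        rows.append(row)
    return sorted(rows, key=lambda row: row["OBS_ID"])


# Made-up headers standing in for two event files.
example_headers = {
    "run_b.fits": {"OBS_ID": 2, "RA_PNT": 83.6, "DEC_PNT": 22.0,
                   "TSTART": 100.0, "TSTOP": 200.0, "OBJECT": "Crab"},
    "run_a.fits": {"OBS_ID": 1, "RA_PNT": 83.6, "DEC_PNT": 22.0,
                   "TSTART": 0.0, "TSTOP": 100.0, "OBJECT": "Crab"},
}

obs_index = rebuild_obs_index(example_headers)
```

Nothing comparable is possible for the HDU index today, because the event header carries no keyword from which the matching IRFs could be derived.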
Also, creating an HDU index file for simulations, where sometimes only one set of IRFs is used, is probably "overkill". There are other situations where such a keyword would be useful. E.g. for HAWC analysis all the entries in the HDU index table basically link to the same IRFs, because there are only different passes / versions of all-sky IRF productions and not IRFs associated to OBS_ID.
Yes, there is the OBS_ID, but for simulated event lists this is just an arbitrary identifier chosen by the user. Also for real data there can be multiple reconstruction versions of the same data for the same OBS_ID. In that sense it's not even unique on its own, only with additional information. It would require at least some combination of OBS_ID and a new keyword such as RECOPASS=pass3 or similar.
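As a sketch of that combination: the snippet below builds an identifier from OBS_ID plus a pass keyword. RECOPASS is only the keyword name floated in this comment, not part of the standard, and the header values and the `event_list_key` helper are made up for illustration.

```python
def event_list_key(header):
    """Composite identifier: OBS_ID plus the (hypothetical) RECOPASS keyword."""
    return (header["OBS_ID"], header.get("RECOPASS", "unknown"))


# Two reconstructions of the same observation (made-up header contents).
example_event_headers = [
    {"OBS_ID": 20131105212, "RECOPASS": "pass2"},
    {"OBS_ID": 20131105212, "RECOPASS": "pass3"},
]

obs_ids = {header["OBS_ID"] for header in example_event_headers}
keys = {event_list_key(header) for header in example_event_headers}
```

With OBS_ID alone the two event lists collapse to a single identifier, while the composite key keeps them apart.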
Just to add: I think it should be possible for an event file to live independently and provide all the information that is needed to analyze it. However, the information does not have to be provided in a convenient way or in a way that allows for performant data access.
@fabiopintore I think the event files simulated by ctools include this information on the CALDB, but as a custom non-GADF keyword. Is this correct?
I think this is not sufficient. The issue came up in the context of updating the Gammapy tutorials to the latest alpha configuration of CTA IRFs. Doing this we noticed there is currently no simple header keyword or combination of header keywords that would show which IRFs have been used for the simulation and thus distinguish the event files created from different IRFs.
This is a different, only tangentially related issue. This standard at the moment is mostly if not exclusively concerned with the storage and analysis of observed data, not artificial simulations.
We have several issues about extending this (see true energy columns in EVENTS for example: https://github.com/open-gamma-ray-astro/gamma-astro-data-formats/issues/30).
@maxnoe Yes, I mostly agree.
But it also applies to observed data in a broader context: Typically different reconstructions of the same data (e.g. "pass6/pass7/pass8" in the Fermi context or different cuts for IACT data) are just stored in different folders with different HDU and OBS index files. However I think the information should be present in the event file header as well. One should not rely on the data being stored in correspondingly named folders. Some unique identifier is needed for this and it goes beyond the current event type keyword.
The second use case on observed data is e.g. HAWC analysis, where there is no huge IRF database, but just time-independent all-sky IRFs to be used for all events. So just storing the event type and pass version is completely sufficient to link the events to the IRFs, which users might have already copied anyway, without needing to create an index file. Maybe @LauraOlivera can comment on this as well.
@adonath I agree there is a large room for improvement here, but as @TarekHC mentioned, there is no one-to-one mapping for observed data, so having a header card in the EVENTS specifying a single IRF is not ok.
I have only now read the whole discussion, but I quite agree with @adonath. I understand that the standard format is intended for real data and not for simulated ones, but I think that event simulations still represent a non-negligible part of the current gamma-ray activity. Since HDU index files are not provided by default when a user simulates events, at this moment these data would be completely unrecognizable if no info on the progenitor IRFs is given. My 2 cents on the issue is that we would need one (or maybe more than one) identifying keyword(s) in the header, although I agree that the one-to-one mapping cannot always be satisfied.
Hi all,
This is clearly a provenance issue, and unfortunately it will be strongly related to details of the lower-level analysis, which may be very different for different instruments. It won't just be the MC file used to simulate events... It will be the specific energy look-up table, the gamma/hadron separation matrix, etc...
I tend to agree with @maxnoe: having a header specifying a single IRF feels wrong to me.
We have the following keywords with the software version used:
Why not just do exactly the same as done by LAT? Add to the list above an optional keyword PASS_VER, with a string being the code name of the "pass"? Would this be enough @adonath @fabiopintore?
The GADF format of the EVENTS table does not take into account any keyword (neither mandatory nor optional) that relates the events to a given IRF (and corresponding calibrations). Here, it is suggested to add the keywords IRF and CALDB: the former can provide the name of the IRF (or its MC production), while the latter can give the configuration status (for example, time of the IRF optimisation).