Event classes and tagging variables (as hadronness)

contrera commented 8 years ago

Proposal: keep the format general enough so we can compute the flux of a source (or even a diffuse flux) either from the number of events surviving a cut in hadronness or from a fit to a “hadronness” distribution. I argue that the latter method should perform better at high contamination (low separation power), because in principle all events are kept and the full information in the difference between the hadronness pdfs is used. I also think this may be a way forward in cases where our separation power is small, such as electron analyses.

Use case: we want to be able to compute the flux of a source using both methods

Case 1: Using cuts:

Define a set of event classes corresponding to different cuts in hadronness
Compute the effective area for each set of cuts Aeff(cut)
Compute the background for each set of cuts from data N_bckg(cut)
Count the number of events passing the cuts N_evts(cut)
Compute the signal events N_sig(cut) = N_evts(cut)-N_bckg(cut)
Flux(cut) = N_sig(cut) / (Aeff(cut) * time)
Required: class definitions, class for each event, Aeff(cut), N_bckg(cut)
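
The cut-based recipe above can be sketched in a few lines of Python. This is only an illustration: the event class names, counts, effective areas and livetime are invented numbers, and `flux_from_cut` is a hypothetical helper, not part of any science tool.

```python
def flux_from_cut(n_events, n_background, aeff_cm2, livetime_s):
    """Flux(cut) = (N_evts(cut) - N_bckg(cut)) / (Aeff(cut) * time)."""
    n_signal = n_events - n_background
    return n_signal / (aeff_cm2 * livetime_s)


# Hypothetical event classes, one per hadronness cut (all numbers invented).
event_classes = {
    "loose": {"n_events": 1200, "n_background": 1000.0, "aeff_cm2": 4.0e9},
    "tight": {"n_events": 300, "n_background": 150.0, "aeff_cm2": 3.0e9},
}
livetime_s = 3600.0

# One flux estimate per event class, in photons / cm^2 / s.
fluxes = {
    name: flux_from_cut(livetime_s=livetime_s, **counts)
    for name, counts in event_classes.items()
}
```

Each class needs its own Aeff(cut) and N_bckg(cut), which is why the "Required" list asks for them per cut.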

Case 2: Fitting the distributions:

Keep the hadronness of each event
Compute the pdf for the signal hadronness_pdf(sig)
Compute the pdf for the background hadronness_pdf(bckg)
Fit the hadronness distribution of the sample to N_sig * hadronness_pdf(sig) + N_bckg * hadronness_pdf(bckg)
Flux = N_sig / (Aeff * time)
Required: hadronness for each event, hadronness_pdf(sig), hadronness_pdf(bckg)
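
As a rough sketch of Case 2, the fit of N_sig and N_bckg can be done with an unbinned maximum-likelihood fit of a two-component mixture. The example below uses a simple EM iteration and assumes toy pdfs on [0, 1] (signal peaking at low hadronness, background at high hadronness). The pdf shapes, the sample sizes and the function name `fit_signal_count` are all assumptions made for illustration; real pdfs would come from MC and OFF data.

```python
import math
import random


def pdf_sig(h):
    """Toy signal pdf on [0, 1], peaking at low hadronness (assumption)."""
    return 2.0 * (1.0 - h)


def pdf_bkg(h):
    """Toy background pdf on [0, 1], peaking at high hadronness (assumption)."""
    return 2.0 * h


def fit_signal_count(values, n_iter=200):
    """EM fit of the signal fraction f in f*pdf_sig + (1-f)*pdf_bkg.

    Returns (N_sig, N_bckg), which sum to the number of events.
    """
    n = len(values)
    frac = 0.5
    for _ in range(n_iter):
        # Per-event probability of being signal, given the current fraction.
        weights = []
        for h in values:
            s = frac * pdf_sig(h)
            b = (1.0 - frac) * pdf_bkg(h)
            weights.append(s / (s + b))
        frac = sum(weights) / n
    return frac * n, (1.0 - frac) * n


# Toy sample: 300 signal and 700 background events via inverse-CDF sampling.
rng = random.Random(1)
sample = [1.0 - math.sqrt(1.0 - rng.random()) for _ in range(300)]
sample += [math.sqrt(rng.random()) for _ in range(700)]

n_sig, n_bkg = fit_signal_count(sample)
```

No event is discarded here; every event contributes to the fit with a weight set by the two pdfs, which is the point of the proposal.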

Note: A soft hadronness cut will be needed in any case to reduce the DL3 size

cdeil commented 8 years ago

@contrera - Thank you for the proposal!

I think it would be nice to support more event info and those use cases in IACT DL3. For me the way forward would be a pull request proposing changes and then discussing which extra info should be required or optional.

@jknodlseder @TarekHC, maybe @woodmd with the Fermi-LAT perspective - Thoughts?

TarekHC commented 8 years ago

Hi all,

I think there should be a natural way of adding this kind of "lower level" variable. Also, EVT3 should contain events surviving a set of cuts, which could also be flexible on demand (providing a list of possible cuts to the proposals, defined by the use cases we can think of). Adding new sets of cuts to this list should also be natural when different use cases appear (for example, no hadronness cut). They should then also be included in the IRFs as new event classes (additional axes?).

Maybe we could add to the specs some words on the flexibility of the format: how additional columns could easily be added (both within the event list and as IRF axes), giving Fermi's event class as an example.

cdeil commented 6 years ago

Just a note: @MaxNoe made some comments related to this issue in #119 #120 and especially #121 . I closed all of those, so discussion should continue here.

cosimoNigro commented 3 years ago

Hello, could we restart this discussion?

What would be the argument against having an additional column representing the outcome of the event classification (a gammaness or hadronness)?

I think any gamma-ray astronomical instrument discriminates gamma rays from hadrons with some classification algorithm.

I would add it as an optional column. We have plenty of them that are IACT-specific (COREX, HIL_MSW) and that I think are never used (nor read by the science tools). This one would surely be more useful and more general.

Thoughts?

micheledoro commented 3 years ago

A side comment, maybe: I prefer gammaness because the background can consist of more than hadrons (electrons, heavier nuclei).

cosimoNigro commented 3 years ago

Remember we have only 8 characters for FITS keywords though, so it should be GAMMANES or something like that :grin:

jknodlseder commented 3 years ago

I think you are looking for a column name, not a keyword. FITS column names can be more than 8 characters.

TarekHC commented 3 years ago

Hi all,

I think it would indeed make sense to add an optional gammaness column to the event lists, and I fully support it. Although we again run into the classic problem: if we add that column, we would need IRFs also evolving in bins of gammaness (including of course the background model).

This is a problem that will come up again and again: there are variables that, in order to be useful, need to be added both to the event lists and to the IRFs as an additional dimension (event types, gammaness, zenith, azimuth, etc.), and science tools are currently not ready to support this (as far as I know their IRF shapes are hard-coded, but @jknodlseder and @adonath may confirm).

Could we find a general way to define (and especially to implement) such variables? I personally feel this repo needs to describe them, to at least guide a future implementation by science tools.

cosimoNigro commented 3 years ago

> I think you are looking for a column name, not a keyword. FITS column names can be more than 8 characters.

Thanks @jknodlseder, I didn't know.

> there are variables that, in order to be useful, need to be added both to the event lists and to the IRFs as an additional dimension

I think it will take a while to have the IRFs defined also as a function of gammaness. Note that I am proposing to include this as an optional column (just as all those Hillas, IACT-related features that we simply ignore are optional). If some corresponding information has to be included in the IRF, can we just add a header keyword? If I generate an IRF with a theta2 cut, I still store the RA and DEC of all the EVENTS (so the theta of every event) and add a RAD_MAX keyword to the header. Can we do the same for gammaness? We store the gammaness of all the events and then add a GAMMACUT (or something like that) header keyword. What do you think?

maxnoe commented 3 years ago

@cosimoNigro We would also need a table for energy dependent cuts.

cosimoNigro commented 3 years ago

@maxnoe we can start with a fixed hadronness cut (as we do now for theta2 or RAD_MAX), no? At the moment we do not support energy-dependent theta2 cuts either.

maxnoe commented 3 years ago

@cosimoNigro We already have a RAD_MAX table for energy / fov dependent cuts. I don't see a reason why we should restrict ourselves to a global gammaness cut; we know it's not sufficient.

Docs: https://gamma-astro-data-formats.readthedocs.io/en/latest/irfs/point_like/index.html#rad-max-2d

pyirf io: https://github.com/cta-observatory/pyirf/blob/154ad7535ddb34dc9d94cfd4ffc499ed1911f753/pyirf/io/gadf.py#L278
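
To make the difference between a global cut and a cut table concrete, here is a minimal sketch of an energy-dependent gammaness cut lookup. The bin edges, cut values and function names are hypothetical and do not follow the RAD_MAX table layout in the spec or the pyirf API; the sketch only illustrates the idea of one cut per energy bin.

```python
from bisect import bisect_right

# Hypothetical cut table: energy bin edges in TeV and one gammaness cut per bin.
ENERGY_EDGES_TEV = [0.1, 1.0, 10.0, 100.0]
GAMMANESS_CUTS = [0.9, 0.8, 0.7]


def gammaness_cut(energy_tev):
    """Return the cut of the bin containing energy_tev, or None outside the table."""
    if not ENERGY_EDGES_TEV[0] <= energy_tev < ENERGY_EDGES_TEV[-1]:
        return None
    return GAMMANESS_CUTS[bisect_right(ENERGY_EDGES_TEV, energy_tev) - 1]


def passes_cut(energy_tev, gammaness):
    """Keep an event only if its energy is covered and its gammaness passes."""
    cut = gammaness_cut(energy_tev)
    return cut is not None and gammaness >= cut


# Toy events as (energy in TeV, gammaness) pairs.
events = [(0.5, 0.95), (0.5, 0.85), (5.0, 0.85), (50.0, 0.6), (500.0, 0.99)]
selected = [event for event in events if passes_cut(*event)]
```

A single global cut is the special case of a table with one bin.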

TarekHC commented 3 years ago

Sorry, but I don't understand:

Why would you store the gammaness of each event, if all you want is to store a gammaness cut? The only good reason to store the gammaness as an events table column at DL3 level would be to actually allow science tools to use that parameter. And if no IRFs vs gammaness are stored, then we don't need that parameter in the events table. Of course maybe I'm mistaken, can you please tell me a use case in which science tools would need that gammaness per event (not requiring IRFs vs gammaness)? The only one I can think of would be e.g. pulsar analysis, but if a detection is made (no need for IRFs for just a detection) you would want to calculate a spectrum, and then you would of course need IRFs.
If it is just a matter of provenance (knowing the gammaness cut vs energy that was used for that DL3 run), then I agree we would need to store it somewhere (but probably without adding it to the events table). There is a key difference with, for instance, the theta2 cuts, though: science tools do need to know the theta2 cut that was used to generate the IRFs (in order to apply the same cut, since theta2 is a known per-event parameter that is actually used in the analysis). Unless I'm mistaken, you may use complex gammaness cuts vs energy, and as long as you are consistent between all your runs, the MCs used to calculate IRFs and the OFF data, science tools would not need to know those cuts at all.

So, long story short: first decide if we want to add the parameter for a specific high-level analysis or just for provenance. Once we know what we want, then we may find a suitable solution.

adonath commented 3 years ago

I agree with @TarekHC here. I think the "gammaness" will never be a parameter that users can freely cut on for a high level analysis. I think it's rather going into a direction of predefined event classes (as Fermi-LAT does...) optimised for different analysis scenarios. Except if users possibly produce the IRFs themselves as well at some point... As @TarekHC noted the challenge is to deliver the corresponding IRFs for each selection as well.

From a science tools perspective I think event classes are typically not analysed jointly, so it's a choice made at the beginning of the analysis, like the energy range, and that's it. In that sense the concept of "event types" (using Fermi-LAT terminology) is more relevant, as it requires jointly fitting multiple types of events (such as PSF classes etc.), and I think both science tools are ready for this.

In general I agree that the event list can contain (optional) additional information that is not necessarily used for an analysis, but is useful for e.g. debugging and diagnostic plots. The only argument against it would be that the information could be misused by users...

maxnoe commented 3 years ago

Yes, indeed. You do not need to know the gammaness cuts to analyse events that have already been selected for DL3 data. However, for provenance it might be good to define it in the standard.

cosimoNigro commented 3 years ago

Hi @TarekHC, @adonath,

sorry for the late reply.

> can you please tell me a use case in which science tools would need that gammaness per event (not requiring IRFs vs gammaness)?

I have at least 2:

1. In this paper they included the information of the PSF (i.e. theta2 distribution) in the significance estimation. @micheledoro is working on a similar method that also uses the gammaness distribution of the events. You basically require that the source events follow both the theta2 and gammaness distribution of MC events;
2. In this other paper events that are normally discarded by a cosmic-ray rejection criterion (i.e. events above a standard gammaness cut you might have applied) are used to build a "template background". Instead of estimating the background from another region of the camera you would use photons in the same region of the source but with a different gammaness value.

Both are science cases related more to the estimation of the excesses and their significance (no IRF involved), but I think valid to consider. Especially if they can be enabled by adding an optional column.
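
To illustrate how the second use case relies on a per-event gammaness column, here is a minimal sketch of an excess estimate that takes its background from events in the source region that fail the cut. The normalisation factor `alpha` (ratio of gamma-like to hadron-like background events) is assumed to be known from elsewhere, for example from source-free data. That assumption, the function name and all the numbers are illustrative and not taken from the paper.

```python
def template_excess(on_region_gammaness, cut, alpha):
    """Excess = N(gamma-like) - alpha * N(hadron-like), both in the source region.

    alpha is an assumed, externally determined normalisation factor.
    """
    n_gamma_like = sum(1 for g in on_region_gammaness if g >= cut)
    n_hadron_like = len(on_region_gammaness) - n_gamma_like
    background = alpha * n_hadron_like
    return n_gamma_like - background, background


# Toy gammaness values of events reconstructed inside the source region.
on_region = [0.95, 0.9, 0.85, 0.3, 0.2, 0.1, 0.4, 0.5]
excess, background = template_excess(on_region, cut=0.8, alpha=0.2)
```

This only works if the events failing the standard cut are still present in the event list, which is what the optional column (together with a looser preselection) would enable.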

Let me know what you think.

adonath commented 3 years ago

Thanks @cosimoNigro for the references!

The main point in my previous comment was, that I think the selection on "gammaness" will not be part of the "standard analysis" that is typically done by users. By "standard analysis" I mean flux measurements, based on likelihood fitting (1D, 2D, 3D, etc.). Letting users freely cut on the "gammaness" in these analysis scenarios, introduces the challenge of delivering the corresponding IRFs for a given cut. "Logistically" I think this can only be solved with pre-defined cuts (aka "event classes") or giving the user the possibility to produce their own IRFs (which might be a good option for some special analysis use-cases!).

The references you linked I would classify as "non-standard" use cases. I think the first method is basically a way to derive a PSF model from data, which might be interesting for cross-checks of simulated IRFs, or for analyses where no IRFs are available yet. But I'm not sure it's something a "normal" user would do. However, once the user selects on the "gammaness" parameter, the analysis is basically limited to significance estimation and source detection, which I think limits the amount of (astrophysically...) useful analyses a lot.

In general I don't have any objections at all to include a "gammaness" column in the event list format as long as it's optional. But I also think it will probably never be "natively" supported by the science tools, which is also fine I guess. I completely agree, that advanced users should have the freedom to read an event list "by hand" and do any kind of special analysis they want.

TarekHC commented 3 years ago

Thank you @cosimoNigro for those use cases, they are indeed valid cases to be considered for discussion. I remember some of these were discussed in the past when defining the DL3 specs (and I add some comments below). Other use cases discussed were cosmic-ray abundance studies or electron spectrum measurements, which will obviously also need other kinds of datasets, simulations, etc. That does not mean we need to adapt DL3 to also allow such science cases (there was consensus that those analyses would be done internally by the collaboration, and perhaps released in the future with their own data format).

But I essentially agree with @adonath. If we want to add `gammaness` as an optional column, I'm OK with that. But it should be noted that it would go slightly against the current DL3 "philosophy", and I doubt science tools would actually use it any time soon. In the end, it is a matter of definition: DL3 has always been defined as containing "gamma-like events", as a balance between simplicity and available use cases.

I see it as equivalent to the Fermi-LAT data format: with time it evolved to include more event parameters (e.g. additional event types), which significantly improved sensitivity and resolution. But they never allowed people outside the LAT collaboration to play around with those until they were confident enough to include them in their standard data products.

> In this paper they included the information of the PSF (i.e. theta2 distribution) in the significance estimation. @micheledoro is working on a similar method that also uses the gammaness distribution of the events. You basically require that the source events follow both the theta2 and gammaness distribution of MC events;

Yes, indeed, you could use it as an attempt to improve your sensitivity. But only allowing analyzers to detect a source with one method, and not allowing them to go any further than that (no flux estimation?), would probably not be very useful.

If such an approach was indeed explored by CTAO/CTAC, and proven to work, then we would need to add gammaness to both the event lists (to apply the method to detect the source with improved sensitivity) and to the IRFs (to estimate a flux with that same method). But for the moment, as @adonath was saying, such an analysis will probably be non-standard.

> In this other paper events that are normally discarded by a cosmic-ray rejection criterion (i.e. events above a standard gammaness cut you might have applied) are used to build a "template background". Instead of estimating the background from another region of the camera you would use photons in the same region of the source but with a different gammaness value.

Yes, I remember this one, and it is a very good example of what you could do (because you don't need IRFs vs gammaness). I'm guessing CTAO or CTAC would apply this method internally (with all the data available) to generate background models, and store these as DL3 (background models can indeed be considered IRFs themselves). It could perhaps make sense to allow people outside CTA to actually be the ones in charge of implementing such methods... But that could in the end be a double-edged sword...

maxnoe commented 3 years ago

> If we add that column, we would need IRFs also evolving in bins of gammaness (including of course the background model).

I think this is not true. Just adding this column in a well-defined format as optional column would already help many people and tools to start building things around it. We don't have to immediately also specify it as a possible IRF axis.

So let's just add this column to the list of optional columns for EVENTS with a sane definition.

E.g. like this:

* ``GAMMANESS`` type: float
    * Classification score of a signal / background separation. SHOULD be between 0 and 1. Higher values MUST indicate
    larger confidence that this shower was induced by a gamma ray.
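
A reader or writer of such a column could check the "SHOULD be between 0 and 1" part of this definition with something like the sketch below. The function name `check_gammaness` is hypothetical, and treating non-finite values as violations is an extra assumption that is not part of the proposed wording.

```python
import math


def check_gammaness(values):
    """Return the indices of entries that are not finite or lie outside [0, 1]."""
    return [
        index
        for index, g in enumerate(values)
        if not (math.isfinite(g) and 0.0 <= g <= 1.0)
    ]


# Toy column content: the last two entries violate the recommendation.
column = [0.0, 0.42, 1.0, 1.3, float("nan")]
bad_rows = check_gammaness(column)
```

Since the range is only a SHOULD, a tool would presumably warn about such rows rather than reject the file.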

We have stuff like Mean Scaled Hillas Width in this table. I think gammaness is much more helpful and much more universal than these columns.

So we should either add gammaness or think about removing all those optional columns. In any case, no science tool should complain about additional columns that users add.

But having definitions for the most common, high-level information like the g/h score is good I think.

open-gamma-ray-astro / gamma-astro-data-formats

Event classes and tagging variables (as hadronness) #34