legend-exp / legend-data-format-specs

LEGEND Data Format Specifications
https://legend-exp.github.io/legend-data-format-specs/

Simulation Format Naming Convention #1

Open lmh91 opened 3 years ago

lmh91 commented 3 years ago

Follow-up discussion (https://indico.legend-exp.org/event/477/) on the naming convention for the simulation file format, including a detailed definition of what is saved at the different stages and how.

lmh91 commented 3 years ago

I summarized my "proposal" from the call. I have already modified some names. The column names are just something to start with.

3 Stages of Simulation Data

1. Stage: mcraw

Output of Monte Carlo (MC) - Input to SSD / FieldGen+SigGen

Here, each row contains the information about a single hit (energy deposition).

| Group/Column Name | Data Format (each row) | Dimension | Description | Example |
|---|---|---|---|---|
| mc_evt_id | integer | 1 | MC Event ID | 1 |
| det_id | integer | 1 | MC Detector ID of the detector in which the hit (energy deposition) occurs | 1 |
| pos | 3 element vector | length | Position of the hit | [0.01, 0.02, 0.014] * u"m" |
| edep | float | energy | Deposited energy at the hit position | 1460.0 * u"keV" |
| thit | float | time | Timestamp of the hit | 124.0 * u"s" |

Hits are clustered in time with a window of about 1 ns: all energy depositions within 1 ns will be drifted together. The mean of the individual timestamps in a cluster will be the timestamp of the germanium-detector-event in the next stage of the simulation.
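A minimal sketch of this time clustering, assuming hits arrive as (thit, edep) tuples with thit in nanoseconds. The function name, the tuple layout, and the choice to measure the window from the first hit of each cluster are illustrative assumptions, not part of the proposal.

```python
from statistics import mean


def cluster_hits_by_time(hits, window_ns=1.0):
    """Group hits lying within `window_ns` of the first hit of a cluster.

    `hits` is a list of (thit_ns, edep_keV) tuples (hypothetical layout).
    Returns a list of dicts holding the clustered hits and their mean time.
    """
    clusters = []
    for thit, edep in sorted(hits):
        # Compare against the earliest hit of the most recent cluster.
        if clusters and thit - clusters[-1]["hits"][0][0] <= window_ns:
            clusters[-1]["hits"].append((thit, edep))
        else:
            clusters.append({"hits": [(thit, edep)]})
    for cluster in clusters:
        # The mean of the clustered timestamps becomes the event timestamp.
        cluster["thit"] = mean(t for t, _ in cluster["hits"])
    return clusters


hits = [(0.0, 500.0), (0.4, 960.0), (250.0, 1460.0)]
clusters = cluster_hits_by_time(hits)
```

In this example the first two hits end up in one cluster and the third hit forms its own.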

2. Stage: mcpss

Output of SSD / FieldGen+SigGen - Input to Electronics+DAQ

| Group/Column Name | Data Format (each row) | Dimension | Description | Example |
|---|---|---|---|---|
| det_evt_id | integer | 1 | Detector Event ID of the germanium-detector-event. This event can be built up from multiple MC hits | 1 |
| det_id | integer | 1 | Detector ID of the detector | 1 |
| chn_id | integer | 1 | Channel ID to which the waveform belongs | 1 |
| pos | vector of 3-element-vectors | length | Stores all hit positions from which the waveform was generated | [[0.01, 0.02, 0.014] * u"m"] |
| edep | vector of floats | energy | Stores all hit energies from which the waveform was generated | [1460.0 * u"keV"] |
| thit | float | time | Timestamp of the germanium-detector-event. This might be the mean of the MC timestamps of the individual hits from which this waveform was generated. This will be used for event building in the next stage. | 124.0 * u"s" |
| waveform | LegendWaveformFormat | time and charge | Generated waveform for the respective channel and hits. These waveforms have arbitrary lengths (due to different drift times). In my opinion they should be in units of charge (deposited energy / ionization energy of germanium) | 0:4:20000 * u"ns" and rand(5000) * u"C" |
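The charge unit suggested for the waveform (deposited energy divided by the ionization energy of germanium) can be sketched as follows. The pair-creation energy of 2.96 eV is an assumed, commonly quoted value for germanium near 77 K, and the function name is hypothetical.

```python
ELEMENTARY_CHARGE_C = 1.602176634e-19  # exact SI value of the elementary charge
GE_PAIR_ENERGY_EV = 2.96  # ASSUMPTION: mean energy per electron-hole pair in Ge


def edep_to_charge(edep_kev, pair_energy_ev=GE_PAIR_ENERGY_EV):
    """Convert a deposited energy in keV into a collected charge in coulombs."""
    n_pairs = edep_kev * 1e3 / pair_energy_ev  # number of electron-hole pairs
    return n_pairs * ELEMENTARY_CHARGE_C


# The 1460 keV example from the table gives a charge of about 7.9e-14 C.
charge = edep_to_charge(1460.0)
```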

3. Stage: t1pss (mimics tier1 v01.00 real raw data format)

Well, this is already fixed. We just could add some additional information like true_energy.

sagitta42 commented 3 years ago
  1. mcraw: The group/column names of the MC input to SSD seem to be: evtno, detno, thit, edep, pos. Not sure how important it is to have these exact names, but that's how SSD seems to be accessing them.

  2. mcpss: I would suggest two groups on the top level: raw (where the simulation results are stored) and mctruth.

  3. t1pss: Similarly, three groups: raw (mimics tier1), mctruth (propagated from mcpss) and etruth (electronics and DAQ simulation parameters).

jasondet commented 3 years ago

Hi Mariia (@sagitta42), all,

I'd like to suggest an alternative to the proposed naming convention. It would be nice I think if there were parity between the naming schemes for MC and data. In data we do not have "tier1" and people will wonder whether that means daq (the first form of recorded detector data), raw (the first type of data for analysis), or dsp (if raw is considered "tier 0") until they figure out it's meant to correspond to raw. Then they might be confused again because "mcraw" is something else entirely.

In the data we have the tiers

[daq] raw dsp hit evt

where daq is in brackets because we anticipate deleting it once the raw tier is generated. So if we have MC generate a file that is meant to be identical in structure to our "raw" data from the detectors, I think that is what we should call "mcraw." Then the dsp file generated from that would be "mcdsp", and so on.

In that case, the file you proposed to call "mcraw" would need a new name. What it contains is "stepping information" from the simulations. I think there is value in keeping the field widths uniform in the file names, so I propose to use the 5-character label "mcstp" for that. Note that "mcpss" conforms to this scheme already.

So I suggest to use the following names for sim tiers:

mcstp mcpss mcraw mcdsp mchit mcevt

where G4 is used to generate mcstp, PSS is used to generate mcpss, electronics + daq sims generate mcraw, and pygama generates the subsequent tiers. Further, I think we should structure mcpss so that tables from it can be joined row-by-row with tables from mcraw to mchit. In that case, the fields det_id and chn_id are not necessary because the former will be in the hdf5 group name, and the latter will be in the channel map.

To build events the mcpss step will have to generate a time coincidence map as well (which may be further modified by the daq sim). I hope to get the version for data prototyped soon so that you can see what it's meant to look like. In the meantime, refer to the data handling doc for details: https://docs.legend-exp.org/index.php/f/112233 (click on L200DataHandling_v3.pdf). Perhaps we should update this doc with these choices for sims once they are agreed upon.

Note that the fact that there are two "tiers" of MC data before one gets to "tier 1" for the data shows an example of why it is wise to use names rather than numbers to refer to tiers (thanks @oschulz!).

Best, Jason

lmh91 commented 3 years ago

Hi Jason (@jasondet),

I agree on your comment regarding possible confusion between mcraw and raw. But why do we actually need new names for raw and further produced stages of PyGama? I would suggest to stick with the 3-letter names (In the call also 5 was suggested but I do not remember the reason):

stp
pss
[daq]  # skipped by simulation (if really nothing is done here but data format conversion)
raw
dsp
hit
evt

In my opinion, PyGama should not know whether the data it is analyzing is simulated or real. I would rather put this information in the names of the directories / files (so PyGama could still know).

Regarding the structure of pss and hit: I don't think a row-by-row join is possible, because between pss and raw there might be, as you also mentioned, some event building (e.g. possible coincidence hits), which would destroy this 1-to-1 mapping. The event building depends on the DAQ, and SSD / SigGen do not know the DAQ when they produce pss. It depends on how we agree the event building is going to be handled. I would add IDs for each step to be able to trace back: stp_evt_id, pss_evt_id, raw_evt_id (just dummy names for now). So, in contrast to data, the raw group would have an additional dataset/column holding the corresponding pss_evt_id's and maybe also another dataset/column holding the corresponding stp_evt_id's.

My understanding of event building: SSD (SigGen) takes all hits from stp which are within, e.g., ~1 ns and drifts those hits together, producing one waveform (per channel). These waveform(s) get one timestamp, e.g. the mean of the timestamps of all MC hits from stp. It may happen that two such pss timestamps are within a certain time interval, e.g. 200 µs (several decay times of the electronics). Such pss waveforms would be merged (superposition before electronics) to simulate pile-up or coincidence events. Thus, 2 waveforms (rows) of pss could result in one waveform (row) in [daq] / raw.
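A rough sketch of how the proposed trace-back IDs could survive such a merge. The dict layout, the function name, and the 200 microsecond window measured from the first row of each group are assumptions for illustration. The waveform superposition itself is left out.

```python
def build_raw_rows(pss_rows, window_ns=200_000.0):
    """Merge pss rows whose timestamps lie within `window_ns` of a group's first row.

    Each pss row is a dict with "pss_evt_id" and "thit" (ns); both the layout
    and the window are hypothetical. Every raw row keeps the list of pss IDs
    it was built from, so the mapping can be traced back.
    """
    raw_rows = []
    for row in sorted(pss_rows, key=lambda r: r["thit"]):
        if raw_rows and row["thit"] - raw_rows[-1]["thit"] <= window_ns:
            # Pile-up: this pss waveform is folded into the previous raw row.
            raw_rows[-1]["pss_evt_ids"].append(row["pss_evt_id"])
        else:
            raw_rows.append(
                {
                    "raw_evt_id": len(raw_rows),
                    "thit": row["thit"],
                    "pss_evt_ids": [row["pss_evt_id"]],
                }
            )
    return raw_rows


pss_rows = [
    {"pss_evt_id": 0, "thit": 0.0},
    {"pss_evt_id": 1, "thit": 150_000.0},
    {"pss_evt_id": 2, "thit": 5_000_000.0},
]
raw_rows = build_raw_rows(pss_rows)
```

Here the first two pss rows collapse into one raw row, which is the case that breaks a 1-to-1 mapping between pss and raw.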

The fields det_id and chn_id can indeed be dropped if they are encoded in the file names.

sagitta42 commented 3 years ago

Hi @jasondet,

We were told to choose the same length for parsing reasons, corresponding to the 5-letter names "tier1", "tier2" and so on.

As for the group names, they will mimic those in data, as @lmh91 pointed out.

For example, mcraw is a simple h5 with energy, hit time and position etc. mcpss is a not-yet-tier1 file which is not supposed to be read by pygama, so we may name the fields as we like. So far we planned to name the groups raw (not yet as the one of tier1, but soon to be; it contains only channel, ievt and waveform) and mctruth. This stage is intermediate, and will not be saved in production.

After data format conversion (and in principle more than that; we have been talking about adding more electronics simulation), we obtain t1pss, which completely mimics tier1 and is, as @lmh91 said, indistinguishable from data to pygama. It contains the groups raw (identical to data), mctruth, and electruth for the DAQ and electronics parameters used in the simulation.

After this point, there is no more simulation to be done. t1pss is fed to pygama, and the next tiers are obtained.

As for mcraw, I agree that it can be confused with raw in data, i.e. a complete tier1, while in our notation it stands for raw MC events. I agree that we should change this name, we will think about alternatives.

jasondet commented 3 years ago

Lukas (@lmh91) -- yes, I would be fine with dropping "mc" from all the names, as long as it's clear elsewhere in a file key / path that the file contains mc, not data. And yes, you are correct, I had missed that in general there is no 1-to-1 mapping between pss and raw. I think putting pss_event_id in the raw file is a good suggestion that should be easy to implement. As for how event building will be handled in general -- for tiers prior to "evt" we will have a "time coincidence map" (tcm) that basically keeps lists of row numbers that correspond to raw/dsp/hit data from the same event. For the evt tier the data will already appear in a built structure that can be joined with the tcm for linking to the previous-tier data. I hope to have a prototype of the tcm ready soon so that people can see better how it will work.
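To illustrate the idea of a time coincidence map, here is a hypothetical sketch that groups (channel, row) pairs whose timestamps are close together. It is not the prototype mentioned above: the input layout, the gap-based grouping rule, and the 4000 ns window are all assumptions.

```python
def build_tcm(channel_times, window_ns=4000.0):
    """Group table rows from all channels into events by time coincidence.

    `channel_times` maps a channel name to the list of timestamps (ns) of its
    table rows. Returns one list of (channel, row) pairs per event.
    """
    entries = sorted(
        (t, channel, row)
        for channel, times in channel_times.items()
        for row, t in enumerate(times)
    )
    tcm = []
    last_t = None
    for t, channel, row in entries:
        # A gap larger than the window since the previous row starts a new event.
        if last_t is None or t - last_t > window_ns:
            tcm.append([])
        tcm[-1].append((channel, row))
        last_t = t
    return tcm


channel_times = {"ch0": [100.0, 9_000_000.0], "ch1": [150.0]}
tcm = build_tcm(channel_times)
```

Each entry of the result lists the rows that belong to one event, so per-channel raw/dsp/hit tables can be looked up by row number.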

Mariia (@sagitta42) -- I'm confused by your statement "We were told to choose the same length for parsing reasons, corresponding to the 5-letter names 'tier1', 'tier2' and so on." We do not use the names "tier1", "tier2" etc in LEGEND. We use the 3-letter names daq, raw, dsp, hit, and evt. See the Data Handling doc I linked in my last post or my Analysis Overview talk at the last CM for details.

I'm curious what data will be in the mctruth and electruth tables. We had envisioned having only one major group per file but I'm not opposed to adding more tables for mc. However I would have thought that parameters used in simulations should be in the simulation code / config files or in the database, and one wouldn't need to write them out repeatedly for each row in the output simulation. Maybe I misunderstood the proposal. In my mind the data in the stp and pss tiers -is- the mc truth.

oschulz commented 3 years ago

I fully agree with Lukas - we don't want to handle different tier names in the processing pipeline for physical and simulated data.

sagitta42 commented 3 years ago

Hi everyone,

I understand now. In that case, I think @jasondet's suggestion is the best: mcstp, mcpss, mc(raw) (identical to data), and the rest follow the data format automatically through pygama: dsp, hit, evt.

Thank you for the great suggestion!

iguinn commented 3 years ago

Hey all, sorry for joining this discussion late, I wasn't following this repo before.

I have a few comments/questions, mostly about the steps tier.

  1. Just to make sure we're on the same page: The first step is taking ROOT MaGe outputs, performing the windowing process, and writing the steps organized into time windows and sensitive volumes?
  2. The timestamps from the MC tree aren't too meaningful (they represent time since the first decay in the chain). Is thit this directly, or would we be calculating this in the post-processing (which is also necessary to simulate pileup)? Do we also want to write some sort of MC truth for event building here?
  3. We have found that clustering is a very useful process for PSS, where you merge nearby steps. This can reduce by a factor of 2-3, on average, the number of pulses you need to generate for the full waveform. This is another process that could happen at many stages in this process, but it seems like it could potentially make sense to do this instead of steps in the first stage? This would reduce the data storage for these files as well.
  4. We also discussed the need to propagate MC truth values. Do we want to just have the raw step information or do we want to do some processing for this? For example, we might include the total energy in a sensitive volume. At some stage we also need to add energy adjustments (such as for quenching and dead layer effects), and we may also want to include other heuristic values. At what stage would we want to do this kind of thing?

Thanks, Ian

oschulz commented 3 years ago

> Just to make sure we're on the same page: The first step is taking ROOT MaGe outputs, performing the windowing process, and writing the steps organized into time windows and sensitive volumes?

> The timestamps from the MC tree aren't too meaningful (they represent time since the first decay in the chain). Is thit this directly, or would we be calculating this in the post-processing

Indeed, and yes, one of the first steps in postprocessing of Geant4 (MaGe) output should be generation of realistic time stamps (given event rate parameters as additional input data). That way, we can also simulate things like pile-up, and test the ability of our analysis chain to deal with it.
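One way such timestamps could be generated is sketched below, assuming events follow a constant-rate Poisson process so that waiting times between events are exponentially distributed. The function name and parameters are hypothetical, and a real implementation would take its rates from the event rate parameters mentioned above.

```python
import random


def assign_timestamps(n_events, rate_hz, seed=0):
    """Draw absolute event times in seconds for a constant-rate Poisson process."""
    rng = random.Random(seed)  # seeded so that a production run is reproducible
    t = 0.0
    stamps = []
    for _ in range(n_events):
        # Exponential waiting time with mean 1 / rate_hz between events.
        t += rng.expovariate(rate_hz)
        stamps.append(t)
    return stamps


stamps = assign_timestamps(n_events=5, rate_hz=10.0)
```

With a high enough rate, neighbouring timestamps fall within the electronics decay time, which is what produces pile-up in the later stages.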

> We have found that clustering is a very useful process for PSS, where you merge nearby steps.

Yes, clustering is definitely another pre-processing step, like shown in the LEGEND Julia tutorial. Though we're now also looking into clustering less, for SSD, to have more detailed charge clouds when simulating charge-cloud self-interaction (still early days, and does of course come with a computational cost).
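As an illustration of merging nearby steps, the sketch below greedily folds each step into the first cluster whose energy-weighted centroid lies within a given radius. This is not the algorithm used by SSD, siggen, or the tutorial. The 0.5 mm radius, the tuple layout, and the function name are assumptions.

```python
import math


def cluster_steps(steps, radius_mm=0.5):
    """Merge steps lying within `radius_mm` of an existing cluster centroid.

    `steps` is a list of (pos, edep) with pos = (x, y, z) in mm and edep in
    keV. Steps without positive energy are skipped. Returns a list of
    (centroid, summed_edep) tuples.
    """
    clusters = []  # each entry: [energy-weighted centroid, summed energy]
    for pos, edep in steps:
        if edep <= 0:
            continue
        for cluster in clusters:
            if math.dist(cluster[0], pos) <= radius_mm:
                total = cluster[1] + edep
                # Update the centroid as an energy-weighted average.
                cluster[0] = tuple(
                    (c * cluster[1] + p * edep) / total
                    for c, p in zip(cluster[0], pos)
                )
                cluster[1] = total
                break
        else:
            clusters.append([tuple(pos), edep])
    return [(centroid, energy) for centroid, energy in clusters]


steps = [((0.0, 0.0, 0.0), 100.0), ((0.2, 0.0, 0.0), 300.0), ((10.0, 0.0, 0.0), 50.0)]
merged = cluster_steps(steps)
```

A smaller radius keeps more detail in the charge cloud at the cost of more pulses to simulate, which is the trade-off described above.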

> We also discussed the need to propagate MC truth values.

Yes, we've spoken quite a bit about this in the last pulse-sim call. Mariia and others are currently figuring out what exactly we need to propagate and in which data tiers.

> At some stage we also need to add energy adjustments

Depending on whether the pulse-sim package accounts for this already (we're trying to teach SSD to do dead-layer effects, and I think siggen can already do this to some extent), there should definitely be a step of optional heuristics like that. Depending on how it's done (on the energy or on the waveform), it'll need to happen directly before or after pulse simulation, so ideally this will really be part of the pulse-sim packages themselves or the LEGEND-specific wrapper code we'll use to call them.

sagitta42 commented 3 years ago

I guess we should update this :)

Currently used names are: g4s->stp->pss->(mc)raw. Maybe we could change g4s or g4 in the current code to dep or something (meaning, energy depositions), since, if I got it right, one could also use MaGe output as input to SSD or siggen.

oschulz commented 3 years ago

Yes, it shouldn't be something software/product specific, like g4 is

sagitta42 commented 1 year ago

This is quite outdated now. The current format uses three-letter names as in data processing: pet->stp->pss->raw.

pet stands for "position-energy-time" to not be specific to input (g4simple, MaGe, mpp).

The rest are the same as discussed here, and raw is simply raw - also because this way the hdf5 file with the field raw can be directly plugged into build_dsp or otherwise be processed with any DSP tools we currently have that work for data raw.