Open lmh91 opened 3 years ago
I summarized my "proposal" from the call. I have already modified some names. The column names are just something to start with.
mcraw
Output of Monte Carlo (MC) - Input to SSD / FieldGen+SigGen
Here, each row contains the information about a single hit (energy deposition).

| Group/Column Name | Data Format (each row) | Dimension | Description | Example |
|---|---|---|---|---|
| mc_evt_id | integer | 1 | MC Event ID | 1 |
| det_id | integer | 1 | MC Detector ID of the detector in which the hit (energy deposition) occurs | 1 |
| pos | 3-element vector | length | Position of the hit | [0.01, 0.02, 0.014] * u"m" |
| edep | float | energy | Deposited energy at the hit position | 1460.0 * u"keV" |
| thit | float | time | Timestamp of the hit | 124.0 * u"s" |

Time clustering by about 1 ns (all energy depositions within 1 ns will be drifted together). The mean of the individual timestamps (clustered together) will be the timestamp of the germanium-detector-event in the next stage of the simulation.
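
As a rough illustration of the ~1 ns time clustering described above, here is a minimal Python sketch. The dictionary keys (`thit_ns`, `edep_keV`), the function names, and the choice to measure the window from the first hit of each cluster are assumptions made for this example only, not part of the proposal or of SSD.

```python
from statistics import fmean


def cluster_hits_by_time(hits, window_ns=1.0):
    """Group hits per detector; a hit joins the current cluster if it lies
    within `window_ns` of the FIRST hit of that cluster (an assumption)."""
    clusters = []
    for hit in sorted(hits, key=lambda h: (h["det_id"], h["thit_ns"])):
        if (
            clusters
            and clusters[-1][0]["det_id"] == hit["det_id"]
            and hit["thit_ns"] - clusters[-1][0]["thit_ns"] <= window_ns
        ):
            clusters[-1].append(hit)
        else:
            clusters.append([hit])
    return clusters


def cluster_timestamp(cluster):
    """Mean of the individual hit timestamps, used as the event timestamp."""
    return fmean(h["thit_ns"] for h in cluster)


# Hypothetical hits: two within 1 ns of each other, one much later.
hits = [
    {"mc_evt_id": 1, "det_id": 1, "edep_keV": 900.0, "thit_ns": 0.0},
    {"mc_evt_id": 1, "det_id": 1, "edep_keV": 560.0, "thit_ns": 0.4},
    {"mc_evt_id": 1, "det_id": 1, "edep_keV": 100.0, "thit_ns": 50.0},
]
clusters = cluster_hits_by_time(hits)
timestamps = [cluster_timestamp(c) for c in clusters]
```

With these inputs the first two hits end up in one cluster and the third in its own cluster. Whether the window should be anchored to the first hit or to the previous hit is a design choice that would need to be agreed upon.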
mcpss
Output of SSD / FieldGen+SigGen - Input to Electronics+DAQ

| Group/Column Name | Data Format (each row) | Dimension | Description | Example |
|---|---|---|---|---|
| det_evt_id | integer | 1 | Detector Event ID of the germanium-detector-event. This event can be built up from multiple MC hits | 1 |
| det_id | integer | 1 | Detector ID of the detector | 1 |
| chn_id | integer | 1 | Channel ID to which the waveform belongs | 1 |
| pos | vector of 3-element vectors | length | Stores all hit positions from which the waveform was generated | [[0.01, 0.02, 0.014] * u"m"] |
| edep | vector of floats | energy | Stores all hit energies from which the waveform was generated | [1460.0 * u"keV"] |
| thit | float | time | Timestamp of the germanium-detector-event. This might be the mean of the MC timestamps of the individual hits from which this waveform was generated. This will be used for event building in the next stage. | 124.0 * u"s" |
| waveform | LegendWaveformFormat | time and charge | Generated waveform for the respective channel and hits. These waveforms have arbitrary lengths (due to different drift times). In my opinion they should be in units of charge (deposited energy / ionization energy of germanium) | 0:4:20000 * u"ns" and rand(5000) * u"C" |
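
To make the suggested charge unit concrete (deposited energy / ionization energy of germanium), here is a small Python sketch of the conversion. The pair-creation energy of 2.96 eV is an assumed, commonly quoted value for germanium near 77 K; the value actually used by SSD or siggen may differ, and the function name is hypothetical.

```python
ELEMENTARY_CHARGE_C = 1.602176634e-19  # exact SI value of the elementary charge
PAIR_CREATION_ENERGY_EV = 2.96  # ASSUMED mean energy per electron-hole pair in Ge


def edep_to_charge(edep_keV, pair_energy_eV=PAIR_CREATION_ENERGY_EV):
    """Convert a deposited energy in keV to the collected charge in coulombs."""
    n_pairs = edep_keV * 1e3 / pair_energy_eV  # number of electron-hole pairs
    return n_pairs * ELEMENTARY_CHARGE_C


# The 1460 keV example from the table corresponds to roughly 8e-14 C.
charge = edep_to_charge(1460.0)
```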
t1pss (mimics tier1 v01.00 real raw data format)
Well, this is already fixed. We could just add some additional information like true_energy.
mcraw
The group/column names of the MC input to SSD seem to be: evtno, detno, thit, edep, pos.
Not sure how important it is to have these exact names, but that's how SSD seems to be accessing them.
mcpss
I would suggest two groups on the top level: raw (where the simulation results are stored) and mctruth.
t1pss
Similarly, three groups: raw (mimics tier1), mctruth (propagated from mcpss) and etruth (electronics and DAQ simulation parameters).
Hi Mariia (@sagitta42), all,
I'd like to suggest an alternative to the proposed naming convention. It would be nice I think if there were parity between the naming schemes for MC and data. In data we do not have "tier1" and people will wonder whether that means daq (the first form of recorded detector data), raw (the first type of data for analysis), or dsp (if raw is considered "tier 0") until they figure out it's meant to correspond to raw. Then they might be confused again because "mcraw" is something else entirely.
In the data we have the tiers
[daq] raw dsp hit evt
where daq is in brackets because we anticipate deleting it once the raw tier is generated. So if we have MC generate a file that is meant to be identical in structure to our "raw" data from the detectors, I think that is what we should call "mcraw." Then the dsp file generated from that would be "mcdsp", and so on.
In that case, the file you proposed to call "mcraw" would need a new name. What it contains is "stepping information" from the simulations. I think there is value in keeping the field widths uniform in the file names, so I propose to use the 5-character label "mcstp" for that. Note that "mcpss" conforms to this scheme already.
So I suggest using the following names for sim tiers:
mcstp mcpss mcraw mcdsp mchit mcevt
where G4 is used to generate mcstp, PSS is used to generate mcpss, electronics + daq sims generate mcraw, and pygama generates the subsequent tiers. Further, I think we should structure mcpss so that its tables can be joined row-by-row with tables from mcraw to mchit. In that case, the fields det_id and chn_id are not necessary because the former will be in the hdf5 group name, and the latter will be in the channel map.
To build events the mcpss step will have to generate a time coincidence map as well (which may be further modified by the daq sim). I hope to get the version for data prototyped soon so that you can see what it's meant to look like. In the meantime, refer to the data handling doc for details: https://docs.legend-exp.org/index.php/f/112233 (click on L200DataHandling_v3.pdf). Perhaps we should update this doc with these choices for sims once they are agreed upon.
Note that the fact that there are two "tiers" of MC data before one gets to "tier 1" for the data shows an example of why it is wise to use names rather than numbers to refer to tiers (thanks @oschulz!).
Best, Jason
Hi Jason (@jasondet),
I agree with your comment regarding possible confusion between mcraw and raw.
But why do we actually need new names for raw and the further stages produced by pygama?
I would suggest sticking with the 3-letter names (in the call 5 was also suggested, but I do not remember the reason):
stp, pss, [daq] (skipped by the simulation if really nothing is done here but data format conversion), raw, dsp, hit, evt
In my opinion, pygama should not know whether the data it is analyzing is simulated or real. I would rather put this information in the names of the directories / files (so pygama could still know).
Regarding structuring pss so that it can be joined row-by-row with the tiers up to hit: I don't think this is possible, because between pss and raw there might be, as you also mentioned, some event building, e.g. possible coincidence hits, which would destroy this 1-to-1 mapping. The event building depends on the DAQ, and SSD / SigGen does not know the DAQ when they produce pss.
Depends on how we agree the event building is going to be handled.
I would add IDs for each step to be able to trace back: stp_evt_id, pss_evt_id, raw_evt_id (just dummy names for now).
So, in contrast to data, the raw group would have an additional dataset/column holding the corresponding pss_evt_id's and maybe also another dataset/column holding the corresponding stp_evt_id's.
My understanding of event building:
SSD (SigGen) takes all hits from stp which are within, e.g., ~1 ns and drifts those hits together, producing one waveform (per channel). These waveforms get one timestamp, e.g. the mean of the timestamps of all MC hits from stp.
It might happen that two such pss timestamps are within a certain time interval, e.g. 200 µs (several decay times of the electronics). Such pss waveforms would be merged (superposition before electronics) to simulate pile-up or coincidence events. Thus, 2 waveforms (rows) of pss could result in one waveform (row) in [daq] / raw.
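
The pile-up merging and the ID bookkeeping described above could look roughly like the following Python sketch. It only tracks which pss rows end up in which raw row; the actual superposition of waveforms is omitted. The field names, the 200 µs default window, and the rule of chaining on the previous pss timestamp are assumptions for illustration.

```python
def merge_pileup(pss_rows, window_ns=200_000.0):
    """Merge pss rows of one channel whose timestamps are closer than
    `window_ns` to the previous pss row; keep the contributing pss_evt_id's."""
    raw_rows = []
    last_t = None
    for row in sorted(pss_rows, key=lambda r: r["thit_ns"]):
        if last_t is not None and row["thit_ns"] - last_t <= window_ns:
            # Pile-up: this pss waveform would be superimposed on the open raw row.
            raw_rows[-1]["pss_evt_id"].append(row["pss_evt_id"])
        else:
            raw_rows.append(
                {
                    "raw_evt_id": len(raw_rows) + 1,
                    "thit_ns": row["thit_ns"],
                    "pss_evt_id": [row["pss_evt_id"]],
                }
            )
        last_t = row["thit_ns"]
    return raw_rows


# Hypothetical pss rows: the first two are 150 us apart and pile up.
pss_rows = [
    {"pss_evt_id": 1, "thit_ns": 0.0},
    {"pss_evt_id": 2, "thit_ns": 150_000.0},
    {"pss_evt_id": 3, "thit_ns": 1_000_000.0},
]
raw_rows = merge_pileup(pss_rows)
```

Here two pss rows collapse into one raw row, which is exactly the case that breaks a 1-to-1 mapping between pss and raw, while the stored list of pss_evt_id's still allows tracing back.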
The fields det_id and chn_id can indeed be dropped if they are encoded in the file names.
Hi @jasondet,
We were told to choose the same length for parsing reasons, corresponding to the 5-letter names "tier1", "tier2" and so on.
As for the group names, they will mimic those in data, as @lmh91 pointed out.
For example, mcraw is a simple h5 file with energy, hit time, position, etc.
mcpss is a not-yet-tier1 file which is not supposed to be read by pygama, so we may name the fields as we like. So far we planned to name the groups raw (not yet the same as the one of tier1, but soon to be; it contains only channel, ievt and waveform) and mctruth. This stage is intermediate and will not be saved in production.
After data format conversion (and in principle more than that, since we have been talking about more electronics simulation), we obtain t1pss, which completely mimics tier1 and is, as @lmh91 said, indistinguishable from data to pygama. It contains the groups raw (identical to data), mctruth, and electruth for the DAQ and electronics parameters used in the simulation.
After this point, there is no more simulation to be done. t1pss is fed to pygama, and the next tiers are obtained.
As for mcraw, I agree that it can be confused with raw in data, i.e. a complete tier1, while in our notation it stands for raw MC events. I agree that we should change this name; we will think about alternatives.
Lukas (@lmh91) -- yes, I would be fine with dropping "mc" from all the names, as long as it's clear elsewhere in a file key / path that the file contains mc, not data. And yes, you are correct, I had missed that in general there is no 1-to-1 mapping between pss and raw. I think putting pss_event_id in the raw file is a good suggestion that should be easy to implement. As for how event building will be handled in general -- for tiers prior to "evt" we will have a "time coincidence map" (tcm) that basically keeps lists of row numbers that correspond to raw/dsp/hit data from the same event. For the evt tier the data will already appear in a built structure that can be joined with the tcm for linking to the previous-tier data. I hope to have a prototype of the tcm ready soon so that people can see better how it will work.
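
To give a feeling for the idea of a time coincidence map as "lists of row numbers that correspond to data from the same event", here is a toy Python sketch. It is not the pygama tcm format, which had not been prototyped at the time of this discussion; the function name, the input layout, and the coincidence rule (gap to the previous timestamp) are all assumptions.

```python
def build_tcm(timestamps_by_channel, window_ns):
    """Return a list of events; each event is a list of (channel, row) pairs
    whose timestamps follow each other within `window_ns`."""
    entries = sorted(
        (t, channel, row)
        for channel, times in timestamps_by_channel.items()
        for row, t in enumerate(times)
    )
    events = []
    last_t = None
    for t, channel, row in entries:
        if last_t is None or t - last_t > window_ns:
            events.append([])  # gap too large: start a new event
        events[-1].append((channel, row))
        last_t = t
    return events


# Hypothetical per-channel timestamps in ns; row numbers are list indices.
tcm = build_tcm({"ch1": [100.0, 5000.0], "ch2": [120.0]}, window_ns=1000.0)
```

In this example the first rows of ch1 and ch2 are grouped into one event, and the second row of ch1 forms its own event. The same row numbers would then index the raw, dsp, and hit tables of each channel.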
Mariia (@sagitta42) -- I'm confused by your statement "We were told to choose the same length for parsing reasons, corresponding to the 5-letter names 'tier1', 'tier2' and so on." We do not use the names "tier1", "tier2" etc in LEGEND. We use the 3-letter names daq, raw, dsp, hit, and evt. See the Data Handling doc I linked in my last post or my Analysis Overview talk at the last CM for details.
I'm curious what data will be in the mctruth and electruth tables. We had envisioned having only one major group per file but I'm not opposed to adding more tables for mc. However I would have thought that parameters used in simulations should be in the simulation code / config files or in the database, and one wouldn't need to write them out repeatedly for each row in the output simulation. Maybe I misunderstood the proposal. In my mind the data in the stp and pss tiers -is- the mc truth.
I fully agree with Lukas - we don't want to handle different tier names in the processing pipeline for physical and simulated data.
Hi everyone,
I understand now. In that case, I think @jasondet's suggestion is the best: mcstp, mcpss, mc(raw) (identical to data), and the rest follow the data format automatically through pygama - dsp, hit, evt.
Thank you for the great suggestion!
Hey all, sorry for joining this discussion late, I wasn't following this repo before.
I have a few comments/questions, mostly about the steps tier.
Thanks, Ian
Just to make sure we're on the same page: The first step is taking ROOT MaGe outputs, performing the windowing process, and writing the steps organized into time windows and sensitive volumes? The timestamps from the MC tree aren't too meaningful (they represent time since the first decay in the chain). Is thit this directly, or would we be calculating this in the post-processing?
Indeed, and yes, one of the first steps in postprocessing of Geant4 (MaGe) output should be generation of realistic time stamps (given event rate parameters as additional input data). That way, we can also simulate things like pile-up, and test the ability of our analysis chain to deal with it.
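
One simple way to generate such timestamps from an event rate is to treat the events as a Poisson process and draw exponentially distributed waiting times. The Python sketch below shows this under that assumption; the function name, the rate, and the fixed seed are illustrative only, and a real implementation might need time-dependent rates or decay-chain correlations.

```python
import random


def assign_timestamps(n_events, rate_hz, seed=0):
    """Draw absolute timestamps (in seconds) for `n_events` events of a
    Poisson process with mean rate `rate_hz`."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    t = 0.0
    stamps = []
    for _ in range(n_events):
        t += rng.expovariate(rate_hz)  # exponential waiting time to next event
        stamps.append(t)
    return stamps


# Hypothetical example: 5 events at a mean rate of 10 Hz.
stamps = assign_timestamps(5, rate_hz=10.0)
```

At a high enough rate, consecutive timestamps occasionally fall within the same waveform window, which is what makes pile-up studies possible.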
We have found that clustering is a very useful process for PSS, where you merge nearby steps.
Yes, clustering is definitely another pre-processing step, as shown in the LEGEND Julia tutorial. Though we're now also looking into clustering less, for SSD, to have more detailed charge clouds when simulating charge-cloud self-interaction (still early days, and it does of course come with a computational cost).
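
For readers unfamiliar with step clustering, the following Python sketch merges nearby steps with a simple greedy rule and an energy-weighted centroid. This is only one possible scheme, assumed here for illustration; it is not the algorithm used in the LEGEND Julia tutorial or in any PSS package, and it assumes strictly positive energy depositions.

```python
import math


def cluster_steps(steps, radius_mm):
    """Greedy clustering: each step (x, y, z, edep) joins the first cluster
    whose centroid is within `radius_mm`, otherwise it starts a new cluster."""
    clusters = []  # each cluster is [x, y, z, edep] with an energy-weighted centroid
    for x, y, z, edep in steps:
        for c in clusters:
            if math.dist((x, y, z), c[:3]) <= radius_mm:
                total = c[3] + edep
                for i, v in enumerate((x, y, z)):
                    c[i] = (c[i] * c[3] + v * edep) / total  # weighted mean
                c[3] = total
                break
        else:
            clusters.append([x, y, z, edep])
    return clusters


# Hypothetical steps in mm and keV: two close together, one far away.
steps = [(0.0, 0.0, 0.0, 100.0), (0.1, 0.0, 0.0, 300.0), (10.0, 0.0, 0.0, 50.0)]
clusters = cluster_steps(steps, radius_mm=0.5)
```

A smaller radius keeps more detail in the charge cloud at a higher computational cost, which is the trade-off mentioned above.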
We also discussed the need to propagate MC truth values.
Yes, we've spoken quite a bit about this in the last pulse-sim call. Mariia and others are currently figuring out what exactly we need to propagate and in which data tiers.
At some stage we also need to add energy adjustments.
Depending on whether the pulse-sim package accounts for this already (we're trying to teach SSD to do dead-layer effects, and I think siggen can already do this to some extent), there should definitely be a step of optional heuristics like that. Depending on how it's done (on the energy or the waveform), it'll need to happen directly before or after pulse simulation, so ideally this will really be part of the pulse-sim packages themselves or the legend-specific wrapper code we'll use to call them.
I guess we should update this :)
Currently used names are: g4s->stp->pss->(mc)raw. Maybe we could change g4s or g4 in the current code to dep or something (meaning energy depositions), since, if I got it right, one could also use MaGe output as input to SSD or siggen.
Yes, it shouldn't be something software/product specific, like g4 is.
This is quite outdated now. The current format uses three-letter names as in data processing: pet->stp->pss->raw.
pet stands for "position-energy-time" to not be specific to the input (g4simple, MaGe, mpp).
The rest are the same as discussed here, and raw is simply raw - also because this way the hdf5 file with the field raw can be directly plugged into build_dsp or otherwise be processed with any DSP tools we currently have that work for data raw.
Follow-up discussion (https://indico.legend-exp.org/event/477/) on the naming convention of the simulation file format, including a detailed definition of what is saved at the different stages and how.