lorenzetti-hep / lorenzetti

Lorenzetti: Empowering Physics Performance and Analysis with Low-level Calorimetry Data
GNU General Public License v3.0

AOD reconstruction crash #29

Open laurab222 opened 1 week ago

laurab222 commented 1 week ago

Hello,

I'm currently trying to run the AOD reconstruction stage on two relatively large datasets of almost 10k events (one with and one without pileup). However, the reconstruction always gets killed after around 180-200 events because the machine runs out of available memory. I checked the memory usage of the various parts involved in the AOD reconstruction. The attached screenshot (memory_log) shows the memory usage at different steps: ESD 1.4 is at line 166, ESD 1.45 at line 181 and ESD 1.5 at line 194. The usage increases gradually with every event, and I identified two places where it increases the most:

  1. In the ESD Reader: in the 'deserializing CaloDetDescriptor' part (https://github.com/lorenzetti-hep/lorenzetti/blob/master/reconstruction/io/RootStreamBuilder/src/RootStreamESDReader.cxx#L168). This is the major cause of the memory increase.
  2. In the AOD Maker: in the serialize method, though this only causes a slight increase in memory (second attached screenshot, memory_log).
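For anyone who wants to reproduce this kind of per-event measurement, here is a minimal sketch in Python. It is not the logging used for the screenshots above: `run_with_memory_log` and `process_event` are hypothetical names, and the loop body stands in for whatever the reconstruction does per event. It relies only on the standard `resource` module (Unix only), which reports the peak resident set size, so it also captures memory allocated on the C++ side.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_with_memory_log(process_event, n_events, every=10):
    """Call process_event(i) for each event; record peak RSS every `every` events."""
    log = []
    for i in range(n_events):
        process_event(i)
        if i % every == 0:
            log.append((i, peak_rss_mb()))
    return log


if __name__ == "__main__":
    retained = []  # stands in for objects that are never released

    def fake_event(i):
        # Hypothetical workload: keep 1 MiB alive per event to mimic a leak.
        retained.append(bytearray(b"x" * (1024 * 1024)))

    for event, mb in run_with_memory_log(fake_event, 50):
        print(f"event {event:4d}: peak RSS {mb:8.1f} MiB")
```

Because the value is a peak, it never decreases; a steady rise across events is the signature of the behaviour described above.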

I also checked out some older lorenzetti versions to try to locate when this issue was introduced (because I remember not having these problems when working with lorenzetti in the summer), but the issue persisted even for old git commits (e.g. commit id 4dc7ee57175824c903605bd1c28d364f76fc80b0).

This issue was observed by several people running lorenzetti on different devices.

As a temporary solution, I'm currently only reconstructing 150 events at a time (I introduced an index argument in the reco_trf.py script), iterating over all of the 10k events and this works fine.
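The batching itself can be sketched roughly as below. This is only an illustration under assumptions: `event_batches`, `run_in_batches` and `build_command` are hypothetical helpers, and the actual arguments passed to reco_trf.py are not shown here. The point is that each batch runs in a fresh process, so whatever memory accumulated is returned to the system when that process exits.

```python
import subprocess


def event_batches(n_events, batch_size=150):
    """Yield (batch_index, first_event, last_event_exclusive) for each batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for index, first in enumerate(range(0, n_events, batch_size)):
        yield index, first, min(first + batch_size, n_events)


def run_in_batches(build_command, n_events, batch_size=150):
    """Run one subprocess per batch; stop at the first failing batch.

    build_command is a hypothetical callable that turns
    (index, first, last) into an argument list for the reconstruction script.
    """
    for index, first, last in event_batches(n_events, batch_size):
        subprocess.run(build_command(index, first, last), check=True)


if __name__ == "__main__":
    # Show only the first batches of a 10k-event sample.
    for index, first, last in list(event_batches(10_000))[:3]:
        print(f"batch {index}: events [{first}, {last})")
```

With 10k events and 150 events per batch this gives 67 batches, the last one covering events 9900 to 9999.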

Thanks! Laura

jlieberm commented 6 days ago

Hi, I just tried an even older version (git checkout a0590c148dbc1a2ae3a468a3e936ca8eb2545ed9) and the process still gets killed. Could this be a problem with the version of the g++ compiler or something similar? Was something updated in the Docker image? What do you think, @micaelverissimo @jodafons?

If nothing changed in that respect, one workaround would be to move the sampling and detector information into the CaloCell object, since this is the only information used by the CaloClusterMaker algorithm, and it is the reason why the ESD Reader algorithm deserializes the CaloDetDescriptorContainer (and also not to dump cell information into the final AOD file).
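To make the idea concrete, here is a rough sketch in Python rather than in the framework's C++. The class layouts and field names are assumptions for illustration and do not mirror the real lorenzetti classes: the cell carries its own copy of `sampling` and `detector`, so a clustering step can work from cells alone without the heavier descriptor objects being kept in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class CaloDetDescriptor:
    """Hypothetical heavy object: geometry plus per-cell pulse samples."""
    eta: float
    phi: float
    sampling: int
    detector: int
    pulse: List[float] = field(default_factory=list)


@dataclass
class CaloCell:
    """Hypothetical light object carrying only what clustering needs."""
    eta: float
    phi: float
    energy: float
    sampling: int
    detector: int


def make_cell(descriptor: CaloDetDescriptor, energy: float) -> CaloCell:
    # Copy the two identifiers once; the descriptor can then be dropped.
    return CaloCell(descriptor.eta, descriptor.phi, energy,
                    descriptor.sampling, descriptor.detector)


def energy_by_sampling(cells: Iterable[CaloCell]) -> Dict[int, float]:
    """Toy stand-in for a clustering step that reads only cell fields."""
    totals: Dict[int, float] = {}
    for cell in cells:
        totals[cell.sampling] = totals.get(cell.sampling, 0.0) + cell.energy
    return totals
```

In this sketch, once `make_cell` has been called, nothing downstream needs the descriptor, which is the property the proposed workaround relies on.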