key4hep / k4FWCore

Core Components for the Gaudi-based Key4hep Framework
Apache License 2.0
10 stars 26 forks source link

Memory leak in k4DataSvc for CaloHitContributions if not explicitly loaded #82

Closed wdconinc closed 1 year ago

wdconinc commented 2 years ago

We are experiencing a memory leak in the EIC framework which we can reproduce in the key4hep framework as well. We take the same approach to the PodioDataSvc, so that's not very surprising, but I'm reporting it here so someone with expertise in podio can take a look too.

Steps to reproduce

  1. spack install k4fwcore@1.0pre14 (or e.g. build an environment with k4fwcore@1.0pre14)
  2. git clone https://github.com/key4hep/k4-project-template, build, install
  3. cp k4ProjectTemplate/options/readExampleEventData.py k4ProjectTemplate/options/readEICEventData.py and apply the following changes to read in the collection 'EcalBarrelHits' for all events in an existing file:
    5,7c5
    < podioevent.input = "output_k4test_exampledata.root"
    < from Configurables import CreateExampleEventData
    < producer = CreateExampleEventData()
    ---
    > podioevent.input = "sim_barrel_sciglass_electron_1.0_10.0.edm4hep.root"
    11c9
    < inp.collections = ["MCParticles", "SimTrackerHit"]
    ---
    > inp.collections = ["MCParticles", "EcalBarrelHits"]
    16c14
    <                 EvtMax=100,
    ---
    >                 EvtMax=-1,
  4. Download sim_barrel_sciglass_electron_1.0_10.0.edm4hep.root (103 MB), generated by an unmodified DD4hep ddsim script against the EIC geometry.
  5. Back in build directory, run /usr/bin/time -v ./run k4run ../k4ProjectTemplate/options/readEICEventData.py.

Expected behavior

The value under Maximum resident set size (kbytes) should be independent of number of events (and about 500 MB)

Actual behavior

The memory use increases with number of events, and goes up to ~2 GB for this file (with 1000 events). Files with more than 10k events are not able to be run on typical systems due to memory limitations.

Workaround

The memory leak can be avoided by explicitly listing the linked collections, i.e. modifying the diff above to

11c9
< inp.collections = ["MCParticles", "SimTrackerHit"]
---
> inp.collections = ["MCParticles", "EcalBarrelHits", "EcalBarrelHitsContributions"]

in which case the EcalBarrelHitsContributions collection is properly deleted after each event.

This behavior may qualify as a bug since a data model may change without notice, and downstream options file will continue to work but appear to start leaking memory. It would be clearer if it just failed instead (or didn't leak the memory).

tmadlener commented 2 years ago

I think I figured out the underlying problem for this, but I am not yet entirely sure if there is an easy solution to it.

The problem is the following: PodioDataSvc uses podio::EventStore to read collections from file. The podio::EventStore keeps track of all collections it has read, including the ones that are implicitly read when resolving relations (i.e. the CaloHitContributions in this case). It also takes ownership of all collections it reads(!). Calling EventStore::clear disposes of all read collections. The PodioDataSvc on the other hand hands over ownership to the Gaudi EventStore (since the DataWrapper takes ownership of the wrapped podio::CollectionBase*). To make the podio::EventStore "aware" of this fact it calls EventStore::clearCaches:

https://github.com/key4hep/k4FWCore/blob/d6555ad5679fb10f194636a52b0bd5eee272749c/k4FWCore/src/PodioDataSvc.cpp#L70-L71

That clears the internal vector of the EventStore, which in turn means that only explicitly read collections are now handled by the PodioDataSvc and implicitly read ones are no longer reachable from the EventStore and leak.

Given this interplay between the two, I am not sure there is an easy fix. Since overhauling the whole I/O is on our list once we have a prototype of the Frame in podio my preferred solution would actually be to wait until then, unless this is something that cannot be worked around by simply listing all (implicitly) read collections.

wdconinc commented 2 years ago

Is there a way to get notified of implicitly read collections as a WARNING?

tmadlener commented 2 years ago

I don't think there is an easy way to do that, since implicitly and explicitly read collections go through the same mechanism in the EventStore (especially in the way in which it is used here in k4FWCore). I will have a second look to make sure I didn't miss anything though.

tmadlener commented 1 year ago

I think this should no longer apply with the introduction of Frame based I/O (#100). If this is still an issue, feel free to re-open.