key4hep / k4FWCore

Core Components for the Gaudi-based Key4hep Framework
Apache License 2.0

Support running multithreaded: Add an IOSvc, an algorithm for reading and writing for functionals in k4FWCore #173

Closed · jmcarcell closed this 5 months ago

jmcarcell commented 9 months ago

BEGINRELEASENOTES

ENDRELEASENOTES

Ready for review! Over the coming days I'll test a few things: whether there are memory leaks, metadata handling (most of it copied from what we were doing with PodioOutput), and running multithreaded.

Left out of this PR: writing multiple times in the same chain (multiple output files). That will come in a different PR; this one is already big enough.

In addition, when https://github.com/AIDASoft/podio/pull/522 is merged I'll make the changes so that it's easy to swap between the normal ROOT backend and the one introduced there, either in this PR or in a different one.
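To give an idea of how the pieces fit together, a steering file could look roughly like this. This is a sketch only: the IOSvc property names, the import paths, and the example Transformer are assumptions based on names used elsewhere in this thread.

```python
# Sketch only, not taken from the PR: an IOSvc configured for reading and
# writing, with a functional Transformer in between. Property names and the
# example algorithm are assumptions.
from Configurables import ExampleFunctionalTransformer
from k4FWCore import ApplicationMgr, IOSvc  # assumed import path

svc = IOSvc("IOSvc")
svc.input = "input.root"    # assumed property name
svc.output = "output.root"  # assumed property name

transformer = ExampleFunctionalTransformer(
    "Transformer",
    InputCollection=["MCParticles"],
    OutputCollection=["NewMCParticles"],
)

# The custom ApplicationMgr wrapper is expected to insert the Reader and
# Writer around the user algorithms, so only the Transformer is listed here.
ApplicationMgr(TopAlg=[transformer], EvtSel="NONE", EvtMax=-1, ExtSvc=[svc])
```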

m-fila commented 8 months ago

Would it be possible to have the Reader and Writer dynamically declare their data dependencies and participate in the data flow (using the property system instead of writing directly to the data store; Gaudi's CPUCruncher does this)? With that there shouldn't be a need for a custom ApplicationMgr and additional wrapping in sequencers to ensure they are the first and last algorithms called.
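For reference, the CPUCruncher approach lets the steering file spell out the dependencies as plain properties, roughly like this (a sketch using the stock GaudiHive configurable):

```python
# Sketch of the CPUCruncher pattern: inpKeys/outKeys are plain string
# properties that the algorithm turns into declared data handles at
# initialization, so the AvalancheScheduler can build the data-flow graph.
from Configurables import CPUCruncher

a1 = CPUCruncher("A1", outKeys=["/Event/a1"])
a2 = CPUCruncher("A2", inpKeys=["/Event/a1"], outKeys=["/Event/a2"])
```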

Taken from FunctionalMTFile:

```
AvalancheSchedu...   INFO Data Dependencies for Algorithms:
  Reader
      none
  Transformer
    o INPUT  'MCParticles'
    o OUTPUT 'NewMCParticles'
  Writer
      none
```

The AvalancheScheduler recognizes only the Transformer as having data dependencies:

```
====================================
Data origins and destinations:
====================================
Transformer
V
o 0x4045d28
V
====================================
```
jmcarcell commented 8 months ago

> I left some comments on the service part.
>
> For the algorithm part: previously the client algorithms were supposed to derive from GaudiAlgorithm with custom Traits; now they are meant to derive from the provided k4FWCore::Consumer etc. Does that mean all other flavors of functional algorithms (e.g. filters) will be added later upon request? Or is there some incompatibility built in?

Yeah, they have to be added on demand. The Producer, Consumer and Transformer in this PR cover most use cases; we're only missing filtering and maybe some utility functionals.

m-fila commented 8 months ago

The algorithms using runtime specification of collections with std::map seem to be incompatible with the AvalancheScheduler. Similarly to the Reader and Writer algorithms, they don't declare their runtime data dependencies, so the scheduler can't figure out the order in which they should be executed.
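For context, switching a steering file to the AvalancheScheduler amounts to the standard Gaudi setup, roughly as below (a sketch with stock Gaudi configurables, not the attached file; the ShowDataDependencies / ShowDataFlow options produce the reports quoted in this thread):

```python
# Standard Gaudi multithreading setup (sketch): the AvalancheScheduler needs
# a HiveWhiteBoard as the event store and the Hive slim event loop manager.
from Configurables import (ApplicationMgr, AvalancheSchedulerSvc,
                           HiveSlimEventLoopMgr, HiveWhiteBoard)

whiteboard = HiveWhiteBoard("EventDataSvc", EventSlots=2)
scheduler = AvalancheSchedulerSvc(ThreadPoolSize=2,
                                  ShowDataDependencies=True,
                                  ShowDataFlow=True)
event_loop = HiveSlimEventLoopMgr(SchedulerName="AvalancheSchedulerSvc")

ApplicationMgr(EvtMax=10,
               EvtSel="NONE",
               ExtSvc=[whiteboard],
               EventLoop=event_loop,
               TopAlg=[])  # the producers and the consumer go here
```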

For instance, I changed ExampleFunctionalConsumerRuntimeCollections.py (attached) to use AvalancheScheduler and obtained an error:

```
36: Consumer            ERROR Consumer : Failed to retrieve object MCParticles2
36: Consumer            ERROR Maximum number of errors ( 'ErrorMax':1) reached.
36: AlgTask           WARNING Execution of algorithm Consumer failed
36: AvalancheSchedu...  ERROR *** Stall detected, event context: s: 0  e: 0
```

After inspecting the scheduler's output:

Data dependencies:

```
INFO Data Dependencies for Algorithms:
  Producer0
    o OUTPUT 'MCParticles0'
  Producer1
    o OUTPUT 'MCParticles1'
  Producer2
    o OUTPUT 'MCParticles2'
  Consumer
    o INPUT  'InputCollection'
```

Data flow:

```
====================================
Data origins and destinations:
====================================
Producer0
V
o 0x25022f8
V
====================================
Producer1
V
o 0x25023d8
V
====================================
Producer2
V
o 0x2502488
V
====================================
```

The baseline version without the AvalancheScheduler runs fine. The error also disappears in the AvalancheScheduler version after wrapping the topAlg in a Gaudi__Sequencer with Sequential=True, as sketched below: the data item that was previously missing is then present 'by accident', because the algorithms are put in the correct order and executed sequentially.
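A sketch of that wrapping (the producers and consumer are assumed to be configured earlier in the file, with the names from the log above):

```python
# Forcing sequential execution (sketch): a Gaudi__Sequencer with
# Sequential=True makes the scheduler respect the listed order, which hides
# the missing data dependencies rather than fixing them.
from Configurables import ApplicationMgr, Gaudi__Sequencer

seq = Gaudi__Sequencer(
    "TopSequencer",
    # producer0..producer2 and consumer are configured earlier in the file
    Members=[producer0, producer1, producer2, consumer],
    Sequential=True,
)
ApplicationMgr(TopAlg=[seq])
```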

I think it would be good to test multi-threading with multiple different algorithms, instead of having two tests dedicated to multi-threading while the algorithm tests run with the default scheduler.

tmadlener commented 7 months ago

A question that just came to me while looking at #172: Does this still work with non-functional algorithms?

jmcarcell commented 7 months ago

From @tmadlener:

> A question that just came to me while looking at #172: Does this still work with non-functional algorithms?

It doesn't, because support for that hasn't been implemented (but it could be). I have added a few asserts so that one now gets a proper error message when using input / output types that are not among the allowed ones. This has been discussed several times in the EDM4hep meetings and I think the agreement leans towards not letting algorithms use classes outside of EDM4hep or podio. If I remember well, this matches the current support for functionals in k4FWCore.

tmadlener commented 7 months ago

I think the question about which types can be transported through the event store between algorithms is a separate concern from whether we want to support non-functional algorithms. For the first one, I think limiting support to EDM4hep / podio-generated types is OK in the long run. We will probably need support for non-functional algorithms as well, though. At least at the moment we need to be able to run non-functional algorithms, and to pass non-EDM4hep objects via the TES, because of the MarlinWrapper: there we need to pass an LCEvent (or a wrapper around it) along, and I don't think we can ever make that functional.

tmadlener commented 7 months ago

Another question that I just came across: how do we foresee the metadata handling here? Would that still simply work with the MetaDataHandle? Do the current MetaDataHandles even work well with functional algorithms?

jmcarcell commented 6 months ago

A few answers to some questions. From @m-fila:

> The algorithms using runtime specification of collections with std::map seem to be incompatible with the AvalancheScheduler. Similarly to the Reader and Writer algorithms, they don't declare their runtime data dependencies, so the scheduler can't figure out the order in which they should be executed.

This is now fixed and the data flow graph looks correct with std::map.

From @tmadlener (sorry, I think I replied to another question in the reply to this):

> A question that just came to me while looking at https://github.com/key4hep/k4FWCore/pull/172: Does this still work with non-functional algorithms?

Yes, there are two tests, runFunctionalMix.py and runFunctionalMixIOSvc.py, that run functional algorithms and "old" algorithms together, using both PodioInput and PodioOutput, and show that it works; a rough sketch of such a mixed setup is shown below.
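A rough sketch of such a mix (the actual contents of runFunctionalMix.py are not shown in this thread; algorithm and property names are assumptions modeled on the k4FWCoreTest examples):

```python
# Sketch of mixing "old" and functional algorithms: PodioInput/PodioOutput
# (themselves old-style algorithms) handle the I/O, while a functional
# Transformer runs in between, sharing the same event store.
from Configurables import (ExampleFunctionalTransformer, PodioInput,
                           PodioOutput, k4DataSvc)
from k4FWCore import ApplicationMgr  # assumed import path

data_svc = k4DataSvc("EventDataSvc")
data_svc.input = "input.root"

inp = PodioInput("PodioInput", collections=["MCParticles"])

transformer = ExampleFunctionalTransformer(
    "Transformer",
    InputCollection=["MCParticles"],
    OutputCollection=["NewMCParticles"],
)

out = PodioOutput("PodioOutput", filename="output_mix.root")
out.outputCommands = ["keep *"]

ApplicationMgr(TopAlg=[inp, transformer, out],
               EvtSel="NONE",
               ExtSvc=[data_svc])
```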

> Another question that I just came across: how do we foresee the metadata handling here? Would that still simply work with the MetaDataHandle? Do the current MetaDataHandles even work well with functional algorithms?

Not implemented for now. MetaDataHandle is coupled to PodioInput and PodioOutput, so it won't work here.

jmcarcell commented 5 months ago

Update: I rechecked and tuned the tests that mix "old" and functional algorithms in this PR. The conclusion: when using PodioInput and PodioOutput, the collections produced by functional algorithms are not saved (logical, since PodioOutput has not been changed to do that). When using IOSvc, all collections are saved, both those produced by "old" algorithms and by functional ones. Beyond that there don't seem to be any differences when running. To be tested with the Marlin wrapper.

m-fila commented 5 months ago

Test FunctionalMTFile reports an ERROR:

```
33: Transformer          INFO Transforming 2 particles
33: IOSvc                INFO No context found in IOSvc
33: ApplicationMgr       INFO Application Manager Stopped successfully
33: NTupleSvc           ERROR ERROR while saving NTuples
33: ApplicationMgr       INFO Application Manager Finalized successfully
33: ApplicationMgr       INFO Application Manager Terminated successfully with a user requested ScheduledStop
1/1 Test #33: FunctionalMTFile .................   Passed    4.02 sec
```

Test FunctionalFile doesn't report any problems. I'm not sure whether this is a real error; the file is saved properly.

jmcarcell commented 5 months ago

From the review:

I couldn't find any violations of data or control flow. A few warnings could be silenced by setting the correct value explicitly (comments below), but in general this should be ready to go 👍

Having said that, I found one thing that was a bit puzzling and different with respect to plain Gaudi and previous k4FWCore algorithms. This isn't a deal-breaker.

With the current design all of an algorithm's inputs and outputs must be declared as lists, even if the algorithm operates on single objects. For instance, a producer that takes nothing and produces a single collection has to be constructed with a list of keys for its outputs (KeyValues) instead of a single key (KeyValue) as in plain Gaudi. All the provided keys are used in data dependency resolution, but only the first key will be populated and put into the event store.

https://github.com/key4hep/k4FWCore/blob/312725d3164abe15826b3538d9e1c903b9b34fe5/test/k4FWCoreTest/src/components/ExampleFunctionalProducer.cpp#L28-L42

This is also reflected in the steering files:

* This would be standard code for a plain Gaudi producer, but here it will result in an error because `OutputCollection` must be a list:
  ```python
   ExampleFunctionalProducer("Producer", OutputCollection="Foo")
  ```

* This will execute correctly:
  ```python
   ExampleFunctionalProducer("Producer", OutputCollection=["Foo"])
  ```

* This will also execute correctly but will produce only `Foo`. The scheduler will see the algorithm as producing both `Foo` and `Bar`. Ideally it shouldn't even be possible to pass a second item:
  ```python
  ExampleFunctionalProducer("Producer", OutputCollection=["Foo", "Bar"])
  ```

While this could be solved with a convention, I'd say it can be misleading for algorithm developers. The documentation on functional algorithms is already a bit cryptic, and now we have algorithms whose signatures take single objects as input/output but which declare in the property system that they operate on lists of objects.

That is indeed something that isn't very nice about the current way. I'll have a look at how to change that.

As discussed in the past meetings, we're merging this now, since it has already gotten quite big; the remaining issues will be fixed in other PRs.