Data structures are difficult to handle

juanangp commented 2 years ago

Current code design makes difficult to handle data structures, e.g. in TRestHits.h

class TRestHits : public TObject {
   public:
    Int_t fNHits;         ///< Number of punctual energy depositions, it is the length
                          ///< for all the array
    Double_t fTotEnergy;  ///< Event total energy

    std::vector<Float_t> fX;          // [fNHits] Position on X axis for each punctual
                                      // deposition (units mm)
    std::vector<Float_t> fY;          // [fNHits] Position on Y axis for each punctual
                                      // deposition (units mm)
    std::vector<Float_t> fZ;          // [fNHits] Position on Z axis for each punctual
                                      // deposition (units mm)
    std::vector<Float_t> fT;          // [fNHits] Absolute time information for each punctual deposition
                                      // (units us, 0 is time of decay)
    std::vector<Float_t> fEnergy;     // [fNHits] Energy deposited at each
                                      // 3-coordinate position (units keV)
    std::vector<REST_HitType> fType;  //

Y would suggest to move to a different data structure e.g.

  class TRestHit : public TObject {
    Float_t fX;
    Float_t fY;
    Float_t fZ;
    Float_t fT;
    Float_t fEnergy;
    REST_HitType fType;
   enum REST_HitType { unknown = -1, X = 2, Y = 3, Z = 5, XY = 6, XZ = 10, YZ = 15, XYZ = 30 };
  };

 class TRestHits : public TObject {
   public:
     //fNHits is not needed anymore, it is just hitArray.size( );
     Double_t fTotEnergy;  ///< Event total energy
     std::vector<TRestHit> hitArray;          // Array of hits

This would simplify the handling of TRestHits using standard std iterators, for instance this part of the code mimic std iterators:

  class TRestHits_Iterator : public std::iterator<std::random_access_iterator_tag, TRestHits_Iterator> {

The implementatio of this proposal would imply a lot of changes in the code and perhaps the root files might loose the backward compatibility, but it will be a big improvement that would simplify the code.

Let me know what do you think, I can help with the migration.

lobis commented 2 years ago

Hello @juanangp,

I also thought of this in the past, but after trying to refactor them to a structure like what you are suggesting I discovered why they are defined like that.

I don't remember the details but there were some (I thought unsolvable) issues when storing the data in a TTree, it seems it doesn't support such a depth level.

Currently we store a vector of tracks for each event, where each track has a single member which is a hits (plural). I tried to instead have each track have a vector of hits but as I say ROOT doesn't like it.

I think in principle it may be possible to do thanks to dictionaries, but it will probably present some issues, such as making our data structures unreadable without using dictionaries (which is annoying if you want to analyze this data in python using uproot for example). Currently the event tree is readable without having to load REST libraries thanks to the default max split level on the tree, but it stops working once you "put a vector inside a vector". I also asked for a solution on the ROOT forums but didn't get a convincing answer (basically to just use dictionaries).

So I would recommend to keep the current structure, but I also agree that I can be improved in some aspects.

My suggestions coming from experience implementing these structures from scratch for Geant4 simulation code:

Remove TObject inheritance: In my experience it only introduces annoying data members such as TBits (or however its called) and provides nothing useful, it is not necessary to inherit from TObject to save into a ttree. Also you don't expect to save a hits or a track object into a file, there are always inside an event, in this case its enough with the top-most classs inheriting from TObject.
Remove analysis methods from the data structures: I think the data structure definitions should remain as simple as possible. Analysis methods (such as getting the mean position of a track etc.) should reside in a different class. For example you could store a TRestTrack in a TRestEvent but when you perform a TRestEvent::GetTrack call it could return a TRestAnalysisTrack so that you can transparently perform analysis on the terminal?
Store information using TString instead of numerical IDs. Since the tree is going to be compressed the size increase of the file is not significant and it helps with readability and avoids possible mistakes when translating the IDs.

Perhaps @jgalan wants to join the discussion.

jgalan commented 2 years ago

Any improvement on this sense is really welcome. This big change will probably break forward/backward compatibility. So we could move towards a new 2.4.X release in that case.

But who knows, perhaps ROOT schema evolution works as expected, to be checked.

I am just afraid of the work load (gitlab pipelines should be kept running and alive). If you want to invest on that part of the code then you could give a try.

Probably @nkx has few insights to decide if moving forwards is a wise decision.

juanangp commented 2 years ago

Hi @lobis,

I don't understand when you said that ROOT doesn't like to "put a vector inside a vector", this is what we are actually doing e.g. in TRestTrackEvent:

class TRestTrackEvent : public TRestEvent {
    std::vector<TRestTrack> fTrack;  // Collection of tracks that define the event

Where TRestTrack:

class TRestTrack : public TObject {
    TRestVolumeHits fVolumeHits;  ///< Hit volumes that define a track

Then TRestVolumeHits inherits from TRestHit (see above) and also has further vectors:

class TRestVolumeHits : public TRestHits {
   protected:
    std::vector<Float_t> fSigmaX;

Anyhow, I agree with you that the data structures should be readable outside REST, however it shouldn't be a problem. Looking a bit into ROOT looks like is possible to split a std::vector into subbranches, that might make the data readable without a dictionary: https://root.cern/manual/trees/

On the other hand, I agree with your other suggestions.

In summary, I think we can give a try although I am afraid we will loose backward compatibility of the data.

lobis commented 2 years ago

Hi @lobis,

I don't understand when you said that ROOT doesn't like to "put a vector inside a vector", this is what we are actually doing e.g. in TRestTrackEvent:

This is the post where I asked this question: https://root-forum.cern.ch/t/storing-nested-structure-event-track-step-into-tree/45660/11

I have worded my reply poorly, what I meant was that if we use something like:

class Event:
   vector<Track> fTracks;

class Track:
   vector<Hit> fHits;

class Hit:
   TVector3 fPosition;
   ...

It will arise some problems (or that was my experience, I don't understand why though, maybe I did something else wrong, it may be worth it to try.

I came to the conclusion that a structure like the following is the one that will work:

class Event:
   vector<Track> fTracks;

class Track:
   Hits fHits;

class Hits:
   vector<TVector3> fPosition;
   ...

but now that I see it I understand less why I had this problem, they seem equivalent. In my forum post you can see that you cannot see the complete structure from the TBrowser without loading the dictionaries, whereas with the first proposal you can. Maybe just enabling a setting on the TTree when writing (such as the split level) you can make it work... I don't know.

nkx111 commented 2 years ago

Remove analysis methods from the data structures: I think the data structure definitions should remain as simple as possible. Analysis methods (such as getting the mean position of a track etc.) should reside in a different class. For example you could store a TRestTrack in a TRestEvent but when you perform a TRestEvent::GetTrack call it could return a TRestAnalysisTrack so that you can transparently perform analysis on the terminal?

I don't mind adding simple analysis methods in the data structures. The methods to calculate things like mean position or RMS is very universal. The algorithm will never be updated. Adding these methods directly in the data structure expands its capability. For other advanced analysis like doing Hough transformation they shall indeed be placed in other classes.

The problem for me is that there are too much repeated method in TRestDetectorHits and TRestDetectorHitEvents. For example, TRestDetectorHitsEvent::GetNumberOfHitsY() is just calling TRestDetectorHitsEvent::GetNumberOfHitsY().

jgalan commented 2 years ago

One could avoid that by accepting that the user will call detectorEvent->GetHits()->GetNumberOfHitsY()?

Just removing the methods at TRestDetectorHitsEvent would force us to write the code that way.

rest-for-physics / framework

Data structures are difficult to handle #123