root-project / root

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
https://root.cern

Have rdfentry_ represent the global entry number in the TChain even in MT runs #12190

Open eguiraud opened 1 year ago

eguiraud commented 1 year ago

Currently RDF's special column rdfentry_, despite the name, does not correspond to the global TChain entry number in MT runs (see also the relevant docs).

This is surprising for users (hence the big warning in the docs linked above) and makes it unnecessarily difficult to e.g. attach a numpy array as an additional column (because it's hard to index into it correctly without stable global row numbers).

We could instead make rdfentry_ always match the "real" (global) entry number in the dataset -- if only each MT task knew the offset of the current tree w.r.t. all other trees in the chain.
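The offset idea above amounts to an exclusive prefix sum over the per-tree entry counts. A minimal sketch (plain Python, not ROOT API; the entry counts are made up for illustration):

```python
from itertools import accumulate

def tree_offsets(entries_per_tree):
    """Global offset of each tree = total entries of all preceding trees
    (an exclusive prefix sum over the per-tree entry counts)."""
    return [0] + list(accumulate(entries_per_tree))[:-1]

# Hypothetical chain of three trees with 100, 250 and 50 entries:
offsets = tree_offsets([100, 250, 50])
# The global entry number for local entry i of tree t is offsets[t] + i,
# e.g. local entry 10 of tree 2 is global entry 350 + 10 = 360.
```

The difficulty discussed below is not the arithmetic but obtaining `entries_per_tree` without every MT task having to open every preceding file.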

Proposed solution

Other solution considered

pcanal commented 1 year ago

I suppose the proposed solution can work, but it requires a synchronization step of sorts. The thread handling file number n must wait until files [0, n-1] have been opened before starting to process entries; the file opening itself can happen (somewhat) in parallel, but if an arbitrary file inside the chain is much smaller than the others, the thread processing it will have to wait a bit.

eguiraud commented 1 year ago

In the proposed solution if a task needs an entry number that's not there yet it goes and retrieves it itself (subsequent tasks that need that entry number will then find it in the shared list).
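The "shared list" idea can be sketched as a lock-protected cache of per-tree entry counts: the first task that needs a count retrieves it itself, later tasks find it cached. This is an illustrative sketch only (class and function names are hypothetical, and `retrieve` stands in for opening the file and reading the tree's entry count):

```python
import threading

class EntryCountCache:
    """Hypothetical shared cache of per-tree entry counts.

    A task that needs the count for tree i retrieves it itself if no
    task has done so yet; subsequent tasks find the cached value."""

    def __init__(self, retrieve):
        self._retrieve = retrieve  # stand-in for opening the file and reading entries
        self._counts = {}
        self._lock = threading.Lock()

    def count(self, tree_index):
        with self._lock:
            if tree_index not in self._counts:
                self._counts[tree_index] = self._retrieve(tree_index)
            return self._counts[tree_index]

    def offset(self, tree_index):
        # Offset of tree N = sum of the counts of trees 0..N-1.
        return sum(self.count(i) for i in range(tree_index))
```

Note the sketch holds the lock while retrieving, serializing the file opens; a real implementation would want per-tree futures so distinct files can be opened concurrently.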

pcanal commented 1 year ago

If you are not careful, this might lead to the case where, at startup with n cores/n tasks, we might issue n(n+1)/2 file opens (e.g., at the very least, the first file being requested to be opened n times).
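The n(n+1)/2 count follows because, without any coordination, task i has to open files 0..i (i+1 opens) to compute its own offset:

```python
def duplicate_opens(n):
    """Total file opens if each of n tasks naively opens files 0..i itself.

    Task i performs i+1 opens, so the total is 1 + 2 + ... + n = n(n+1)/2."""
    return sum(i + 1 for i in range(n))

# e.g. 4 tasks issue 10 opens in total, and file 0 alone is opened 4 times.
```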

pcanal commented 1 year ago

we could do nothing: rdfentry_ would be unstable and it could not be relied upon to e.g. index into manually added "friend columns"

Indeed, the global number is needed to load the proper friend. For example, we could have a friend which is a chain whose files have different lengths (numbers of entries) than the files in the main chain, but the same total length; consequently, a single file in the main chain may have to use/open 2 or more files from the friend chain.

I.e. we would also need to keep a running total for the friends.
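With such a running total for the friend chain, mapping a global entry number to the right friend file becomes a binary search over the cumulative counts. A sketch under the example above (function name and file layout are hypothetical):

```python
import bisect

def locate_in_friend(global_entry, friend_cumulative):
    """Map a global entry number to (friend_file_index, local_entry).

    friend_cumulative[k] is the running total of entries in friend
    files 0..k, e.g. [200, 400, 600] for three 200-entry files."""
    file_idx = bisect.bisect_right(friend_cumulative, global_entry)
    prev = friend_cumulative[file_idx - 1] if file_idx > 0 else 0
    return file_idx, global_entry - prev

# Main chain: two 300-entry files; friend chain: three 200-entry files
# (same 600-entry total). Global entries 0..299 of main file 0 then
# span friend files 0 and 1:
cum = [200, 400, 600]
```

For instance, global entry 250 lands in friend file 1 at local entry 50, even though it sits in the first file of the main chain.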

Axel-Naumann commented 1 year ago

Sounds good! What might help - especially during startup, where I agree with @pcanal that things can get a bit wild - is to tell the workers: "report your tree's entries back, and I will then - at some point in the near future - let you know your global offset, once I know it". I.e. the sync step @pcanal was referring to. It makes sense to parallelize that!

eguiraud commented 1 year ago

If you are not careful, this might lead to the case where, at startup with n cores/n tasks, we might issue n(n+1)/2 file opens (e.g., at the very least, the first file being requested to be opened n times).

Mmmh that's right...we'll have to be careful.

the global number is needed to load the proper friend

What I'm saying only applies when there are no friends. In case there are friend trees (or a TEntryList) currently TTreeProcessorMT opens all files once at the beginning to recover all tree entry numbers, and each task builds the full chain (and then processes a certain range of global entry numbers).

eguiraud commented 1 year ago

report your tree's entries back, and I will then - at some point in the near future - let you know your global offset

Ah, good idea, this avoids the problem Philippe mentioned above with all tasks trying to recover the number of entries of the first file at the same time.

vepadulano commented 1 year ago

I think the proposed idea (+ the discussion so far) makes sense, I have nothing to add on the spot. If in the end it still proves to have a tangible startup cost, we could also think about doing it only if the rdfentry_ column is actually requested in the application

pcanal commented 1 year ago

We could also think about doing it only if the rdfentry_ column is actually requested in the application

+1

eguiraud commented 1 year ago

report your tree's entries back, and I will then - at some point in the near future - let you know your global offset

@Axel-Naumann on second thought this is quite complicated... at the point where one task (which might e.g. be processing tree #4) needs to know the number of entries in trees #1, #2 and #3, there is no guarantee that the corresponding tasks are even running.

doing it only if the rdfentry_ column is actually requested in the application

I don't think that at the moment RDF has any logic that "reflects" at a global level on which columns are used, but I guess we could add something ad-hoc.

Axel-Naumann commented 1 year ago

Good point.

I just dislike the submission computer having to open all files: it's potentially saturating the bandwidth, not as parallel as it could be (given there might be a whole cluster waiting), and the storage might be optimized for the cluster more than for the submission computer (a laptop?)

Can we - as a first task - submit to the workers the opening of all files / reporting of their entries? Only once they have reported back would the main task then start, possibly only if it uses friends or rdfentry.

eguiraud commented 1 year ago

I just dislike the submission computer having to open all files

Distributed RDF employs yet another strategy that doesn't require opening all files before submitting data processing tasks but potentially produces more unbalanced tasks or some empty tasks (it's unclear to me if that has any visible performance impact or not).

For local, multi-thread RDF, I guess we can either do the same that distributed RDF does or come up with a way to schedule this graph of tasks efficiently (with data processing task #N depending on ttree-entry-retrieval tasks #1, #2, ..., #N), e.g. via TBB task graphs.
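The dependency structure described here (processing task #N depending on entry-retrieval tasks #1..#N-1) can be sketched with a plain thread pool, without TBB; everything below is illustrative, with `entry_counts_by_file[i]()` standing in for opening file i and reading its tree's entry count:

```python
from concurrent.futures import ThreadPoolExecutor

def run_chain(entry_counts_by_file, process):
    """Sketch: retrieval tasks run fully in parallel; processing task N
    only waits on the retrievals for files 0..N-1 to learn its offset."""
    with ThreadPoolExecutor() as pool:
        # Submit all entry-count retrievals up front (parallel opens).
        count_futs = [pool.submit(f) for f in entry_counts_by_file]
        results = []
        for n in range(len(entry_counts_by_file)):
            # Offset of file n = sum of entries of files 0..n-1.
            offset = sum(count_futs[i].result() for i in range(n))
            results.append(pool.submit(process, n, offset))
        return [r.result() for r in results]
```

Each file is opened exactly once here, avoiding the n(n+1)/2 duplicate opens discussed earlier, at the cost of the coordinator collecting the counts.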

eguiraud commented 1 year ago

After further discussion with @Axel-Naumann and @vepadulano we converged on the following strategy:

Now we have enough information to tell each task what the global offset of its TTree is.

We could also automatically generate a "patched up" version of the dataset spec after this first run (one that contains all TTree entry numbers) and suggest to users that they switch to that one.
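The "patched up" spec idea amounts to persisting the per-file entry counts gathered during the first run, so later runs can compute global offsets without reopening any file. A sketch with an invented spec shape (this is not the actual RDatasetSpec format):

```python
def patch_spec(spec, entry_counts):
    """Hypothetical 'patched up' dataset spec: record each file's entry
    count so later runs can compute global offsets without reopening
    the files. The dict layout here is made up for illustration."""
    patched = dict(spec)  # shallow copy; the input spec is left untouched
    patched["files"] = [
        {"name": name, "entries": n}
        for name, n in zip(spec["files"], entry_counts)
    ]
    return patched
```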