A standard schema for semi-structured dataset specification formats

Many analyses and frameworks offer the possibility of using a semi-structured format to represent both the full dataset (in terms of the files that need to be processed) and some metadata attached to it. Most often, the files are split into multiple groupings, which are usually called "samples" or "datasets". A few examples of this pattern are:

This helps a lot in keeping the configuration of the analysis dataset tidy and short, so that it can also be more easily shared to others and compared between different executions.

But this also calls for a discussion on providing a standard schema that all frameworks could accept and digest into their own execution workflows. For example, it is quite natural to always include a list of files in each "sample".

This new standard schema should support a few key features of the definition of a dataset:

Specify groups of several input files, each with associated metadata
Decide what term to use instead of "groups"
Cover TTree but plan for RNTuple support
Should support friend trees per group
Should support entry ranges per group
No indexed friend trees (at least for now)

One possible starting point would be (using "samples"="groups"="datasets"):

{
    "samples":{
        "sample_a":{
            "treenames": ["Events"],
            "files": ["fa*.root"],
            "friends":{
                "treenames": ["Friend"],
                "files": ["fr*.root"],
            },
            "entry_range": [0, 1000],
            "metadata": {...}
        },
        "sample_b":{
            "treenames": ["treeb_1", "treeb_2"],
            "files": ["fileb_1.root", "fileb_2.root"],
            "friends":{
                "treenames": ["Friend"],
                "files": ["fr*.root"],
            },
            "entry_range": [50000, 60000],
            "metadata": {...}
        },
        # ...
    },
    # other optional values from here on
    "optional": "value",
    # ...
}

Any ideas?

@NJManganelli , @bendavid , @swertz , @nsmith- , @lgray , @hageboeck , @eguiraud , @etejedor , @valsdav , @alexander-held

Any ideas?

Looks doable. I didn't notice anything that's missing on 1st read.

I was only wondering how you do this:

        "sample_a":{
            "treenames": ["Events", "ADifferentName"],
            "files": ["fa*.root", "TheWeirdFileWithTheDifferentName"],
            # ...
        },

What would you expect to happen in this case?

Any ideas?

Looks doable. I didn't notice anything that's missing on 1st read.

I was only wondering how you do this:
        "sample_a":{
            "treenames": ["Events", "ADifferentName"],
            "files": ["fa*.root", "TheWeirdFileWithTheDifferentName"],
            # ...
        },
What would you expect to happen in this case?

As of the current implementation - map the first file glob to the first tree name (so every tree in "fa*.root" is expected to be called "Events"). And then, the next file(glob) has tree(s) with the name "ADifferentName".

This however, requires the user to be careful to do the matching between treename and glob. Alternative, that I recall from the PPP discussion was to instead specify tree and file together - I imagine ["fa*.root?#Events", "ADifferentName?#TheWeirdFileWithTheDifferentName"].

Is this what you are asking, @hageboeck ?

Hi - this is an excellent start but I'd like to offer a few considerations given that not everyone uses root files these days. Specifically, friend trees are not a widely accepted concept outside of TTrees and RNtuple. It is, however, easy to abstract away from this so that this metadata specification is more universal.

In coffea, we didn't go all the way to formalizing it in a schema since there are many details but you can see the basic validation we do here: https://github.com/CoffeaTeam/coffea/blob/master/coffea/processor/executor.py#L1353 We allow forms like:

fileset = { "dataset" : ["some", "list", "of", "files"], ...}
# and
fileset = { "dataset": { "files": ["file1", "file2", ...], "treename": "Events", "metadata": { "stuff": "about stuff"}, ...}, { .... } }

depending on user need. I think optional and union types are very convenient here since not every user will need or want to specify all components of the full schema on each use (but uniformizing the description when needed is very important).

To repeat from above here is the suggested metadata requirements (with annotations). """ This new standard schema should support a few key features of the definition of a dataset:

Specify groups of several input files, each with associated metadata
Decide what term to use instead of "groups" (dataset is probably best)
Cover TTree but plan for RNTuple support (metadata should not care about file formats, your program does)
Should support friend trees per group *** (see below)
Should support entry ranges per group (why? this makes little operational sense from the POV of an analyst)
No indexed friend trees (at least for now) *** (also) """ This specification of requirements mixes things that are describing the data (metadata) and and how to react to it (something in a program, not part of the schema). I believe a clean factorization along these lines is very important to creating a well-adopted schema if you intend to develop a standard.

As to files - it is not very common but people do use parquet or hdf5 in analysis. Removing those formats as concepts to describe a dataset is rather limiting. Allowing them opens up many possibilities, including mixed modes and joins across rather heterogenous datasets. This can make things much easier in the case that, for instance, some random machine learning tool cannot output root files but can produce some other usefully structured data format.

- The more general term for this is a a join, and I think you should use that concept here rather than the precise concept of friends in TTree/RNtuple, which is limiting in scope for what is possible for dataset augmentation. Moreover, this allows the definition of left/right/inner/outer joins at the metadata level which is extremely useful for understanding how that additional data is intended to be used (are you just augmenting the number of columns in the dataset, are you x-referencing two datasets, etc.). It is then up to the system ingesting this data to implement the join specified by the user correctly (which can be tested for). For the second reference, this is also a restriction of your program, not the metadata, an error should be thrown by whatever is executing and cannot handle a case rather than restricting concepts for describing a dataset.

Furthermore, going to joins as a metadata concept allows the user to specify an entire dataset for a join rather than individual files, resulting in significant reduction of doubly-bookkept data.

Entry ranges: I don't think this is very useful data to record. This is either kept track of as a good-luminosity block list, or specified at execution time by the user (since it is often the case they will want to run over a limited piece of the data to test things and then run over the full dataset). Re-writing the metadata on each run would get cumbersome quickly.

To take all this and mutate your original suggestion (I haven't defined all the types but hopefully it's intelligible):

{
    "datasets":{
        "dataset":{
            "treenames": Union[List[String], String],
            "files": List[String],
            "friends":{
                "treenames": Union[List[String], String],
                "files": List[String],
                "joinType": OneOf["inner", "outer", "left", "right", "cross"], #this should just be made into a type
            } or List[Dict[As in Single Dict]],
            "metadata": Optional[Dict[JsonSerialiableAny:JsonSerializableAny]]
        },
        # ...
    },
    # other optional values from here on
    "metadata": Optional[Dict[JsonSerialiableAny:JsonSerialiableAny]]
}

Dear @hageboeck,

That specification would translate to

TChain c;
c.Add("fa*.root?#Events");
c.Add("TheWeirdFileWithTheDifferentName.root?#ADifferentName");

For that particular "dataset/sample" and "sample_a" would just be retrievable as part of the metadata during the event loop.

RDataFrame just processes the events in that TChain, currently you can distinguish the different file by calling a DefinePerSample and checking whether you are processing entries from that particular file and act accordingly. Probably it is not the most common case, but we have seen it happen and since it is supported by TChain, RDF needs to support it as well.

I wonder if, by creating the possibility of adding different "datasets/samples", we are now practically removing the need for this case, or if it still holds. If not, we could think about allowing only a single "dataset name" (i.e. treename) per "dataset", although I'm not sure that this would be generic enough.

Dear @lgray,

Thanks a lot for your input! Let me try to comment on the various parts.

I think optional and union types are very convenient here

Absolutely, I agree. Thank you also for the info about the data validation in coffea. The decision on the keys that need a union type should also be part of this formalization effort.

Decide what term to use instead of "groups" (dataset is probably best)

I see there are these two schools of thought but I cannot grasp how much of the community leans towards one vs the other. Do you think we should poll the larger audience at some point, for this and probably other questions? One other option could be just accepting both "datasets" and "samples" as the top-level key in the JSON object.

Cover TTree but plan for RNTuple support (metadata should not care about file formats, your program does)

Indeed, this point was more directed towards us developers rather than user-facing. Metadata will definitely be orthogonal to the data format.

Allowing them opens up many possibilities, including mixed modes and joins across rather heterogenous datasets.

This is an interesting comment, something that I hadn't put too much thought into. I think that it is much related with the other comment regarding joins. Mixed modes sounds intriguing although I can't see a clear path for its implementation in the I/O layer, rather we may be better off doing this at the analysis tool layer directly. In general, the usecase of reading the output of some ML pipeline during the execution of the analysis is definitely something we want to address. For what concerns this specification, the easy part is just deciding for something more generic than "treenames" when specifying these other data formats; the trickier part is deciding how these other input data should be read. See the next comment for more discussion about this.

The more general term for this is a a join, and I think you should use that concept here rather than the precise concept of friends in TTree/RNtuple

Yes I agree we can describe adding more columns to the main dataset as a join, with the implicit but crucial clarification that it is a view on the join operation and not a concrete join operation that would involve copying the two operands. With this sense, a friend TTree is equivalent to a left join where both unique IDs correspond to the event index and are the same number. I completely agree that this is a limit, indexed friends only extend it a little bit by allowing different sets of event indexes in the two tables. The idea of an heterogeneous dataset layout, with some datasets/samples having to be left-joined and others having to be inner-joined (for example), involves some design work and I would like to discuss it further, although I'd like to get a better idea of the use cases that need any joins other than left join.

Entry ranges: I don't think this is very useful data to record. This is either kept track of as a good-luminosity block list, or specified at execution time by the user

During the meeting a few weeks ago there was quite a large consensus on this information being useful when written at the datasets/samples level. I also agree with you that an entry range is usually specified when testing before running the full thing. Nonetheless, the important part of this feature was the ability to tie a specific entry range to a specific dataset/sample and not to the global dataset, so that even when testing at least N entries from each dataset/sample would be processed.

If this is not specified when definining the dataset metadata, then I suppose we should expose some API like:

entry_ranges = [(0,1000), (50000, 60000)] # taken from my initial specification example
df = RDataFrame(...)
df.SetEntryRanges(entry_ranges)

And similarly for coffea and other frameworks. What I don't like about this is that I need to remember how many datasets/samples I have in my specification so that len(entry_ranges) corresponds to that number. Of course the tool can error and say "You have specified too many entry ranges, please use only N", maybe that's good enough but I'm not sure. One other comment could be that we don't need to have exactly one entry range per dataset/sample, maybe some datasets just need to be processed fully. But the API from above would not be able to distinguish whether the user actually didn't want to provide an entry range for a certain dataset or just forgot how many datasets were there. How would you address this part?

To take all this and mutate your original suggestion

Thanks for taking the time to include this example. I am happy that you agree on having a single top-level key "datasets" in which all the various datasets can be defined. I think this opens possibilities to use the rest of the JSON file for describing more parts of the analysis while not touching the dataset specification.

I just wanted to ask you a clarification regarding the type List[Dict[As in Single Dict]] mentioned in the "friends" key. This is practically saying that instead of a single dictionary with those keys (treenames, files, joinType), there could be a list of dictionaries with the same keys, right? This is the implementation of your comment from above mentioning that users could specify an entire dataset with join relationships. I guess in your example it would mean that some files could be left-joined, some other files could be inner-joined, keeping always the same set of files as the "main dataset" for that particular "dataset/sample". Let me know if I got it right.

Dear @hageboeck,

That specification would translate to
TChain c;
c.Add("Events", "fa*.root");
c.Add("ADifferentName", "TheWeirdFileWithTheDifferentName");
For that particular "dataset/sample" and "sample_a" would just be retrievable as part of the metadata during the event loop.

RDataFrame just processes the events in that TChain, currently you can distinguish the different file by calling a DefinePerSample and checking whether you are processing entries from that particular file and act accordingly. Probably it is not the most common case, but we have seen it happen and since it is supported by TChain, RDF needs to support it as well.

I wonder if, by creating the possibility of adding different "datasets/samples", we are now practically removing the need for this case, or if it still holds. If not, we could think about allowing only a single "dataset name" (i.e. treename) per "dataset", although I'm not sure that this would be generic enough.

This is what I would have expected. I don't think that you need to do anything in addition. Just note that it's c.Add("path/to/file#treename"); not two arguments.

not two arguments

Indeed, thanks for checking. I modified with the current suggested syntax from TChain::Add.

Hey - will get back to this tomorrow. I've been at a workshop.

root-project / root

A standard schema for semi-structured dataset specification formats #11624