scikit-hep / histbook

Versatile, high-performance histogram toolkit for Numpy.
BSD 3-Clause "New" or "Revised" License
109 stars 9 forks source link

Need a way to project a booking tree to a particular sample #20

Closed jpivarski closed 6 years ago

jpivarski commented 6 years ago

... so that you can project to just the histograms that should be filled by a particular Monte Carlo sample and call .fill(mc) on only that projection. Lukas suggested the name "view", which I like for its resonance with Numpy.

lukasheinrich commented 6 years ago

there could be a nice fluent interface that goes

book = Book(..)
book.view(...).fill(...)

the arguments of .view() (or whatever name we end up using) would determine the projection

it would be nice if the arguments could support some kind of grammar that allows easy composition (e.g. using ORs to combine two sets of histograms)

jpivarski commented 6 years ago

I was thinking that each sample would have a string name and the argument to view would either be a single string or a list of strings to get a union (and therefore fill all selected). The classes would be structured such that you can't put named samples within named samples, and therefore the sample names are exclusive. With that constraint, a string or list of strings would be a completely sufficient grammar.

How about this: there are fillable Books and non-fillable Books. A Book that contains samples (as in "data", "signal", "Drell-Yan", ...) is non-fillable because filling all of these from a single dataset is almost always a mistake. Calling .view(sample_or_samples) on such a Book creates a fillable version of that Book.

Here's what the class hierarchy could look like:

# histograms for dimuon mass in data
# SystematicsBook ensures that all systematics have the same number of dimensions
dimuon_mass_data_central = Hist(mass_fcn_central, systematic=[0])  # central value
dimuon_mass_data_up = Hist(mass_fcn_up, systematic=[1])  # one sigma up
dimuon_mass_data_down = Hist(mass_fcn_down, systematic=[-1])  # one sigma down
dimuon_mass_data = SystematicsBook(dimuon_mass_data_central,
                                   dimuon_mass_data_up,
                                   dimuon_mass_data_down)

# histograms for dimuon mass in signal MC
dimuon_mass_signal_central = Hist(mass_fcn_central, systematic=[0])  # central value
dimuon_mass_signal_up = Hist(mass_fcn_up, systematic=[1])  # one sigma up
dimuon_mass_signal_down = Hist(mass_fcn_down, systematic=[-1])  # one sigma down
dimuon_mass_signal = SystematicsBook(dimuon_mass_signal_central,
                                     dimuon_mass_signal_up,
                                     dimuon_mass_signal_down)

# two histograms without systematics for muon truth in signal MC only
# could be a single histogram or book; just a bag of independent bins to pyhf
muon_truth_signal = Book(par1=Hist(muon_truth_par1_fcn),
                         par2=Hist(muon_truth_par2_fcn))

# define each channel as a SamplesBook
dimuon_mass = SamplesBook(data=dimuon_mass_data, signal=dimuon_mass_signal)
mass_truth = SamplesBook(signal=muon_truth_signal)

# define everything as one big ChannelsBook
all_channels = ChannelsBook(mass=dimuon_mass, truth=muon_truth)

# view and fill using the fluent interface you suggested
# these could be sent to different jobs (one for data, one for signal) and safely added
# the bookkeeping is simplified by having only one object to duplicate in each job
# also, the unified `fill` method lets me maximize data reuse across channels
all_channels.view("data").fill(data_df)
all_channels.view("signal").fill(signal_df)

# plot a channel
all_channels.stack("mass").to(vega.VegaLite)
all_channels.stack("truth").to(vega.VegaLite)

ChannelsBook and SamplesBook are non-fillable but calling view on them produces fillable versions of them. SystematicsBooks, generic Books, and Hists are fillable (with no view method) as Book and Hist are today.

Looking at this set-up, there's so much regularity here that maybe the constructors should do all the copy-pasting, rather than asserting that the copy-pasting was correct.

all_channels = \
    ChannelsBook(mass=SamplesBook(["data", "signal"],
                                  SystematicsBook(Hist(mass_fcn_central, systematic=[0]),
                                                  Hist(mass_fcn_up, systematic=[1]),
                                                  Hist(mass_fcn_down, systematic=[-1]))),
                 truth=SamplesBook(["signal"],
                                   Book(par1=Hist(muon_truth_par1_fcn),
                                        par2=Hist(muon_truth_par2_fcn))))

This allows for not all samples to show up in a given channel ("mass" has "data" and "signal" but "truth" has only "signal") and systematics variation is optional. Everything above can be generated from this but with less risk of copy-paste error.

Perhaps these short constructors should be the default ones and the long-hand versions should be classmethod constructors.

jpivarski commented 6 years ago

We can make semantic histogram books like this:

>>> everything = ChannelsBook(
    mass = SamplesBook(["data", "signal", "background"],
                  SystematicsBook(Hist(bin("x", 5, 0, 5), systematic=[0]),
                                  Hist(bin("x + epsilon", 5, 0, 5), systematic=[1]),
                                  Hist(bin("x - epsilon", 5, 0, 5), systematic=[-1]))),
    truth = SamplesBook(["signal", "background"],
                        Book(par1=Hist(bin("par1", 5, 0, 5)),
                             par2=Hist(bin("par2", 5, 0, 5)))))

The SystematicsBook ensures that all contents have the same binning (though expressions can differ, of course!), and the SamplesBook duplicates the histograms it is given. ChannelsBook is a regular, non-fillable book. Here's what they look like when printed (__str__, not __repr__):

>>> print(everything)
ChannelsBook({
      'mass': SamplesBook({
            'data': Book({
                  '0': SystematicsBook({
                        '0': Hist(bin('x', 5, 0.0, 5.0)),
                        '1': Hist(bin('x + epsilon', 5, 0.0, 5.0)),
                        '2': Hist(bin('x - epsilon', 5, 0.0, 5.0))
                        })
                  }),
            'signal': Book({
                  '0': SystematicsBook({
                        '0': Hist(bin('x', 5, 0.0, 5.0)),
                        '1': Hist(bin('x + epsilon', 5, 0.0, 5.0)),
                        '2': Hist(bin('x - epsilon', 5, 0.0, 5.0))
                        })
                  }),
            'background': Book({
                  '0': SystematicsBook({
                        '0': Hist(bin('x', 5, 0.0, 5.0)),
                        '1': Hist(bin('x + epsilon', 5, 0.0, 5.0)),
                        '2': Hist(bin('x - epsilon', 5, 0.0, 5.0))
                        })
                  })
            }),
      'truth': SamplesBook({
            'signal': Book({
                  '0': Book({
                        'par2': Hist(bin('par2', 5, 0.0, 5.0)),
                        'par1': Hist(bin('par1', 5, 0.0, 5.0))
                        })
                  }),
            'background': Book({
                  '0': Book({
                        'par2': Hist(bin('par2', 5, 0.0, 5.0)),
                        'par1': Hist(bin('par1', 5, 0.0, 5.0))
                        })
                  })
            })
      })

The whole thing is accessible by path-like keys:

>>> print(everything.allkeys())
['mass', 'mass/data', 'mass/data/0', 'mass/data/0/0', 'mass/data/0/1', 'mass/data/0/2',
 'mass/signal', 'mass/signal/0', 'mass/signal/0/0', 'mass/signal/0/1', 'mass/signal/0/2',
 'mass/background', 'mass/background/0', 'mass/background/0/0',
 'mass/background/0/1', 'mass/background/0/2', 'truth', 'truth/signal', 'truth/signal/0',
 'truth/signal/0/par2', 'truth/signal/0/par1', 'truth/background', 'truth/background/0',
 'truth/background/0/par2', 'truth/background/0/par1']

Generally speaking, an object like this can't be filled by a single data sample: the inputs probably come from qualitatively different files. To allow for selective filling, we can take views using wildcard patterns. For example,

>>> print(everything.view("*/signal/*"))   # select the signal sample
ViewBook({
      'mass': ViewBook({
            'signal': ViewBook({
                  '0': ViewBook({
                        '0': Hist(bin('x', 5, 0.0, 5.0)),
                        '1': Hist(bin('x + epsilon', 5, 0.0, 5.0)),
                        '2': Hist(bin('x - epsilon', 5, 0.0, 5.0))
                        })
                  })
            }),
      'truth': ViewBook({
            'signal': ViewBook({
                  '0': ViewBook({
                        'par2': Hist(bin('par2', 5, 0.0, 5.0)),
                        'par1': Hist(bin('par1', 5, 0.0, 5.0))
                        })
                  })
            })
      })

and

>>> print(everything.view("mass/*/*"))   # select the mass channel
ViewBook({
      'mass': ViewBook({
            'data': ViewBook({
                  '0': ViewBook({
                        '0': Hist(bin('x', 5, 0.0, 5.0)),
                        '1': Hist(bin('x + epsilon', 5, 0.0, 5.0)),
                        '2': Hist(bin('x - epsilon', 5, 0.0, 5.0))
                        })
                  }),
            'signal': ViewBook({
                  '0': ViewBook({
                        '0': Hist(bin('x', 5, 0.0, 5.0)),
                        '1': Hist(bin('x + epsilon', 5, 0.0, 5.0)),
                        '2': Hist(bin('x - epsilon', 5, 0.0, 5.0))
                        })
                  }),
            'background': ViewBook({
                  '0': ViewBook({
                        '0': Hist(bin('x', 5, 0.0, 5.0)),
                        '1': Hist(bin('x + epsilon', 5, 0.0, 5.0)),
                        '2': Hist(bin('x - epsilon', 5, 0.0, 5.0))
                        })
                  })
            })
      })

Different views have different sets of fields that have to be supplied when we fill it.

>>> print(everything.view("*/data/*").fields)
['epsilon', 'x']
>>> print(everything.view("*/signal/*").fields)
['epsilon', 'par1', 'par2', 'x']

Assuming you have the data for a particular view, you can fill it.

>>> everything.view("*/data/*").fill(
        x=numpy.random.uniform(0, 5, 100000),
        epsilon=numpy.random.normal(0, 0.01, 100000))

Only what we've asked to be filled is filled:

>>> everything["mass/data/0/0"].pandas()   # we filled all the data histograms
             count()  err(count())
x                                 
[-inf, 0.0)      0.0      0.000000
[0.0, 1.0)   20139.0    141.911945
[1.0, 2.0)   20172.0    142.028166
[2.0, 3.0)   20043.0    141.573303
[3.0, 4.0)   19779.0    140.637833
[4.0, 5.0)   19867.0    140.950346
[5.0, inf)       0.0      0.000000
{NaN}            0.0      0.000000

>>> everything["mass/data/0/1"].pandas()   # ...all the data histograms
             count()  err(count())
x + epsilon                       
[-inf, 0.0)     84.0      9.165151
[0.0, 1.0)   20063.0    141.643920
[1.0, 2.0)   20137.0    141.904898
[2.0, 3.0)   20049.0    141.594491
[3.0, 4.0)   19811.0    140.751554
[4.0, 5.0)   19786.0    140.662717
[5.0, inf)      70.0      8.366600
{NaN}            0.0      0.000000

>>> everything["mass/signal/0/0"].pandas()     # but not the signal histograms
             count()  err(count())
x                                 
[-inf, 0.0)      0.0           0.0
[0.0, 1.0)       0.0           0.0
[1.0, 2.0)       0.0           0.0
[2.0, 3.0)       0.0           0.0
[3.0, 4.0)       0.0           0.0
[4.0, 5.0)       0.0           0.0
[5.0, inf)       0.0           0.0
{NaN}            0.0           0.0

The wildcarding is from Python's fnmatch, which is shell-like glob patterns, rather than regular expressions. That will probably be more comfortable for interactive use.

lukasheinrich commented 6 years ago

Are the paths accessible via standard getitem if so it might be nice for interoperability to make the paths JSONPointers..

jpivarski commented 6 years ago

I need to learn about JSONPointers. I'll look it up.

jpivarski commented 6 years ago

RFC, examples in Python

I don't think I'll use this. For one thing, the tree of books are not like JSON objects and arrays, but only objects. The ones that are constructed as lists get filled with keys that are integers as strings: "0", "1", "2", etc.

What I like about the unrooted filesystem syntax that I'm using now are:

A lot of JSONPointer seems to be about escaping characters, like spaces. These are in strings, so escaping isn't necessary (and it would be a big surprise to users that their spaces have to be referred to by "%20"). The only special character is /, which automatically spawns sub-books upon creation (like git filenames).

I should probably escape/unescape \/ uniformly, though. Just in case someone wants a slash that doesn't spawn sub-books.