scikit-hep / histbook

Versatile, high-performance histogram toolkit for Numpy.
BSD 3-Clause "New" or "Revised" License

Export histogram metadata #16

Closed: imandr closed this issue 6 years ago

imandr commented 6 years ago

It would be useful to have a way to dump a histogram's (and book's?) configuration to, say, JSON, and to be able to recreate the empty histogram from it.

jpivarski commented 6 years ago

Okay, will do.

jpivarski commented 6 years ago

I'm working with @lukasheinrich to make histbook work with pyhf (analysis-quality fitting). The JSON format will be closely related to the pyhf JSON format.

But looking at the spec right now, I see that it doesn't say anything about how the histograms are binned, so I'll have to invent that part. From the fitter's point of view, it doesn't matter what the histogram low and high edges are (for example) because it just uses each bin content and error as an independent observable. It needs to know how many bins there are, though I don't see that in the JSON spec. (Perhaps that comes from following the "name" to a ROOT histogram and looking at its number of bins.)

So the histogram JSON spec will go beyond the fitting JSON spec, just as the fitting JSON spec goes beyond the histogram one (with information about how to normalize several histograms in a fit, etc.). Both JSON specs will therefore have optional members for compatibility, for instance to make a round trip from a fitting spec to a set of histograms and back. I wonder if JSONSchema allows us to define a union schema from which histbook_spec.json and pyhf_spec.json are derived?

lukasheinrich commented 6 years ago

Yes, I think the view of a union schema is correct. For the fitting we really just need an array of numbers; everything else is optional metadata. In the current schema, the data is nested within all the other data. As we discussed offline, the schema could be adapted to flatten the structure: all histograms are stored in some top-level structure (like a serialized Book), and we use a referencing mechanism (like JSONPointer, or some other higher-level references resolved using custom software) to refer to these histograms in the nested fitting spec:

data:
  ... serialized book
channels:
- name: channel1
  samples:
  - name: sample1
    data: {$ref: /data/some/pointer/into/the/book/to/get/sample1/array}

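A minimal sketch of how such references could be resolved, assuming plain JSON Pointer syntax (RFC 6901) and a `{"$ref": ...}` convention like the one above. The function names and the layout under `data` are hypothetical, invented for illustration; neither histbook nor pyhf is claimed to work this way.

```python
import json


def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer such as '/data/sample1/content'."""
    if pointer == "":
        return document
    if not pointer.startswith("/"):
        raise ValueError("a JSON Pointer must be empty or start with '/'")
    node = document
    for token in pointer[1:].split("/"):
        # RFC 6901 escapes: '~1' means '/', '~0' means '~' (decode in this order).
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(node, list):
            node = node[int(token)]
        else:
            node = node[token]
    return node


def expand_refs(node, root):
    """Return a copy of node with every {"$ref": pointer} replaced by its target."""
    if isinstance(node, dict):
        if set(node) == {"$ref"}:
            return resolve_pointer(root, node["$ref"])
        return {key: expand_refs(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [expand_refs(item, root) for item in node]
    return node


# Hypothetical document: the keys under "data" are assumptions for this example.
spec = {
    "data": {"sample1": {"content": [1.0, 4.0, 2.0]}},
    "channels": [
        {
            "name": "channel1",
            "samples": [
                {"name": "sample1", "data": {"$ref": "/data/sample1/content"}}
            ],
        }
    ],
}

resolved = expand_refs(spec["channels"], spec)
print(json.dumps(resolved, indent=2))
```

With this layout the fitting part of the document stays small, and the histogram arrays live in one place that both tools can read.
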
jpivarski commented 6 years ago

I need to learn JSON Schema. For now, I'm going to do it the way I've always done it, with manual .tojson and .fromjson methods on each serialized class. That will be a first draft we can use to design the union schema.
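A rough sketch of what the manual approach could look like, assuming each class returns plain dicts and lists from .tojson and rebuilds itself in .fromjson. The `Axis` and `Histogram` classes and their field names are hypothetical stand-ins, not histbook's actual classes or JSON layout.

```python
import json


class Axis:
    """Hypothetical stand-in for a regularly binned axis."""

    def __init__(self, expr, numbins, low, high):
        self.expr = expr
        self.numbins = numbins
        self.low = low
        self.high = high

    def tojson(self):
        # Only JSON-compatible types: str, int, float, dict, list, None.
        return {"type": "bin", "expr": self.expr, "numbins": self.numbins,
                "low": self.low, "high": self.high}

    @classmethod
    def fromjson(cls, obj):
        return cls(obj["expr"], obj["numbins"], obj["low"], obj["high"])


class Histogram:
    """Hypothetical stand-in for a histogram: axes plus optional contents."""

    def __init__(self, axes, content=None):
        self.axes = list(axes)
        self.content = content  # None means "booked but never filled"

    def tojson(self):
        return {"axes": [axis.tojson() for axis in self.axes],
                "content": self.content}

    @classmethod
    def fromjson(cls, obj):
        axes = [Axis.fromjson(a) for a in obj["axes"]]
        return cls(axes, obj.get("content"))


# Round trip through a JSON string.
h = Histogram([Axis("x", 10, 0.0, 1.0)])
text = json.dumps(h.tojson())
h2 = Histogram.fromjson(json.loads(text))
```

Because each class owns its own pair of methods, a container only has to delegate to its members, which is what makes a later formal schema straightforward to check against.
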

@imandr: I'll have the JSON serialization done sooner, but it won't be a stable schema. We'll need to make changes. Hence, the early versions can be used for RPC but not for archiving.

lukasheinrich commented 6 years ago

JSON schemas do not do JSON-to-object mapping (for that, one uses Object-Document Mappers like marshmallow, https://github.com/marshmallow-code/marshmallow); they just validate a JSON document against a schema. So I think the .tojson and .fromjson methods need to be present anyway (unless you want to use ODMs).

jpivarski commented 6 years ago

Okay, then I'll implement it the old-fashioned way (no marshmallow) and we can later define it formally in a schema. Then, at least, the implementation can be checked.

jpivarski commented 6 years ago

This commit (on the "feature-booking-for-pyhf" branch) allows Hist (tested) and Book (untested) objects to be serialized to and from JSON. The original request was just asking for histogram metadata (everything except _content), but you can just call h.cleared() to get a copy of the histogram without contents and serialize that.

The JSON format sacrifices performance for human-readability. The bin contents are a big block of lists-of-lists-of-lists-of-lists-of-lists...
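To illustrate the lists-of-lists idea, here is a small sketch that turns a flat, row-major block of bin contents into nested lists (one nesting level per axis) and back. The helper names and the two-axis shape are assumptions for the example; histbook's real layout may differ, for instance in how overflow bins are stored.

```python
import json


def nest(flat, shape):
    """Reshape a flat list into nested lists with the given shape (row-major)."""
    step = 1
    for n in shape[1:]:
        step *= n  # number of values under each entry of the first axis
    if len(flat) != shape[0] * step:
        raise ValueError("shape does not match the number of values")
    if len(shape) == 1:
        return list(flat)
    return [nest(flat[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]


def flatten(nested):
    """Inverse of nest: walk the nested lists and collect the leaves in order."""
    if not isinstance(nested, list):
        return [nested]
    out = []
    for item in nested:
        out.extend(flatten(item))
    return out


# Assumed example: two axes with 2 and 3 bins, so 6 bin contents in total.
flat = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
nested = nest(flat, (2, 3))
text = json.dumps(nested)
print(text)
restored = flatten(json.loads(text))
```

The nested form is easy to read and to index by hand, at the cost of being far bulkier and slower to parse than a binary array.
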