rjones30 / HDDM

Hierarchical Document Data Model
Apache License 2.0

HDDM - efficient i/o library for self-describing structured scientific data

HDDM is a tool that automatically builds a full-featured C++ library for representing highly structured scientific data in memory, complete with a performant i/o library for integrated storage and retrieval of unlimited amounts of repetitive data with associated metadata. Starting from a structured document written in plain text, in which the user describes the data values and relationships to be expressed, the HDDM tools generate custom C++ header and source files. These files define new user classes for building an object-oriented representation of the data in memory, for storing the data in disk files in a standard format for later retrieval, and for efficient browsing and manipulation of the data using familiar OO semantics in the user's C++ or python analysis application. All of the following features that a user expects from a big-data modeling and i/o library are supported by HDDM.

In addition to meeting the above requirements, this package combines the following features in a unique way.

Applications

HDDM was designed in response to the needs of particle physics experiments producing petabytes of data per year, but nothing in the design is specific to that application. HDDM is of general utility for any application with large datasets consisting of highly structured data. In contrast to simple data, like photographic images consisting of regular arrays of floats or color vectors, structured data consist of heterogeneous values of various types (variable-length strings, variable-length lists of ntuples, lists of variable-length lists...) that are related to one another through a hierarchical graph. Data from advanced scientific instruments typically contain repeated blocks of a basic pattern of such relationships, with variations in the number of nodes connecting to each point in the graph from one block to the next.
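To make the idea of structured data concrete, the following Python sketch models one repeating block with plain dataclasses. The class and field names (`Event`, `Track`, `Hit`) are invented for illustration and are not part of HDDM, which generates its own classes from the user's description of the data. The sketch only shows heterogeneous values arranged in a hierarchy, with a different number of nodes from one block to the next.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hit:
    """Leaf node: measured values (hypothetical fields)."""
    channel: int
    energy_gev: float


@dataclass
class Track:
    """Intermediate node: owns a variable-length list of hits."""
    label: str  # variable-length string
    hits: List[Hit] = field(default_factory=list)


@dataclass
class Event:
    """Top-level repeating block: owns a variable-length list of tracks."""
    number: int
    tracks: List[Track] = field(default_factory=list)


def count_hits(event: Event) -> int:
    """Walk the hierarchy and count the leaf nodes of one block."""
    return sum(len(track.hits) for track in event.tracks)


# Two blocks that follow the same pattern but differ in their node counts.
first = Event(1, [Track("a", [Hit(3, 0.25), Hit(7, 1.5)]), Track("b", [])])
second = Event(2, [Track("c", [Hit(1, 0.75)])])
```

Both `first` and `second` share the same basic pattern, yet the number of tracks and hits differs between them, which is the kind of variation a regular array cannot express directly.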

An xml document provides a flexible means to represent such a hierarchical graph, where the xml tags represent the data and their nesting reflects the relationships in the graph. At the top of the graph is the largest repeating pattern in the dataset, also called a record. At the bottom are the individual values representing the measured data in terms of integers and floats, together with their units. In between are the intermediate nodes of the graph that represent the ways the different values come together to form a single record from the instrument. Once this graph has been written down in the form of a structured xml document, the HDDM tools read the xml and automatically generate a custom set of C++ / python classes. The user's application can then include / import these classes and use them to read the raw data from the instrument into C++ objects for subsequent storage on disk in a standard format, and for final analysis.
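As a rough illustration of how tag nesting can encode such a graph, the sketch below parses a small xml fragment with Python's standard `xml.etree.ElementTree` module and prints its hierarchy. The tag and attribute names in `TEMPLATE` are hypothetical and do not follow the actual HDDM modeling language, which is described in The HDDM User's Guide; the `describe` helper is likewise an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical record description: tag and attribute names are illustrative
# only and are NOT the real HDDM template syntax.
TEMPLATE = """
<event number="int">
  <track label="string" maxOccurs="unbounded">
    <hit channel="int" energy="float" unit="GeV" maxOccurs="unbounded"/>
  </track>
</event>
"""


def describe(node, depth=0):
    """Return one line per tag, indented by its depth in the hierarchy."""
    attrs = ", ".join(f"{key}={value}" for key, value in node.attrib.items())
    lines = [f"{'  ' * depth}{node.tag}({attrs})"]
    for child in node:
        lines.extend(describe(child, depth + 1))
    return lines


# The root tag plays the role of the record; leaves carry typed values.
root = ET.fromstring(TEMPLATE.strip())
outline = describe(root)
print("\n".join(outline))
```

Here the outermost tag stands for the record, the innermost tag holds the typed values and their unit, and the tag in between is an intermediate node of the graph.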

Documentation

The documentation for HDDM consists of three parts: a description of the data modeling language used by the xml record template and the associated schema, a description of the user application interface in C++ and python that gives access to the generated data and i/o classes provided by the library, and instructions on how to use the tools through the examples provided with the package as a guide to users writing their own custom applications. All three of these have now been combined into The HDDM User's Guide. Instructions for building the HDDM tools from sources are found in the INSTALL file distributed with the sources.

Dependencies

HDDM relies on the following external open-source packages. Some must be installed on the user platform before HDDM can be built, and others are optional.

Streaming readers

Extensions to the core i/o functionality of the generated HDDM libraries are available for reading from streaming data sources using the HTTP/s and XRootD protocols. The most immediate application in view is the capability to read large hddm data files hosted on a remote server without first having to download the entire file and then read the data from local storage. If the hddm model library is built with streaming input support, then substitution of an httpIstream object constructed as

in place of an ifstream("mydatafile.hddm"), or similarly an xrootdIstream object constructed as

is all that is required to access the streaming input capability through the C++ API. Using the python module, simply supply a URL string in place of the input filename provided to the istream constructor. Building your hddm library with streaming support requires that you check out the streaming_input branch of HDDM instead of main. The build instructions for the streaming_input branch are the same as for main, with the following additional dependencies.

Acknowledgements

HDDM contains as part of its source codebase a sub-package named xstream, which is a fork of an earlier open-source package that was released as xstream 2.1 by its author Claudio Valente in 1999 under the GNU Lesser General Public License (LGPL). The original author and license information is included unchanged under xstream/AUTHOR and xstream/COPYING. The original README written by Claudio Valente is also included. The HDDM fork of xstream 2.1 was made in 2004 in order to correct some bugs in the original v2.1 code and to add new features related to stream repositioning and multi-threaded compression/decompression. These changes made the HDDM fork of xstream no longer backward-compatible with xstream 2.1. With open acknowledgement of the important contribution of xstream 2.1 by Claudio Valente to this project, the release here of the modified xstream code under an Apache open-source license is deemed consistent with the terms of the original LGPL that accompanied Valente's release of xstream 2.1. The original C++ xstream 2.1 package released in 1999 is apparently unrelated to a number of other currently active open-source projects named xstream, including the Java project XStream by Joe Walnes et al. and the JavaScript project xstream by Andre Staltz, among others.

The author acknowledges support from the United States National Science Foundation that has enabled the development of this package within the context of the University of Connecticut nuclear physics research group, where the author serves as a professor.

Contact

HDDM is released as a public GitHub project under an Apache open-source license by its designer and developer, Richard Jones, richard.t.jones(at)uconn.edu. Ongoing development of HDDM and user support are provided by the author to the GlueX Collaboration as part of his contribution to the GlueX Experiment at Jefferson Lab in Newport News, Virginia. Support for other users of HDDM will be provided by the author on an as-able basis.