glotzerlab / hoomd-blue

Molecular dynamics and Monte Carlo soft matter simulation on GPUs.
http://glotzerlab.engin.umich.edu/hoomd-blue
BSD 3-Clause "New" or "Revised" License

Store auxiliary state information in GSD files #222

Closed joaander closed 7 years ago

joaander commented 7 years ago

Original report by me.


Overview

There is a need to store auxiliary state information in GSD files. Auxiliary state information may be present in a variety of updaters/analyzers/computes (e.g. NVT internal variables) and is critical when restarting simulations.

I refer to this state as auxiliary because it is not part of the central System Definition, but is class-specific data owned and managed by any of a variety of classes that a user may have instantiated. I want to make a clear distinction between this auxiliary simulation state, which both changes and affects the simulation as it progresses, and parameters, which are fixed (e.g. force field parameters). The distinction is not completely clear-cut, however: what is a parameter in one user's case may be state information in another's. We will choose what counts as auxiliary state when a substantial number of users treat it as such.

Additionally, users commonly want quick and easy access to particle shape information so that visualization tools can render particles with shape. I firmly stand behind my opinion that choices about how data is visualized should be kept separate from the simulation. However, I am willing to facilitate making data available in the gsd file that visualizers can choose to use. In roughly 80% of use cases, a statepoint, parameter, or metadata output file can provide this information - however, most users seem to find this an insurmountable energy barrier.

There are enough use cases where shape data changes from frame to frame that metadata output is not sufficient. Thus, we should store shape data in the gsd file, in binary form, to support full restart capabilities for these simulations as well.

Data chunk naming

Data chunks storing state information should generally be named based on the class that produces the data, e.g. for NVT: state/md/integrate/nvt/values. However, many such classes may have multiple instances within a script - either active at separate points in the script or possibly active at the same time. So we need to include an instance name - automatically created or possibly user provided? state/md/integrate/nvt/${name}/values.

Shape data presents an even more complex situation. HOOMD has a number of different methods that can provide shape (anisotropic potentials, DEM, HPMC). If each of these were to store to locations based on its own class name, a visualizer tool would need to look in N different places for shape data - and possibly find it in all of them! Additionally, users might have multiple stages in their script:

integrate.sphere(...)
run(100)
integrate.convex_polyhedron(...)
run(100)

I do not know how to resolve this complexity. One idea would be to define a single location for shape data that multiple sources could write to, for example state/shape/convex_polyhedron/*, to which both HPMC convex_polyhedron and DEM 3D would write. This makes for one place in the file to look for convex polyhedron shapes, but there may also be sphere, ellipsoid, or other shape data in the file as well.

Connection to class instances

Auxiliary state data lives in a number of disparate class instances. Somehow, these instances need to know when to write their data to a gsd file, and need to be given a gsd file handle to read in their data at the appropriate time (and know their own instance name in the file).

Whether to make these connections explicit or implicit is still completely open. @PaulDodd is working on a prototype implementation with an explicit API for testing.

Writing

Depending on the use case, the user may want to write auxiliary state information to the restart gsd file or to a trajectory gsd file (and there may be many trajectory gsd files). I suggest using signals to make these connections. The provider class has a slot writeGSDAuxState(gsd_handle) that is registered to a signal in GSDDumpWriter. When GSDDumpWriter writes a frame, it fires the signal (if the user requests state/ information in the file), and the connected writeGSDAuxState methods write out their state to the current frame in the file.
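To make the mechanism concrete, here is a minimal Python sketch of this signal/slot pattern. Only the names GSDDumpWriter and writeGSDAuxState come from the proposal above; the class bodies, chunk name, and internal variable are illustrative and not HOOMD's actual implementation.

class GSDDumpWriter:
    def __init__(self):
        # callbacks connected by state providers; fired once per written frame
        self._aux_state_slots = []

    def register_aux_state(self, slot):
        # connect a provider's writeGSDAuxState-style method to this writer
        self._aux_state_slots.append(slot)

    def write_frame(self, gsd_handle):
        # ... write the standard System Definition chunks here ...
        # then fire the signal so each provider appends its state/ chunks
        for slot in self._aux_state_slots:
            slot(gsd_handle)

class NVTIntegrator:
    def __init__(self, name):
        self.name = name   # instance name used in the chunk name
        self.xi = 0.0      # thermostat internal variable

    def writeGSDAuxState(self, gsd_handle):
        # write this instance's state under its class- and instance-based chunk name
        gsd_handle.write_chunk('state/md/integrate/nvt/' + self.name + '/values',
                               [self.xi])

A writer would then call register_aux_state(nvt.writeGSDAuxState) for each provider, either explicitly from the user script or implicitly, as discussed next.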

The registration can be made explicit by requiring a user script API call to make the registration, or implicit through behind-the-scenes tracking of objects in the simulation context.

One pitfall with this approach is that storing static state only in frame 0 is not possible, because the classes writing the state may not exist until later in the simulation. This is not a severe problem - by definition, auxiliary state is dynamic and only makes sense to save on every frame.

Reading state information in

State information needs to be read in for restarting simulations.

For the prototype, we plan to make an explicit API that re-opens the gsd file and reads the necessary chunks. We need to play with the prototype API for a bit before we will fully understand what sort of API is appropriate for the final version, and how implicit or explicit it should be.
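For illustration only, here is a sketch of what such an explicit read might look like using the gsd Python file layer; the chunk name is hypothetical and the exact call signatures depend on the gsd version installed.

import gsd.fl

# re-open the file written earlier and pull a state chunk from the last frame
f = gsd.fl.open(name='restart.gsd', mode='r')  # older gsd releases use mode='rb'
frame = f.nframes - 1
chunk = 'state/md/integrate/nvt/nvt1/values'
if f.chunk_exists(frame=frame, name=chunk):
    values = f.read_chunk(frame=frame, name=chunk)
f.close()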

We could, for example, store the GSDReader used to initialize the system in the simulation context and use that open file later when reading state. This might run into problems, however, if the user writes trajectory information to the same file on disk (a valid use case of gsd).

Subject to discussion and change

This description is based on discussions with @PaulDodd @vyasr @bvansade and @samnola . It is a concept open for discussion, continued debate, and revision. @PaulDodd is implementing the prototype API framework. @bvansade has expressed interest in implementing the NVT and NPT integrator state methods.

The hoomd schema

We have not discussed this yet, but it does need to be discussed at some point. How do we document these new data chunks in the schema? Do we just say that everything in state/ is hoomd version specific and only for internal use? That is not particularly useful if external tools wish to read shape state for visualization. If we put it in the official schema spec, how do we document the chunk names that have instance names built in? Also, we will need to make absolutely sure that the schema we put in place will be able to remain stable for many years to come.

joaander commented 7 years ago

All, I typed up these notes to record the results of our discussions. Please feel free to comment if you have additional ideas, suggestions, or concerns with this concept.

@csadorf we would value your input as well.

joaander commented 7 years ago

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


General comments

I strongly agree that information which is purely related to visualization (such as color) should not be stored within the schema. In addition, we should distinguish between input metadata and output metadata. Information about integration (which includes shape) belongs to the first category, while analyzers and log information belong to the latter.

As Josh mentioned, the schema should be extremely well designed and since we have tools at hand that provide alternative means to store auxiliary state information, there is no need to rush this. In general, I think we should be extremely careful in “leaking” new data and metadata into the hoomd-blue schema.

Implementation

As discussed with Paul, I think it is a good idea to start by providing a very explicit experimental opt-in API. This will help us to figure things out without polluting the schema for years to come. After a long trial period we could (but certainly don’t have to) provide a slightly more implicit API.

On the issue of curating data for post-processing tools

One approach may be to store metadata such as shape information very explicitly as part of the state branch and then link to it from a single location.

For example, we may have a multi stage approach, where we use DEM first and HPMC second. Shape information would be stored in /state/dem/integrate/nvt/type/A/swca for the first stage and in /state/hpmc/integrate/convex_polyhedron/type/A/vertices in the second stage. In this way we don't need to pretend that shapes simulated via DEM and HPMC, while strongly related, are the same.

However, to make it a little bit easier for post-processing tools, we could provide the following links:

Stage 1:

/aux/shape/type/A/swca -> /state/dem/integrate/nvt/type/A

Stage 2:

/aux/shape/type/A/convex_polyhedron -> /state/hpmc/integrate/convex_polyhedron/type/A

This would give us more flexibility in storing the data precisely while giving post-processing tools a little more leeway in gathering information they are actually interested in.

In cases where it is impossible to provide those links, for example when a simulation protocol that is in principle legal leaves the shapes no longer clearly defined, we simply don't provide them.

joaander commented 7 years ago

Issue #7 was marked as a duplicate of this issue.

joaander commented 7 years ago

I don't think that there is a crystal-clear distinction between input and output metadata that fits all users and use cases. To one user, the force field parameters are a fixed input; to another, they evolve over the course of the simulation. To one user, shape is a fixed input; to another, it evolves. We simply have to make a judgment call on what should and should not be included.

I only suggested a single location for shape to lower the energy barrier for post processing. If we are in agreement that it is better to have class-specific locations, then we can do that. I just considered it as a possible way to avoid an issue where users do not use this feature because of yet another energy barrier.

An explicit API first strategy makes sense. Both on the output and input side.

joaander commented 7 years ago

Original comment by Paul Dodd (Bitbucket: the_real_pdodd, GitHub: PaulDodd).


Hey all, I just pushed the first attempt at the API last night to a branch called gsd_state. I will summarize briefly here.

The high-level API

I have only really implemented the HPMC integrators as an example. The HPMC integrators offer two methods:

connect_gsd(gsd, name, write_params=False, write_d=False, write_a=False) 
restore_gsd(filename, name, frame=0)

connect_gsd connects the object to the gsd dump object, and restore_gsd reads the data back from the gsd file at the specified filename and frame.
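For context, a hypothetical usage sketch built from the signatures above; the script setup and file names are made up, and the prototype on the gsd_state branch may differ in its details.

import hoomd
from hoomd import hpmc

hoomd.context.initialize()
system = hoomd.init.create_lattice(unitcell=hoomd.lattice.sc(a=2.0), n=5)

mc = hpmc.integrate.convex_polyhedron(seed=42)
mc.shape_param.set('A', vertices=[(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)])

# connect the integrator to a gsd dump so its parameters are written with each frame
d = hoomd.dump.gsd('trajectory.gsd', period=100, group=hoomd.group.all())
mc.connect_gsd(d, name='hpmc', write_params=True)
hoomd.run(1000)

# in a restart script, read the stored parameters back from a chosen frame
mc.restore_gsd(filename='trajectory.gsd', name='hpmc', frame=0)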

Current Schema

The name parameter in both of these functions specifies the base name to write to in the gsd file. In the examples below, {type name} will be whatever name the user gives the types, e.g. ‘A’, ‘B’, ‘0’.

sphere:

{name}/params/{type name}/radius

ellipsoids:

{name}/params/{type name}/a
{name}/params/{type name}/b
{name}/params/{type name}/c

convex [sphero]poly[gons | hedra]:

{name}/params/{type name}/N
{name}/params/{type name}/vertices
{name}/params/{type name}/sweep_radius

I have no strong opinions about naming so any feedback is welcome.

A few things to think about:

For more examples look at hoomd/hpmc/test-py/hpmc_gsd_state.py

joaander commented 7 years ago

I haven't had a chance to look at the code yet, but I do see at least one major problem with encoding the type name as part of the data chunk name, or otherwise allowing user-provided names: gsd has a maximum chunk name length of 63 characters. The file format was designed for fixed-name chunks in a fixed schema, so that readers can query whether a given chunk exists and avoid the complexity of iterating through a list of datasets and examining metadata to determine whether it should be read and what it means.

We could collapse the information into one list covering all type names. N and sweep_radius would be one entry per type, and vertices could be a stream-compacted list of all vertices for all types. Is such a format doable for all shape types?

joaander commented 7 years ago

Here is an example of what I mean by a stream-compacted format.

state/hpmc/convex_polyhedron/N: [2, 4]
state/hpmc/convex_polyhedron/sweep_radius: [1.0, 0.0]
state/hpmc/convex_polyhedron/vertices: [[-1, 0, 0], [1,0,0], [-1, -1, 0], [-1, 1, 0], [1, 1, 0], [1, -1, 0]]

In this way, vertices[0:2] is for the first type and vertices[2:] is for the second. In general, let offset[i] = sum(N[:i]). Then for type i, you index the vertices from offset[i] to offset[i]+N[i].

This solves the naming problems and gives a single well defined location for the HPMC shape information.
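For example, a reader could unpack this layout with a few lines of Python (a sketch only, using the example values above):

import numpy as np

N = np.array([2, 4])
vertices = np.array([[-1, 0, 0], [1, 0, 0],
                     [-1, -1, 0], [-1, 1, 0], [1, 1, 0], [1, -1, 0]])

# offset[i] = sum(N[:i]); type i owns vertices[offset[i]:offset[i] + N[i]]
offsets = np.concatenate(([0], np.cumsum(N)[:-1]))
per_type = [vertices[o:o + n] for o, n in zip(offsets, N)]
# per_type[0] holds the 2 vertices of the first type, per_type[1] the 4 of the second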

joaander commented 7 years ago

Regarding the more complex shapes - we don't need to solve everything at once. Start with the simple shapes that we need and individuals motivated by science needs can add others at a later time.

joaander commented 7 years ago

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


Although I drafted a schema with type names earlier, I agree with Josh that the type name should not be part of the chunk name, since type names merely represent aliases for the corresponding type ids anyway.

I think the typeid should represent the single source of truth for all other quantities.

joaander commented 7 years ago

Merged in gsd_state_pr (pull request #322)

fixes #222

Gsd state pr

Approved-by: Joshua Anderson joaander@umich.edu
Approved-by: Carl Simon Adorf csadorf@umich.edu
Approved-by: Vyas Ramasubramani vramasub@umich.edu
Approved-by: Bryan Vansaders bvansade@umich.edu