Add a flag to the Index Block that indicates whether a data block encodes text

josephburkhart commented 1 month ago

Summary

Currently in GSD, log data seems to be encoded in such a way that it is not possible to determine whether it should be decoded as numerical or textual information. This can be seen in a MWE provided below, in which a GSD file is written with Hoomd, but upon reading with GSD's own built-in functions, one of the logged fields seems (at first glance) to be decoded incorrectly. This behavior is actually described in the docs, but it is difficult to find that description, and the behavior can feel very unexpected to newcomers. I suggest that we add a flag to the index block to indicate whether a block contains numerical or textual information, and then modify the behavior of GSD's built-in reading functions to behave accordingly.

Description

Here is a MWE of a simple simulation that writes trajectories and a message "hello" to an output file called test.gsd:

import hoomd
import gsd.hoomd

# Initialize Frame
frame = gsd.hoomd.Frame()
frame.particles.types = ["A"]

positions = list(itertools.product(numpy.linspace(-2, 2, 2, endpoint=False), repeat=3))
frame.particles.N = len(positions)
frame.particles.position = positions
frame.particles.typeid = [0] * frame.particles.N
frame.configuration.box = [8, 8, 8, 0, 0, 0]
frame.particles.mass = [1] * frame.particles.N
frame.particles.moment_inertia = [1, 1, 1] * frame.particles.N
frame.particles.orientation = [(1, 0, 0, 0)] * frame.particles.N

# Initialize Simulation
simulation = hoomd.Simulation(device=hoomd.device.CPU(), seed=1)
simulation.create_state_from_snapshot(frame)

# Create integrator
integrator = hoomd.md.Integrator(dt=0.001)
simulation.operations.integrator = integrator

method = hoomd.md.methods.Langevin(filter=hoomd.filter.All(), kT=1.0)

integrator.methods.append(method)

# Add logger and writer
logger = hoomd.logging.Logger()
logger[("message")] = (lambda: "hello", hoomd.logging.LoggerCategories.string)   #   <-- string data gets logged here
gsd_writer = hoomd.write.GSD(trigger=hoomd.trigger.Periodic(1),
    filename="test.gsd",
    logger=logger,
    filter=hoomd.filter.All()
)
simulation.operations.writers.append(gsd_writer)

# Run the simulation and flush the writer
simulation.run(3)
simulation.operations.writers[0].flush()

When test.gsd is opened in Ovito, the message field is displayed properly:

But when test.gsd is read with GSD's built-in functions, message instead contains arrays of integers:

print(gsd.hoomd.read_log("test.gsd"))
>>> {'configuration/step': array([1, 2, 3], dtype=uint64), 'log/message': array([[104, 101, 108, 108, 111,   0],
>>>       [104, 101, 108, 108, 111,   0],
>>>       [104, 101, 108, 108, 111,   0]], dtype=int8)}

with gsd.hoomd.open("test.gsd") as f:
    print(f[0].log["message"])
>>> [104 101 108 108 111   0]

These integers are the decimal form of Unicode names for the letters 'h', 'e', 'l', 'l', 'o', and null. Indeed, you can decode them this way:

messages = gsd.hoomd.read_log("test.gsd")["log/message"]
messages = [str(m, encoding="UTF-8").rstrip("\x00") for m in messages]
print(messages)
>>> ['hello', 'hello', 'hello']

After discussing this offline with @tcmoore3 , I learned that this behavior is not only expected, but also documented - still, this behavior seems strange to me. I understand why data is stored the way it is in GSD, as blobs are much more efficient for read/write tasks than plain text, but the Python API is intended to be approachable, allowing users to, for example, create a pandas dataframe from a GSD log in one line. If we want to make it easy for users to interact with GSD files through Python, I think we should make it so they don't have to convert decimal unicode to text on their own. Just as an example, the oneliner I linked above will break if the log contains varying textual information, because pandas requires uniform dimensions in each column.

Proposed solution

I am not familiar enough with GSD's specification and C and Python APIs to give detailed suggestions for a solution, but I do have some ideas. The file layer specification indicates that the Index Block contains an unused(?) line item called flags, which is "reserved for future use". I suggest that we consider adding a flag to indicate that a data block contains textual information.

If we implemented this flag, we would also need to modify the behavior of gsd.hoomd.open() and gsd.hoomd.read_log() to check for the flag and then automatically decode the decimal unicode. We would also need to add some logic in Hoomd handle the new behavior. Again, I am not familiar enough with Hoomd's internals to suggest a full solution, but this could involve modifying the behavior of hoomd.write.GSD to set the flag when a logger passes it string values.

joaander commented 1 month ago

For various reasons, the main implementation of GSD is a pure C library, and C lacks any dedicated string type. GSD also lacked logged quantities in the initial design, so the specification did not anticipate the need. When we later added logged quantities, we kept the C representation for strings that GSD previously used only internally. Unfortunately, the HDF5 writer does not support strings at all - due to the same limitations as pandas.

I would guess that Ovito interprets the any data chunk with the type GSD_TYPE_INT8 as a string. As you point out, there is no current way to distinguish such a chunk from a string and an array of signed 8-bit values.

What use case do you have for string output that fits more naturally in a GSD file than a plain text file written by write.Table?

If there is a strong need, we could bump the specification minor revision and add a new string data type: https://github.com/glotzerlab/gsd/blob/d31663d7c1b4ac66b152c21545213282208729e3/gsd/gsd.h#L21-L52

I would not use flags - a reader could not reasonably interpret e.g. Nx3 float32 data with the "string" flag set. We could either keep the null terminated representation or restrict string data chunks to be len(str) x 1 arrays with the length given by N. Null termination is required if we want to allow max(len(strings)) x M arrays of variable length strings.

josephburkhart commented 1 month ago

I'm learning a lot - I did not know that C lacks a dedicated string type.

The use case that got me interested in this was that I wanted my function, which initializes a Hoomd simulation, to add a logger that would record the row action that originally called the function. The reason I want this is so I can track which timesteps correspond to which actions in a GSD file that is appended to by multiple row actions. I know a potentially easier approach would be to just make a new GSD file with every action, but I still wanted to approach it this way so that I have fewer GSD files in each job directory. Keeping this data in a GSD file, rather than a file written by write.Table would make it easier for me to, for example, take a single GSD file and generate a color-coded plot of logged quantities, with different colors representing different stages in a multi-action workflow.

What you say about types over flags makes sense - I'm just not familiar enough with C to know the details of implementation.

If we decide not to add this functionality in GSD, I think it might be helpful to raise a warning on the Python side of Hoomd if it detects that a loggable callback can return a string. This warning could notify the user that string output will have to be decoded on file reading.

josephburkhart commented 1 month ago

Small update: I just learned that gsd.hoomd.read_log() actually errors when a GSD file contains textual data of varying length. If I run my MWE above twice, first with message="hello" then with message="hello!", and then try to read the log with gsd.hoomd.read_log("test.gsd"), I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[104], [line 1](vscode-notebook-cell:?execution_count=104&line=1)
----> [1](vscode-notebook-cell:?execution_count=104&line=1) print(gsd.hoomd.read_log("test.gsd"))

File ~/miniconda3/envs/glotzergroup/lib/python3.11/site-packages/gsd/hoomd.py:1176, in read_log(name, scalar_only)
   [1174](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/joseph/GlotzerGroup/pentagonal-nanorods/~/miniconda3/envs/glotzergroup/lib/python3.11/site-packages/gsd/hoomd.py:1174)                     logged_data_dict[log][idx] = data[0]
   [1175](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/joseph/GlotzerGroup/pentagonal-nanorods/~/miniconda3/envs/glotzergroup/lib/python3.11/site-packages/gsd/hoomd.py:1175)                 else:
-> [1176](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/joseph/GlotzerGroup/pentagonal-nanorods/~/miniconda3/envs/glotzergroup/lib/python3.11/site-packages/gsd/hoomd.py:1176)                     logged_data_dict[log][idx] = data
   [1178](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/joseph/GlotzerGroup/pentagonal-nanorods/~/miniconda3/envs/glotzergroup/lib/python3.11/site-packages/gsd/hoomd.py:1178) return logged_data_dict

ValueError: could not broadcast input array from shape (7,) into shape (6,)

Again, Ovito handles the data correctly and opens the GSD file without issue.

glotzerlab / gsd