openPMD / openPMD-standard

:notebook: Open Standard for Particle-Mesh Data
http://www.openPMD.org
Creative Commons Attribution 4.0 International
78 stars 27 forks source link

Lists of Iteration, Mesh and Particle Pathes #282

Open franzpoeschel opened 1 year ago

franzpoeschel commented 1 year ago

Please add a brief description (one sentence) here and link the issue this pull-request implements

This is a suggestion for making openPMD markup be usable in a user-specified group hierarchy within an Iteration.

Implements issue: https://github.com/openPMD/openPMD-standard/issues/115

Description

As detailed description as possible.

In principle, custom data can be introduced to openPMD datasets today already without the openPMD standard being any wiser about it, since the standard only describes what must be there. As a result, custom groups, datasets, attributes may be used today already.

With this PR, datasets will be able to profit from openPMD markup within such custom groups, e.g. meshes and particles can be placed in user-defined places within iterations now:

Bildschirmfoto vom 2023-09-12 11-01-36

What is introduced, removed or renamed and why?

The only change necessary to this end is a new definition of the meshesPath and particlesPath attributes. The new definition is defined to be backward compatible with the old definition, e.g. existing datasets using the openPMD 1.0 specification for e.g. meshesPath will still work with openPMD 2.0.

meshesPath and particlesPath are now lists of regular expressions. (This is technically redundant as regular expressions are powerful enough to represent "lists" (e1|e2|e3), but it reduces reliance on regular expressions, and the concept of "listing" paths is important enough to be explicitly expressed in the standard.)

This way, the meshesPath and particlesPath can list the locations where meshes and particles species may be found (either as a relative or an absolute path).

What is made required, recommended, optional?

It no longer makes sense to require that a mesh exists when the meshesPath is defined, likewise the particlesPath (regular expressions might in fact denote an infinite set of locations).

What concept stands behind this change?

The fact that the openPMD standard is extensible and hierarchical. This proposal makes the openPMD standard aware of extended hierarchies and allows interacting with them.

Please also add an example.

-- will follow --

Affected Components

Logic Changes

Which logic changes are conceptually introduced?

Writer Changes

How does this change affect data writers?

Not at all if the new features are not used.

What would a writer need to change?

It can optionally use the new features.

Does this pull request change the interpretation of existing data writers?

No.

Reader Changes

How does this change affect data readers?

Data readers can no longer expect to have a single group containing meshes (likewise particles). Especially in the openPMD-api, the semantics of iteration[0].meshes become unclear when in fact the dataset contains /data/0/fields and /data/0/images.

Data readers need to traverse subdirectories if they intend to find all data.

The meshesPath and particlesPath attributes might now have values that old readers cannot interpret.

What would a reader need to change? Link implementation examples!

Data Converter

How does this affect already existing files with previous versions of the standard?

Not at all.

Is this change possible to be expressed in a way, that an automated tool can update the file? Link code/pull requests that implement the upgrade logic!

Not necessary.

Discussion:

ax3l commented 12 months ago

Thank you for the standard update and PR, this is excellent.

Let's discuss details in person, yet let me answer some ideas on the open questions above.

Should we use regular expressions at all? List of strings might also suffice. Otherwise, bash-like globbing might be a good compromise, but we might need to implement it manually.

I think regex support (e.g., which flavor) is a bit tricky across languages. But clearly, there is a big benefit in at least supporting globbing for now, because it avoids that every new entry needs to be listed. E.g., for sub-hierarchies that are numbered in a sensible manner.

My feeling is to allow only bash-like globbing for now, which is easy to replace to a regex or matching syntax in any language.

DavidSagan commented 12 months ago

My feeling is to allow only bash-like globbing for now, which is easy to replace to a regex or matching syntax in any language.

I would be cautious here since globbing and regex are incompatible. If, say, you allow bash-like globbing initially, to be backwards compatible when regex is introduced the standard must support both. This is a bit messy.

ax3l commented 12 months ago

Yes, such a change would then be for 3.0 and major. But needs a motivating set of examples why regex parsing as a dependency is needed. I just want to avoid loosing, e.g., Fortran users or people not having the exact implementation of regex flavor we require.

ax3l commented 12 months ago

Alternative approach: [...] allow overriding the meshesPath and particlesPath at each level of the hierarchy

Discussed today on nesting (minutes): Although this is cool and modular, there are a few severe downsides to this for readers if we did this now. The problem is in series, where data sets appear later in the iterations. To discover quickly in a reader the data that a user can select and read, say openPMD-viewer GUI or ParaView, one would need to open and parse all iterations to find where meshes and particles exist - and present a common set that can be selected. (Mostly an issue with fileBased encoding.)

I would be interested in concrete examples that really need this "discovery" feature to explore it more. (Do we have some deep nested structures we need to support interplay with NeXuS?)

One way to implement this with faster parsing for readers: one could be to have recursive paths from root to these elements that aid discovery: a list of paths to iteration folder(s), each of them having attributes that point to their list of mesh paths and particle paths.