go2scope / MM-storage-api

MMCore API proposal

How to handle metadata #2

Open marktsuchida opened 7 months ago

marktsuchida commented 7 months ago

Interesting ideas! I've only started looking at this so I hope to create more issues as I think about it in more detail, but the first question I have is what is going to be the strategy for handling metadata -- including application-generated metadata.

It looks like you have a std::string, which is certainly quite generic, but it is not clear to me how this is intended to work with different file formats. At least on the surface, it would seem that (1) if each file format device interprets the string differently, it will be rather unusable by an application programmer whereas (2) if every file format device uses a common data format for the string, then (aside from the need to propose such a format) serializing it to a string would incur unnecessary overhead (and, depending on the chosen format, could be error-prone).
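To make option (2) concrete, here is a minimal Python sketch (the function name and signature are hypothetical, not part of the proposal) of what a common JSON encoding at the API boundary implies: the caller serializes a dict to a string, and the storage device immediately parses it back before it can do anything useful, paying the encode/decode round-trip on every call.

```python
import json

# Hypothetical caller side: application-generated metadata for one frame.
frame_meta = {"Exposure-ms": 100.0, "Channel": "DAPI", "ZIndex": 3}

# Option (2): every format device agrees on one wire format (here JSON),
# so the caller must serialize...
meta_string = json.dumps(frame_meta)

# ...and the storage device must parse it right back before writing,
# which is the unnecessary overhead (and error surface) described above.
def storage_add_image(pixels: bytes, meta: str) -> dict:
    """Hypothetical StorageDevice entry point taking a metadata string."""
    parsed = json.loads(meta)  # raises if the string is malformed
    # ...map `parsed` onto the device's native metadata model here...
    return parsed

assert storage_add_image(b"", meta_string) == frame_meta
```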

tlambert03 commented 7 months ago

totally agree. I haven't had time to respond more fully yet (but generally also feel positively about this initiative!) ... but metadata handling is also my first question. After taking various stabs at metadata, I haven't hit on anything I particularly like personally, nor do I think the current MM approach(es) are necessarily worth emulating.

aside from the need to propose such a format

This indeed becomes a big issue. One thing I have begun to think is that there will be almost no "universally" good solution. Some will want comprehensive metadata, possibly at the expense of performance, others may want bare minimal metadata that takes no time to fetch.

I have played a bit with a GraphQL-like pattern and will "think about it aloud" here, not particularly advocating for or against it. GraphQL is a query language characterized by the request (query) taking the form of the data you want to receive back. So, supposing there is an object Thing with 30 possible fields and you only want name and id, your query looks like

{
  thing{
    name
    id
  }
}

I'm not suggesting we actually use GraphQL for anything other than inspiration, namely the concept that "we (MM) have a schema/API of all the state that we could retrieve" and "they (the storage implementation) declare what things they need to populate whatever metadata scheme they intend to write".

For example, see these TypedDict structures declared in pymmcore-plus: https://github.com/pymmcore-plus/pymmcore-plus/blob/main/src/pymmcore_plus/core/_state.py#L10-L88 ... with the "full data" being a fully populated StateDict (see link for what the nested dicts look like), and in pymmcore-plus, you use CMMCorePlus.state() to retrieve it, with parameters declaring how "full" the dict is:

class StateDict(TypedDict, total=False):
    Devices: dict[str, dict[str, str]]
    SystemInfo: SystemInfoDict
    SystemStatus: SystemStatusDict
    ConfigGroups: dict[str, dict[str, Any]]
    Image: ImageDict
    Position: PositionDict
    AutoFocus: AutoFocusDict
    PixelSizeConfig: dict[str, str | PixelSizeConfigDict]
    DeviceTypes: dict[str, DeviceTypeDict]

a storage backend could, for example, give us this string (here as GraphQL, but it could be anything):

{
  Devices {
    Camera {
      Binning
      Offset
      Exposure
    }
    Dichroic {
      Label
    }
  }
  ConfigGroups {
    Channel {
      current
    }
  }
  SystemInfo {
    VersionInfo
  }
}

and then a fast function could be prepared that would retrieve and return only what is needed:

{
  "Devices": {
    "Camera": {
      "Binning": "1",
      "Offset": "0",
      "Exposure": "100"
    },
    "Dichroic": {
      "Label": "400DCLP"
    }
  },
  "ConfigGroups": {
    "Channel": {
      "current": "DAPI"
    }
  },
  "SystemInfo": { "VersionInfo": "MMCore version 11.0.0" }
}

... and there could be both a fast per-frame query and a slower start-finish query (with more info if desired).
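The declare-then-populate pattern above can be sketched without pulling in GraphQL at all: the storage backend declares the fields it needs as a nested structure, and a small function walks the full state dict and copies out only those fields. A minimal Python sketch; the query shape and the state dict below are illustrative, not an existing pymmcore-plus API:

```python
def select(state: dict, query: dict) -> dict:
    """Return only the fields of `state` named in `query`.

    `query` mirrors the shape of `state`: a nested dict whose leaves are
    empty dicts, analogous to a GraphQL selection set.
    """
    out = {}
    for key, sub in query.items():
        if key not in state:
            continue  # backend asked for something we can't provide
        if sub and isinstance(state[key], dict):
            out[key] = select(state[key], sub)  # descend into sub-selection
        else:
            out[key] = state[key]  # leaf: copy the value as-is
    return out

full_state = {
    "Devices": {
        "Camera": {"Binning": "1", "Offset": "0", "Exposure": "100", "Gain": "2"},
        "Dichroic": {"Label": "400DCLP", "State": "0"},
    },
    "SystemInfo": {"VersionInfo": "MMCore version 11.0.0", "APIVersion": "71"},
}

# What the storage backend declares it needs (cf. the query above):
needed = {
    "Devices": {
        "Camera": {"Binning": {}, "Exposure": {}},
        "Dichroic": {"Label": {}},
    },
    "SystemInfo": {"VersionInfo": {}},
}

assert select(full_state, needed) == {
    "Devices": {
        "Camera": {"Binning": "1", "Exposure": "100"},
        "Dichroic": {"Label": "400DCLP"},
    },
    "SystemInfo": {"VersionInfo": "MMCore version 11.0.0"},
}
```

The fast per-frame query would simply be a smaller `query` dict than the start/finish one.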

This leaves the question of what data is necessary up to the storage device: if it wants to write OME XML, fine; if it doesn't need all that, also fine.

tlambert03 commented 7 months ago

I guess it's also possible that this is way too complicated, and just letting them directly use the core api could be better :)

go2scope commented 7 months ago

I agree with everything above, and here is my comment. The metadata strings are supposed to be JSON-encoded data structures. Metadata handling is indeed a hard problem. I do not believe we can develop a universal schema for metadata, and I welcome any ideas in this direction.

We can postulate that StorageDevice and MMCore must be able to automatically generate a minimal set of metadata to make a dataset readable. Metadata strings in the API are supposed to be optional. The dataset must be readable even if metadata passed through the API is incomprehensible.
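That postulate can be sketched as a simple fallback rule (all names hypothetical): the core always produces the minimal metadata it can generate on its own, and the caller-supplied string, if it parses, only supplements it, so an empty or incomprehensible string never makes the dataset unreadable.

```python
import json

def effective_metadata(width: int, height: int, pixel_type: str,
                       user_meta: str = "") -> dict:
    """Merge optional caller metadata over a core-generated minimal set."""
    # Minimal set that StorageDevice/MMCore can always auto-generate,
    # enough to make the dataset readable on its own.
    minimal = {"Width": width, "Height": height, "PixelType": pixel_type}
    try:
        extra = json.loads(user_meta) if user_meta else {}
        if not isinstance(extra, dict):
            extra = {}  # valid JSON but not a mapping: ignore it
    except json.JSONDecodeError:
        extra = {}  # incomprehensible metadata: dataset must still be readable
    # Caller metadata supplements but never overrides the minimal fields.
    return {**extra, **minimal}

assert effective_metadata(512, 512, "GRAY16") == \
    {"Width": 512, "Height": 512, "PixelType": "GRAY16"}
assert effective_metadata(512, 512, "GRAY16", "not json")["Width"] == 512
assert effective_metadata(512, 512, "GRAY16", '{"Channel": "DAPI"}')["Channel"] == "DAPI"
```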

In short:

go2scope commented 7 months ago

If the general idea of having the storage implemented in MMCore is worth developing further, we can pick a couple of widely used formats today and imagine how the client code would look for each.

For example, the simple MMCore API would work for writing generic Zarr datasets, but if we say it must be an OME Zarr dataset, it becomes more interesting. If we don't pass any metadata or if the passed metadata does not contain all the required information (or cannot be interpreted), the API would have to auto-generate all the necessary fields.

Therefore, to write a perfect OME Zarr dataset, significant cooperation is required between the StorageDevice/MMCore and the calling application, and this cooperation can be achieved only through metadata strings. Of course, that is not great, and I don't have any good suggestions. Something like what Talley mentioned might be a way to go.