ome / omero-cli-zarr

https://pypi.org/project/omero-cli-zarr/
GNU General Public License v2.0

Standardize _creator field #48

Open joshmoore opened 3 years ago

joshmoore commented 3 years ago

Currently,

(z) /opt/omero-ms-zarr $ cat 101.zarr/.zattrs
{
    "_creator": {
        "name": "omero-zarr",
        "version": "0.0.2.dev79+gb361c09"
    },

is added on export. We may want to slightly update this to match with a vocabulary like Dublin Core or W3C PROV.

sbesson commented 2 years ago

Another candidate vocabulary would be the schema.org SoftwareApplication type. It is also the vocabulary suggested in https://www.researchobject.org/ro-crate/1.0/#provenance-software-used-to-create-files.

The example above could be translated into:

          "@context": "https://schema.org",
          "@type": "SoftwareApplication",
          "name": "omero-cli-zarr",
          "version": "0.0.2.dev79+gb361c09"
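
A minimal Python sketch of that translation, in case it helps the discussion. The helper name `creator_to_software_application` is hypothetical and not part of omero-cli-zarr; it assumes the legacy `_creator` block always carries `name` and `version`, and it copies the name unchanged rather than renaming it.

```python
import json


def creator_to_software_application(creator):
    """Map a legacy ``_creator`` dict onto schema.org SoftwareApplication keys.

    Hypothetical helper for illustration only.
    """
    return {
        "@context": "https://schema.org",
        "@type": "SoftwareApplication",
        "name": creator["name"],
        "version": creator["version"],
    }


# The `_creator` block currently written on export.
legacy = {"name": "omero-zarr", "version": "0.0.2.dev79+gb361c09"}
print(json.dumps(creator_to_software_application(legacy), indent=2))
```

Whether the key should stay `_creator` or move elsewhere is a separate question; the sketch only covers the change of vocabulary.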

To also cover the discussion about additional software information in https://github.com/ome/omero-cli-zarr/pull/76#discussion_r691978664, softwareAddOn would be an option:

          "@context": "https://schema.org",
          "@type": "SoftwareApplication",
          "name": "omero-cli-zarr",
          "version": "0.0.2.dev79+gb361c09",
          "softwareAddOn": {
               "@type": "SoftwareApplication",
               "name": "bioformats2raw",
               "version": "0.3.0"
          }

joshmoore commented 2 years ago

Generally looks interesting, but we'll need to figure out where it's attached. Only at the top level? (Do we have a standard structure there?) Or for each multiscale, in case they are generated by different software?

sbesson commented 2 years ago

https://github.com/ome/omero-cli-zarr/issues/48#issuecomment-902003430 is a use case where there is a one-to-one mapping between the software and the specification i.e.

multiscales -> bioformats2raw
omero -> omero-cli-zarr

So although it could be at the top-level, there is a case for defining it (or including a reference via @id) at the level of each specification. This is what the multiscales specification currently attempts to do via metadata. Maybe we want to generalize this to allow all specifications to inject provenance metadata in a metadata field?

For more granular provenance i.e. each dataset being generated by different software, maybe we want to allow metadata fields to be defined further down the path e.g.

{
   "multiscales":[
      {
         "version":"0.2",
         "name":"example",
         "datasets":[
            {
               "path":"0",
               "metadata":{
                  "@context":"https://schema.org",
                  "@type":"SoftwareApplication",
                  "name":"bioformats2raw",
                  "version":"0.3.0"
               }
            },
            {
               "path":"1",
               "metadata":{
                  "@context":"https://schema.org",
                  "@type":"SoftwareApplication",
                  "name":"mydownsampler",
                  "version":"0.1.0"
               }
            }
         ]
      }
   ]
}
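
To make the per-dataset idea concrete, here is a rough Python sketch of how a reader could collect the software recorded for each dataset under this proposal. The function name `software_by_dataset` is hypothetical, and the layout it reads (a `metadata` object with `@type` set to `SoftwareApplication` on each dataset) is only the structure suggested above, not something any specification defines today.

```python
def software_by_dataset(zattrs):
    """Return {dataset path: (name, version)} for datasets that carry
    SoftwareApplication metadata. Hypothetical helper for illustration."""
    found = {}
    for multiscale in zattrs.get("multiscales", []):
        for dataset in multiscale.get("datasets", []):
            meta = dataset.get("metadata", {})
            # Only pick up entries that declare the schema.org type.
            if meta.get("@type") == "SoftwareApplication":
                found[dataset["path"]] = (meta.get("name"), meta.get("version"))
    return found


example = {
    "multiscales": [
        {
            "version": "0.2",
            "name": "example",
            "datasets": [
                {
                    "path": "0",
                    "metadata": {
                        "@context": "https://schema.org",
                        "@type": "SoftwareApplication",
                        "name": "bioformats2raw",
                        "version": "0.3.0",
                    },
                },
                {
                    "path": "1",
                    "metadata": {
                        "@context": "https://schema.org",
                        "@type": "SoftwareApplication",
                        "name": "mydownsampler",
                        "version": "0.1.0",
                    },
                },
            ],
        }
    ]
}

print(software_by_dataset(example))
# {'0': ('bioformats2raw', '0.3.0'), '1': ('mydownsampler', '0.1.0')}
```

Datasets without such metadata are simply skipped, so a fallback to multiscale-level or top-level provenance would have to be layered on top. Paths are also assumed to be unique across multiscales in this sketch.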