ome / ome2024-ngff-challenge

Project planning and material repository for the 2024 challenge to generate 1 PB of OME-Zarr data
https://pypi.org/project/ome2024-ngff-challenge/
BSD 3-Clause "New" or "Revised" License

Example ro-crate-metadata generation #9

Closed sherwoodf closed 1 month ago

sherwoodf commented 1 month ago

Created example code showing how to generate a ro-crate-metadata.json, with some REMBI and Zarr extensions that generate the context and the expected object links.

I wanted to try some of the RO-Crate tooling to see what it would create, in case that helps with decisions about what the metadata should look like, and to work out what would be easy for us to override or extend for Zarr users.

sherwoodf commented 1 month ago

@joshmoore @Tom-TBT @normanrz @francesw @matthewh-ebi

joshmoore commented 1 month ago

This is great, @sherwoodf! Thanks. Can you jot down a minimal workflow? (And when you would imagine it being called, e.g., before/after dev2/resave.py or perhaps it doesn't matter)

sherwoodf commented 1 month ago

Can you jot down a minimal workflow? (And when you would imagine it being called, e.g., before/after dev2/resave.py or perhaps it doesn't matter)

Sorry, didn't get back to this:

In terms of workflow, the pyproject.toml can be used with Poetry to set up a local Python environment and run the script:

poetry install
poetry run python create_fly_ro_crate_metadata.py

But for actual pipelines updating to the new version, I would expect users to write scripts similar to create_fly_ro_crate_metadata.py (I'm assuming the metadata doesn't already exist somewhere in the existing Zarrs).

The standard ro-crate logic does a deep copy of the files to create the crate along with this JSON. I suspect it would be more efficient to only create the additional ro-crate-metadata files after the dev2/resave.py (similar to what I did in the create_fly_ro_crate_metadata script). Otherwise it would be necessary to blend the ro-crate creation process into the resave.py logic to avoid creating copies of all the new Zarr files.

joshmoore commented 1 month ago

I would expect users to write scripts similar to create_fly_ro_crate_metadata.py

Agreed, but I imagine that many of us would need too many scripts to easily create them all, so reading the values from a CSV or similar may be preferred.
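
As a rough illustration of the CSV idea, here is a minimal sketch that uses only the Python standard library. The column names (zarr_path, name, description, license) and both function names are hypothetical, and the JSON is a hand-built skeleton loosely following the RO-Crate 1.1 layout rather than the output of the zarr_crate extensions; it leaves out the REMBI entities and fields such as datePublished.

```python
import csv
import json
from pathlib import Path
from typing import List


def build_metadata(name: str, description: str, license_url: str) -> dict:
    """Build a bare-bones RO-Crate-style document for one Zarr dataset.

    Assumption: this hand-written skeleton stands in for what
    ZarrCrate().metadata.generate() would produce.
    """
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "description": description,
                "license": {"@id": license_url},
            },
        ],
    }


def write_crates_from_csv(csv_path: str) -> List[Path]:
    """Write one ro-crate-metadata.json per CSV row, next to existing Zarr data.

    Only the JSON file is written; no image data is copied.
    """
    written = []
    with open(csv_path, newline="") as handle:
        # Hypothetical columns: zarr_path, name, description, license
        for row in csv.DictReader(handle):
            target = Path(row["zarr_path"]) / "ro-crate-metadata.json"
            metadata = build_metadata(
                row["name"], row["description"], row["license"]
            )
            target.write_text(json.dumps(metadata, indent=2))
            written.append(target)
    return written
```

Each row then describes one already-converted Zarr directory, so the script can run after dev2/resave.py without touching the image data.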

I suspect it would be more efficient to only create the additional ro-crate-metadata files after the dev2/resave.py

:+1:

joshmoore commented 1 month ago

Once #10 is in, I'll open a PR with something like this:

(challenge4) ~/opt/challenge/ome2024-ngff-challenge/dev2 $ git diff
diff --git a/dev2/resave.py b/dev2/resave.py
index 36b004f..35d81e9 100755
--- a/dev2/resave.py
+++ b/dev2/resave.py
@@ -292,6 +292,36 @@ def convert_image(
                 ds_shards,
             )

+def write_rocrate(output_path: str):
+    from zarr_crate.zarr_extension import ZarrCrate
+    from zarr_crate.rembi_extension import Biosample, Specimen, ImageAcquistion
+
+    crate = ZarrCrate()
+
+    zarr_root = crate.add_dataset(
+        "./",
+        properties={
+            "name": "Light microscopy photo of a fly",
+            "description": "Light microscopy photo of a fruit fly.",
+            "licence": "https://creativecommons.org/licenses/by/4.0/",
+        },
+    )
+    biosample = crate.add(
+        Biosample(crate, properties={"organism_classification": {"@id": "NCBI:txid7227" }})
+    )
+    specimen = crate.add(Specimen(crate, biosample))
+    image_acquisition = crate.add(
+        ImageAcquistion(crate, specimen, properties={"fbbi_id": {"@id": "obo:FBbi_00000243"}})
+    )
+    zarr_root["resultOf"] = image_acquisition
+
+    metadata_dict = crate.metadata.generate()
+
+    filename = os.path.join(output_path, "ro-crate-metadata.json")
+    with open(filename, "w") as f:
+        f.write(json.dumps(metadata_dict, indent=2))
+
+

 def main(ns: argparse.Namespace):
     CONFIGS = create_configs(ns)
@@ -340,6 +370,7 @@ def main(ns: argparse.Namespace):
     else:
         write_store = STORES[1]
         write_root = zarr.Group.create(write_store)
+        write_rocrate(ns.output_path)

     # image...
     if read_root.attrs.get("multiscales"):

We can either write dummy values or take those values as CLI/CSV input.
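
For the CLI option, a minimal argparse sketch might look like the following. The flag names (--rocrate, --rocrate-name, --rocrate-organism, --rocrate-modality) and the helper names are hypothetical, not the interface of resave.py; the organism and modality defaults are the placeholder IDs from the diff above, used here as dummy values.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Parser with hypothetical RO-Crate flags (not the real resave.py CLI)."""
    parser = argparse.ArgumentParser(description="Sketch of RO-Crate CLI options")
    parser.add_argument("output_path")
    # Opt-in switch so that existing invocations keep their behaviour.
    parser.add_argument("--rocrate", action="store_true",
                        help="also write ro-crate-metadata.json")
    # Dummy defaults; the two IDs are the example values from the diff.
    parser.add_argument("--rocrate-name", default="dummy name")
    parser.add_argument("--rocrate-organism", default="NCBI:txid7227")
    parser.add_argument("--rocrate-modality", default="obo:FBbi_00000243")
    return parser


def rocrate_values(ns: argparse.Namespace) -> Optional[dict]:
    """Collect the values a write_rocrate-style function would need.

    Returns None when the opt-in flag was not given.
    """
    if not ns.rocrate:
        return None
    return {
        "name": ns.rocrate_name,
        "organism": ns.rocrate_organism,
        "modality": ns.rocrate_modality,
    }
```

A call such as build_parser().parse_args(["out.zarr", "--rocrate", "--rocrate-name", "Fly"]) overrides only the name and keeps the dummy organism and modality IDs.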

joshmoore commented 1 month ago

nvm. I put it behind an optional flag and merged in the main branch so it's available in #10 now.