mobie / mobie-utils-python

Python tools for MoBIE

OME-Zarr conversion axes duplicates #121

Closed martinschorb closed 7 months ago

martinschorb commented 8 months ago

Hi,

I used mobie.add_image to convert TIFFs to OME-Zarr.

Now I checked the pixel sizes and found some strange behaviour in the more recent conversions. The new .zattrs contains the identical entry multiple times. It may well be that the conversion was run several times on an existing file (multiple batch runs on the cluster). But why does it add the exact same thing to an existing Zarr if a dataset with the identical name already exists? Also, the voxel data seems to be there only once, so it only affects the metadata.

{
  "multiscales": [
    {
      "axes": [
        {
          "name": "y",
          "type": "space",
          "unit": "nm"
        },
        {
          "name": "x",
          "type": "space",
          "unit": "nm"
        }
      ],
      "datasets": [
        {
          "coordinateTransformations": [
            {
              "scale": [
                1.676,
                1.676
              ],
              "type": "scale"
            }
          ],
          "path": "s0"
        },
        {
          "...": [
            {}
          ]
        }
      ],
      "name": "VSM20_A3_AM2_014",
      "version": "0.4"
    },
    {
      "axes": [
        {
          "name": "y",
          "type": "space",
          "unit": "nm"
        },
        {
          "name": "x",
          "type": "space",
          "unit": "nm"
        }
      ],
      "datasets": [
        {
          "coordinateTransformations": [
            {
              "scale": [
                1.676,
                1.676
              ],
              "type": "scale"
            }
          ],
          "path": "s0"
        },
        {
          "...": [
            {}
          ]
        }
      ],
      "name": "VSM20_A3_AM2_014",
      "version": "0.4"
    },
    {
      "axes": [
        {
          "name": "y",
          "type": "space",
          "unit": "nm"
        },
        {
          "name": "x",
          "type": "space",
          "unit": "nm"
        }
      ],
      "datasets": [
        {
          "coordinateTransformations": [
            {
              "scale": [
                1.676,
                1.676
              ],
              "type": "scale"
            }
          ],
          "path": "s0"
        },
        {
          "...": [
            {}
          ]
        }
      ],
      "name": "VSM20_A3_AM2_014",
      "version": "0.4"
    }
  ]
}

Is there functionality to clean up such duplicated entries from the affected Zarrs? How can this be avoided in the future? The conversions happened a while ago, so the packages were whatever versions were up to date at that time (July 12).

constantinpape commented 8 months ago

The new .zattrs contains the identical entry multiple times. It may well be that the conversion was run several times on an existing file (multiple batch runs on the cluster). But why does it add the exact same thing to an existing Zarr if a dataset with the identical name already exists?

I am not sure what is going on. I can look into this if you provide a minimal and self-contained example to reproduce this.

It may well be that the conversion was run several times on an existing file (multiple batch runs on the cluster). But why does it add the exact same thing to an existing Zarr if a dataset with the identical name already exists? Also, the voxel data seems to be there only once, so it only affects the metadata.

One potential explanation is that you start multiple jobs that add the same image and they all start before the metadata is added to dataset.json. In this case there will be some race conditions and undefined behavior that could result in what you describe. That would be a user error and you would need to update your code so that this does not happen. (I am of course not sure if this is really the issue, as I said I would need a minimal reproducible example to look into this further).
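
For illustration only (a toy sketch, not the actual mobie code): whether the cause is overlapping jobs or a rerun on a partially written file, the end result looks like a writer that appends a new multiscales entry to whatever is already in .zattrs without checking for an identical one, which would produce exactly the duplicated metadata shown above:

import json
import os
import tempfile

def toy_write_multiscale(attrs_path, entry):
    # toy writer: appends a multiscales entry without checking whether an
    # identical entry is already present
    metadata = {"multiscales": []}
    if os.path.exists(attrs_path):
        with open(attrs_path) as f:
            metadata = json.load(f)
    metadata["multiscales"].append(entry)
    with open(attrs_path, "w") as f:
        json.dump(metadata, f, indent=2)

entry = {"axes": [{"name": "y"}, {"name": "x"}], "name": "example", "version": "0.4"}
attrs_path = os.path.join(tempfile.mkdtemp(), ".zattrs")
toy_write_multiscale(attrs_path, entry)
toy_write_multiscale(attrs_path, entry)  # the second run leaves a duplicate entry behind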

Is there functionality to clean up such duplicated entries from the affected Zarrs?

Not that I am aware of, but it's fairly easy to write it with json (not tested but just to give you an idea):

import os
import json


def remove_duplicate_ngff_metadata(path_to_zarr):
    attrs_path = os.path.join(path_to_zarr, ".zattrs")
    with open(attrs_path) as f:
        metadata = json.load(f)

    # multiscales is a list which should have only one entry,
    # so we can just get rid of the additional entries
    multiscales = metadata["multiscales"]
    if len(multiscales) > 1:
        multiscales = multiscales[0:1]

    metadata["multiscales"] = multiscales
    with open(attrs_path, "w") as f:
        json.dump(metadata, f)
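
To run this over all images of a dataset, something like the following should work (the glob pattern is a placeholder, adjust it to the actual project layout):

import glob

# hypothetical path pattern; point it at wherever the ome.zarr images live
for zarr_path in glob.glob("/path/to/dataset/images/ome-zarr/*.ome.zarr"):
    remove_duplicate_ngff_metadata(zarr_path)
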
martinschorb commented 8 months ago

Hi,

this is the code I used to convert.

https://github.com/mobie/environmental-dinoflagellate-atlas/blob/main/06-add2020_images.py

I ran it multiple times because it sometimes gets stuck when concurrently writing into the same dataset.json. In between, I was always removing the tmp directories. This is exactly the procedure I have used several times before with different data (including the earlier https://github.com/mobie/environmental-dinoflagellate-atlas/blob/main/01-add_images.py) without observing the duplicated zarr metadata.

An MRE is a bit tough to generate, because with single or very small dummy image files the multi-process cluster submission will not have the same effect. Let me see if I can find a way to reproduce it.

constantinpape commented 8 months ago

I ran it multiple times because it sometimes gets stuck when concurrently writing into the same dataset.json. In between, I was always removing the tmp directories. This is exactly the procedure I have used several times before with different data (including the earlier https://github.com/mobie/environmental-dinoflagellate-atlas/blob/main/01-add_images.py) without observing the duplicated zarr metadata.

Ok, I am pretty sure this explains it. By doing this you introduce race conditions, and the zarr metadata will only be duplicated if a job gets stuck at a certain point.

Overall the approach you have is not great because you are writing to the dataset.json file in parallel. I will think about how this could be improved.

constantinpape commented 8 months ago

I think the best solution would be to avoid the race condition by not writing the source to the dataset.json in the parallel tasks, and to then write just the source information in a second pass. I think it's possible to do this with the same call to add_image: since everything is computed and cached in tmp_..., nothing will be recomputed and the source information should be added correctly.
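
Schematically the pattern would look something like this (convert_image is only a stand-in for the project's add_image call, not the mobie API; the point is just the two-pass structure):

from concurrent.futures import ProcessPoolExecutor

def convert_image(name):
    # stand-in for the project's add_image call: the real call does the heavy
    # conversion and caches its results in a tmp_... directory, so repeating
    # it is cheap
    pass

image_names = ["VSM20_A3_AM2_014"]  # the full list of images to add

if __name__ == "__main__":
    # pass 1: run the expensive conversions in parallel, without writing the
    # source metadata to the shared dataset.json from the workers
    with ProcessPoolExecutor() as pool:
        list(pool.map(convert_image, image_names))

    # pass 2: repeat the same calls serially; everything is cached, so nothing
    # is recomputed and dataset.json is only written from a single process
    for name in image_names:
        convert_image(name)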

I implemented this idea in https://github.com/mobie/mobie-utils-python/pull/123 and https://github.com/mobie/environmental-dinoflagellate-atlas/pull/1 but I have not tested it.