tischi / i2k-2020-s3-zarr-workshop


Plans for converting the data #2

Open constantinpape opened 3 years ago

constantinpape commented 3 years ago

@tischi, I exchanged a couple of emails with @joshmoore today and, as far as I understand, the current plan is the following: we don't ship the data to Josh and instead convert it locally and upload it ourselves.

I have a converter script and I am pretty sure it does the right thing, but I have a couple of other questions:

P.S. I made a new issue because #1 got a bit crowded.

tischi commented 3 years ago

Should we put the new data in a separate bucket? I can ask Josep to create one.

Yes, why not. Let's call it i2k-2020

Do we keep the same folder structure as for the other mobie projects?

From my point of view we don't need any folder structure because there will be only three files (see the very first post here: https://github.com/tischi/i2k-2020-s3-zarr-workshop/issues/1). But I don't know if for @joshmoore's current vision of the ome.zarr file-format they somehow should be in the same zarr container, because they are part of one dataset. @joshmoore would need to say.

I would suggest not to add the full res raw data, but only the 100nm version.

Yes! Excellent suggestion!

Which data do we add apart from that?

As said above: in terms of files see the very first post here: https://github.com/tischi/i2k-2020-s3-zarr-workshop/issues/1

I am not sure about the table. I don't think @joshmoore has something yet ready to store the table in zarr format?!

And ❤️ for helping!

constantinpape commented 3 years ago

From my point of view we don't need any folder structure because there will be only three files (see the very first post here: #1). But I don't know if for @joshmoore's current vision of the ome.zarr file-format they somehow should be in the same zarr container, because they are part of one dataset. @joshmoore would need to say.

Ok, in that case I would just add a single root zarr file with three multiscale datasets:

platy.zarr/
  em-raw/
     ...
  em-segmentation-cells/
     ...
  prospr-myosin/
    ...
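
The layout above could be sketched as follows, using only the Python standard library to write the zarr v2 group markers and per-dataset multiscale metadata. This is a rough sketch and not the converter script mentioned earlier: the scale-level names `s0`, `s1`, ... and the helper names `write_group` and `create_skeleton` are assumptions made for illustration, and no array data is written.

```python
import json
from pathlib import Path

# Dataset names are taken from the layout above; everything else is illustrative.
DATASETS = ["em-raw", "em-segmentation-cells", "prospr-myosin"]


def write_group(path: Path) -> None:
    """Create a directory and mark it as a zarr v2 group."""
    path.mkdir(parents=True, exist_ok=True)
    (path / ".zgroup").write_text(json.dumps({"zarr_format": 2}))


def create_skeleton(root: Path, n_levels: int = 4) -> None:
    """Write the group hierarchy and multiscale metadata (no array data)."""
    write_group(root)
    for name in DATASETS:
        group = root / name
        write_group(group)
        # Assumed scale-level names s0, s1, ...; the arrays would live at these paths.
        multiscales = [{
            "version": "0.1",
            "datasets": [{"path": f"s{level}"} for level in range(n_levels)],
        }]
        (group / ".zattrs").write_text(
            json.dumps({"multiscales": multiscales}, indent=2)
        )
```

Calling `create_skeleton(Path("platy.zarr"))` would create one group per dataset, each carrying its own multiscale metadata, so that every dataset can be opened independently of the others.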

I am not sure about the table. I don't think @joshmoore has something yet ready to store the table in zarr format?!

We could just store it as a 2D dataset with the column names in the header, but I think there is indeed no NGFF format for tables yet.

Anyway, I will start with the volumetric data and let you know once I have something. (I will probably start with just the myosin volume, so @joshmoore can check it once I have put it on the bucket; after we have made sure the format is correct, we can add the larger files.)

tischi commented 3 years ago

Related to this: https://github.com/tischi/i2k-2020-s3-zarr-workshop/issues/3

If we want to use the MoBIE infrastructure, the most straightforward approach would be to have an images.json file somewhere (like this one) pointing to three bdv.xml files (like this one) with <ImageLoader format="bdv.n5.zarr.s3">. If we do this, we may "only" have to get this done (plus some hopefully small add-ons in MoBIE) to have a working example to iterate on further.
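
To make the idea concrete, here is a small sketch of how such an `ImageLoader` element could be generated. Only the `format` attribute comes from this thread; the child element names (`ServiceEndpoint`, `BucketName`, `PathInBucket`), the helper `make_image_loader`, and the example values are assumptions, and the real bdv.xml schema may well differ.

```python
import xml.etree.ElementTree as ET


def make_image_loader(endpoint: str, bucket: str, path: str) -> ET.Element:
    """Build an <ImageLoader> element for a zarr dataset stored on S3."""
    loader = ET.Element("ImageLoader", {"format": "bdv.n5.zarr.s3"})
    # Hypothetical child elements; the actual schema is not settled in this thread.
    ET.SubElement(loader, "ServiceEndpoint").text = endpoint
    ET.SubElement(loader, "BucketName").text = bucket
    ET.SubElement(loader, "PathInBucket").text = path
    return loader


# Placeholder values, purely for illustration.
element = make_image_loader("https://s3.example.org", "my-bucket", "a.zarr")
print(ET.tostring(element, encoding="unicode"))
```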

constantinpape commented 3 years ago

pointing to three bdv.xml files (like this one) with <ImageLoader format="bdv.n5.zarr.s3">

If we do this, there are a few questions about the file layout: we cannot simply use what I suggested here, because bdv assumes fixed paths inside the dataset (setup0/timepoint0, ...).

I see three options:

joshmoore commented 3 years ago

But I don't know if for @joshmoore's current vision of the ome.zarr file-format they somehow should be in the same zarr container, because they are part of one dataset. @joshmoore would need to say.

I don't think so.

I don't think @joshmoore has something yet ready to store the table in zarr format?!

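There is some work now on an initial format:

* [ome/omero-cli-zarr#50](https://github.com/ome/omero-cli-zarr/pull/50)
* [ome/ome-zarr-py#61](https://github.com/ome/ome-zarr-py/pull/61)
* [ome/ome-zarr-py#63](https://github.com/ome/ome-zarr-py/pull/63)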

which briefly looks like this:

/opt/data/6001240.zarr $ cat labels/0/.zattrs
{
    "image-label": {
        "properties": [
            {
                "label-value": 1,
                "class": "foo"
            },
            {
                "label-value": 2,
                "class": "bar"
            }
        ],
        "colors": [
            {
                "label-value": 1,
                "rgba": [
                    128,
                    128,
                    128,
                    128
                ]
            },

Ok, in that case I would just add a single root zarr file with three multiscale datasets:

Also ok.

constantinpape commented 3 years ago

But I don't know if for @joshmoore's current vision of the ome.zarr file-format they somehow should be in the same zarr container, because they are part of one dataset. @joshmoore would need to say.

I don't think so.

Ok, let's discuss the layout tomorrow in the meeting.

There is some work now on an initial format:

* [ome/omero-cli-zarr#50](https://github.com/ome/omero-cli-zarr/pull/50)
* [ome/ome-zarr-py#61](https://github.com/ome/ome-zarr-py/pull/61)
* [ome/ome-zarr-py#63](https://github.com/ome/ome-zarr-py/pull/63)

This will produce large jsons in our case :). But we can give it a try; and in the future we can hopefully switch to storing the table as a zarr array.

tischi commented 3 years ago

But we can give it a try; and in the future we can hopefully switch to storing the table as a zarr array.

For the testing, you could just write one feature value, like size.
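
As a rough sketch of that suggestion, the snippet below builds an `image-label` attribute block with a single `size` entry per label, following the `properties` structure shown earlier in this thread. The function name `label_properties` and the size values are made up for illustration.

```python
import json


def label_properties(sizes: dict) -> dict:
    """Build an "image-label" block with one "size" property per label value."""
    return {
        "image-label": {
            "properties": [
                {"label-value": label, "size": size}
                for label, size in sorted(sizes.items())
            ]
        }
    }


# Made-up sizes for two labels.
attrs = label_properties({1: 1520.0, 2: 873.0})
print(json.dumps(attrs, indent=4))

# The serialized JSON grows linearly with the number of labels,
# which is the size concern raised above.
n_bytes = len(json.dumps(label_properties({i: 1.0 for i in range(1, 10001)})))
print(n_bytes)
```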

tischi commented 3 years ago

Personally, if I wanted to get something working within the one week left until i2k, I would do the following:

  1. Store data like this on EMBL S3:

       images.json
       a.xml
       b.xml
       c.xml
       a.zarr
       b.zarr
       c.zarr

  2. Copy all the code from https://github.com/joshmoore/n5-zarr/tree/s3zarr into a branch of MoBIE
  3. Work within the MoBIE branch until we can read the images into BDV
  4. Take it from there, e.g. factor out the s3zarr stuff into its own repo again, discuss metadata, and so on.

joshmoore commented 3 years ago

This will produce large jsons in our case :)

Yup. Definitely aware. I had tried the zarr array solution but ran into https://github.com/saalfeldlab/n5/pull/73#issuecomment-688731487. We also discussed possible integration with Parquet etc. last night on the community call. Open to thoughts.

constantinpape commented 3 years ago

@tischi your plan sounds good. I can definitely set up step 1 :). I will try to do as much as possible there before the meeting tomorrow, and then we can finalize the plan before i2k.

constantinpape commented 3 years ago

@joshmoore I uploaded one multiscale dataset to our new bucket.

Could you please check that you can access it? Here are the details:

ServiceEndpoint: https://s3.embl.de
BucketName: i2k-2020
PathInBucket: platy.ome.zarr   (this is the zarr root)

If you can access it, can you check if the dataset at prospr-myosin is compatible with the zarr multiscale format?

Thanks!
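
A check along these lines could be sketched with the standard library. The endpoint, bucket, and root path are the ones given above; path-style anonymous access, the helper names, and the minimal validation logic are assumptions, and this is not an official validator for the multiscale format.

```python
import json
from urllib.request import urlopen

ENDPOINT = "https://s3.embl.de"
BUCKET = "i2k-2020"
ROOT = "platy.ome.zarr"


def object_url(key: str) -> str:
    """Path-style URL for a key below the zarr root (assumes anonymous access)."""
    return f"{ENDPOINT}/{BUCKET}/{ROOT}/{key}"


def multiscale_paths(zattrs: dict) -> list:
    """Return the scale-level paths listed in a multiscales block."""
    multiscales = zattrs.get("multiscales")
    if not multiscales:
        raise ValueError("no 'multiscales' entry in .zattrs")
    datasets = multiscales[0].get("datasets", [])
    if not datasets:
        raise ValueError("'multiscales' entry lists no datasets")
    return [d["path"] for d in datasets]


def fetch_zattrs(group: str) -> dict:
    """Download and parse <group>/.zattrs (needs network access)."""
    with urlopen(object_url(f"{group}/.zattrs")) as response:
        return json.load(response)


print(object_url("prospr-myosin/.zattrs"))
```

With network access, `multiscale_paths(fetch_zattrs("prospr-myosin"))` would either return the listed scale levels or raise an error if the metadata is missing from that group.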

joshmoore commented 3 years ago

Hi @constantinpape,

The .zattrs that's in ...ome.zarr/ will need to be in the prospr-myosin/ directory

aws --no-sign-request --endpoint-url=https://s3.embl.de s3 ls --recursive s3://i2k-2020/platy.ome.zarr/ | grep /.z
2020-11-19 14:49:21        400 platy.ome.zarr/.zattrs
2020-11-19 14:49:21         24 platy.ome.zarr/.zgroup
2020-11-19 14:49:21        327 platy.ome.zarr/prospr-myosin/s0/.zarray
2020-11-19 14:49:21        327 platy.ome.zarr/prospr-myosin/s1/.zarray
2020-11-19 14:49:21        327 platy.ome.zarr/prospr-myosin/s2/.zarray
2020-11-19 14:49:21        321 platy.ome.zarr/prospr-myosin/s3/.zarray

constantinpape commented 3 years ago

The .zattrs that's in ...ome.zarr/ will need to be in the prospr-myosin/ directory

Thanks for checking! I fixed it in the code.
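
For future uploads, this kind of misplacement could be caught from a key listing alone. The sketch below is not the actual fix in the converter; it only flags groups that directly contain scale-level arrays but have no `.zattrs` of their own. The function name `groups_missing_zattrs` is made up, and the keys are copied from the listing above.

```python
from pathlib import PurePosixPath


def groups_missing_zattrs(keys):
    """Return groups that hold <level>/.zarray entries but no .zattrs."""
    keys = set(keys)
    # A multiscale group is the grandparent of each .zarray key.
    multiscale_groups = {
        str(PurePosixPath(key).parent.parent)
        for key in keys
        if key.endswith("/.zarray")
    }
    return sorted(g for g in multiscale_groups if f"{g}/.zattrs" not in keys)


listing = [
    "platy.ome.zarr/.zattrs",
    "platy.ome.zarr/.zgroup",
    "platy.ome.zarr/prospr-myosin/s0/.zarray",
    "platy.ome.zarr/prospr-myosin/s1/.zarray",
    "platy.ome.zarr/prospr-myosin/s2/.zarray",
    "platy.ome.zarr/prospr-myosin/s3/.zarray",
]
print(groups_missing_zattrs(listing))
# → ['platy.ome.zarr/prospr-myosin']
```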

constantinpape commented 3 years ago

I added the data according to what we discussed, see #4