ome / ngff

Next-generation file format (NGFF) specifications for storing bioimaging data in the cloud.
https://ngff.openmicroscopy.org

Storing tables #65

Open tischi opened 3 years ago

tischi commented 3 years ago

An outcome of this hackathon was that we would like to store tabular data in ome-zarr.

I wanted to ask whether we should consider that whatever we store can be easily mapped onto a csv file. Meaning that fromCSV and toCSV should work smoothly such that other software that can work with tables can be interoperable with the tabular content of the ome-zarr.

What do you think?

imagesc-bot commented 3 years ago

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/ome-ngff-spatial-omics-hackathon/57337/28

joshmoore commented 3 years ago

toCSV will be easy enough. fromCSV will for many (if not most?) cases require extra metadata. There are a number of attempts to provide such metadata, e.g., https://specs.frictionlessdata.io/data-package/
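
To illustrate why the two directions are not symmetric, here is a minimal sketch using only the Python standard library. The table contents, the `to_csv`/`from_csv` helpers and the `schema` dictionary are all hypothetical; the schema is an ad hoc illustration, not the Frictionless format. Writing drops the column types, so reading needs them supplied from somewhere else.

```python
import csv
import io

# Hypothetical table: one row per labelled object, columns with mixed types.
columns = {
    "label_id": [1, 2, 3],
    "area": [10.5, 3.0, 7.25],
    "is_cell": [True, False, True],
}


def to_csv(cols):
    """Write the columns as CSV text; every value becomes a plain string."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(cols.keys())
    writer.writerows(zip(*cols.values()))
    return buf.getvalue()


def from_csv(text, schema):
    """Read CSV text back, using a sidecar schema to restore column types."""
    parsers = {"int": int, "float": float, "bool": lambda s: s == "True", "str": str}
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    cols = {name: [] for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            cols[name].append(parsers[schema[name]](cell))
    return cols


# The extra metadata that plain CSV cannot carry (illustrative format only).
schema = {"label_id": "int", "area": "float", "is_cell": "bool"}

text = to_csv(columns)
restored = from_csv(text, schema)
assert restored == columns
```

Without `schema`, every cell would come back as a string, and the reader would have to guess the types.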

unidesigner commented 3 years ago

Just out of curiosity, what is the reason for wanting to store tabular data in zarr, vs. using some existing, optimized data format like Avro, Parquet, SQLite, etc.?

tischi commented 3 years ago

I think the point was to be able to efficiently load chunks of the table from an object store. I don't have enough technological knowledge to judge if any of the existing solutions would do the trick. Do you know?

constantinpape commented 3 years ago

> I wanted to ask whether we should consider that whatever we store can be easily mapped onto a csv file. Meaning that fromCSV and toCSV should work smoothly such that other software that can work with tables can be interoperable with the tabular content of the ome-zarr.

I think that compatibility with csv is desirable, but I am not sure how much can be done about this on the spec level. I see this as more of a software than data standard question.

> Just out of curiosity, what is the reason for wanting to store tabular data in zarr,

I would say the main reason is to provide all relevant data in the same data format and container. Also note that AnnData, which the proposal is based on, is using zarr as storage already.

tischi commented 3 years ago

> Also note that AnnData, which the proposal is based on, is using zarr as storage already.

My worry was that AnnData is much richer than a simple table and thus it may be difficult to map it onto a "simple table"? For example, both in Fiji and Napari there are ways to display a table. Do you think that one could also display AnnData in a "simple table viewer"?

constantinpape commented 3 years ago

> My worry was that AnnData is much richer than a simple table

I don't think that AnnData is much richer than a simple table; at least not the subset that we are discussing here. But we have the major advantage that the dtype for each column is known ...

> Do you think that one could also display AnnData in a "simple table viewer"?

Sure. Load X into a 2D array, load obs into a 2D array (this works in Python, where complex dtypes are easy; I don't know how you would do this in Java, but you have the same problem with CSV), and concatenate the two column-wise. This gives you a simple table. (The only question is what to do about var, but for simplicity it could just be ignored.)
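
A dependency-free sketch of that recipe is shown here, with plain lists standing in for the arrays (in practice one would more likely use NumPy or pandas). The values, the column names and the `to_simple_table` helper are hypothetical, and only the names from var are used to label the X columns.

```python
# Hypothetical stand-ins for the AnnData fields discussed above.
X = [[0.1, 2.0], [0.4, 1.5], [0.9, 3.2]]  # n_obs x n_vars measurements
obs = {"label_id": [1, 2, 3], "cell_type": ["A", "B", "A"]}  # per-row annotations
var_names = ["mean_intensity", "area"]  # only the names of var are used here


def to_simple_table(X, obs, var_names):
    """Return (header, rows) with the obs columns followed by the X columns."""
    n_rows = len(X)
    if any(len(col) != n_rows for col in obs.values()):
        raise ValueError("obs columns must have one entry per row of X")
    header = list(obs) + list(var_names)
    rows = [[obs[name][i] for name in obs] + list(X[i]) for i in range(n_rows)]
    return header, rows


header, rows = to_simple_table(X, obs, var_names)
for row in [header] + rows:
    print(row)
```

The result is a header plus a list of rows, which is the shape a simple table viewer would expect.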

unidesigner commented 3 years ago

> I think the point was to be able to efficiently load chunks of the table from an object store. I don't have enough technological knowledge to judge if any of the existing solutions would do the trick. Do you know?

I found this comparison of zarr and parquet interesting, especially the choice of zarr over parquet for its flexibility and the append option. https://waterdata.usgs.gov/blog/cloud_data/

I don't know all the formats in detail, but I imagine that it's not the first time this requirement comes up, and people have implemented solutions for this.

A question I'd have is about the index, i.e. how do you know what slice of data you want to fetch?

tischi commented 3 years ago

> A question I'd have is about the index, i.e. how do you know what slice of data you want to fetch?

Good question! Typically one row in the table would correspond to a specific region (e.g. a point) in the image.

One use case is that when people look at an image region, one wants to load all the rows that correspond to that region, e.g. in order to render something on the image.

We were thinking that an efficient image-coordinate to table-row mapping could be done by a tree where you enter the coordinate and the leaves of the tree are the table-row indices. However, how to serialize a tree into ome-zarr is something that we did not look into yet....
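
As a rough in-memory illustration of such a tree, here is a minimal 2D point quadtree whose leaves hold table-row indices. The class, its parameters and the sample points are hypothetical, and the sketch says nothing about how the tree would be serialized into ome-zarr.

```python
class QuadTree:
    """Minimal 2D point quadtree; leaves store (x, y, row_index) entries."""

    def __init__(self, x0, y0, x1, y1, capacity=2, depth=0, max_depth=16):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.depth = depth
        self.max_depth = max_depth  # guards against endless splits on duplicates
        self.items = []
        self.children = None

    def insert(self, x, y, row):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            return False
        if self.children is None:
            self.items.append((x, y, row))
            if len(self.items) > self.capacity and self.depth < self.max_depth:
                self._split()
            return True
        # any() stops at the first child that accepts the point.
        return any(child.insert(x, y, row) for child in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        quadrants = [(x0, y0, xm, ym), (xm, y0, x1, ym), (x0, ym, xm, y1), (xm, ym, x1, y1)]
        self.children = [
            QuadTree(*q, self.capacity, self.depth + 1, self.max_depth) for q in quadrants
        ]
        items, self.items = self.items, []
        for x, y, row in items:
            self.insert(x, y, row)

    def query(self, qx0, qy0, qx1, qy1):
        """Return the row indices of all points inside the query rectangle."""
        x0, y0, x1, y1 = self.bounds
        if qx1 < x0 or qx0 > x1 or qy1 < y0 or qy0 > y1:
            return []
        if self.children is None:
            return [r for x, y, r in self.items if qx0 <= x <= qx1 and qy0 <= y <= qy1]
        rows = []
        for child in self.children:
            rows.extend(child.query(qx0, qy0, qx1, qy1))
        return rows


# Hypothetical table: the row index is the position in this list.
points = [(1, 1), (2, 8), (9, 9), (5, 5), (8, 2), (1.5, 1.2)]
tree = QuadTree(0, 0, 10, 10)
for row, (x, y) in enumerate(points):
    tree.insert(x, y, row)

print(sorted(tree.query(0, 0, 3, 3)))  # [0, 5]
```

A viewer would pass the currently visible image region to `query` and then load only the returned rows from the table.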

unidesigner commented 3 years ago

An R-Tree is probably what you are looking for. Not sure if this can be serialized in a way where it is not necessary to do a full-table scan to find out about the relevant rows.

The pandas docs have some interesting links as well for out-of-memory data formats and libraries, in particular the ecosystem page. It's not only about purely fetching data for visualization purposes, but also for efficient compute.

imagesc-bot commented 2 years ago

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/next-generation-user-friendly-smlm-processing-software-aka-thunderstorm-2-0/62289/20

oeway commented 2 years ago

I have been testing octree-based spatial partitioning of point clouds (using a library called potree) for the shareloc.xyz platform. It allows us to visualise large point clouds instantly (instead of downloading everything).

Here is a demo for visualising single-molecule localisation microscopy data: https://imodpasteur.github.io/shareloc-utils/shareloc-potree-viewer.html?pointShape=circle&pointSizeType=adaptive&name=FFB000&load=https://imjoy-s3.pasteur.fr/public/pointclouds/7312e0.zip

When you open and zoom in, more point chunks will be loaded to the browser.

The tree is stored in a zip file and I used HTTP Range requests to obtain the chunks.
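
For readers unfamiliar with the mechanism, here is a minimal sketch of how a byte range can be requested with the Python standard library. The URL, the offsets and the `build_range_request` helper are hypothetical; in a real reader the offsets would come from the zip central directory, and the server must support range requests.

```python
import urllib.request


def build_range_request(url, start, length):
    """Create a GET request for `length` bytes starting at byte offset `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    end = start + length - 1  # the Range header uses an inclusive end offset
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


# Hypothetical archive URL and chunk location.
request = build_range_request("https://example.org/pointcloud.zip", 1024, 4096)
print(request.get_header("Range"))  # bytes=1024-5119

# Sending it would look like this (not executed here):
# data = urllib.request.urlopen(request).read()
```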

As I understand it, the tabular support we are discussing here won't allow storing point chunks organized in a tree yet, am I right?

cc @joshmoore

joshmoore commented 2 years ago

Also cc: @kevinyamauchi and @ivirshup, who are also discussing this further this week.

I think you are right that there's no tree representation in the current discussions, but perhaps it's more a matter of AND rather than OR. That is, my understanding of the benefit of a tabular layout is the ability to add annotations to the data. How would that work in the tree representation? Does one need both?

For those interested in taking a look, here are some brief details on the contents of @oeway's zip:

unzipped 7312e0.zip

```
cat sources.json | jq .
{
  "bounds": {
    "min": [
      1600.013671875,
      1633.791748046875,
      0
    ],
    "max": [
      41039.4140625,
      40804.4296875,
      0
    ]
  },
  "projection": "",
  "sources": [
    {
      "name": ".tmp.txt",
      "points": 16898373,
      "bounds": {
        "min": [
          1600.013671875,
          1633.791748046875,
          0
        ],
        "max": [
          41039.4140625,
          40804.4296875,
          0
        ]
      }
    }
  ]
}
cat cloud.js | jq .
{
  "version": "1.7",
  "octreeDir": "data",
  "projection": "",
  "points": 16898373,
  "boundingBox": {
    "lx": 1600.013671875,
    "ly": 1633.791748046875,
    "lz": 0,
    "ux": 41039.4140625,
    "uy": 41073.192138671875,
    "uz": 39439.400390625
  },
  "tightBoundingBox": {
    "lx": 1600.013671875,
    "ly": 1633.791748046875,
    "lz": 0,
    "ux": 41039.4140625,
    "uy": 40804.4296875,
    "uz": 0
  },
  "pointAttributes": [
    "POSITION_CARTESIAN",
    "COLOR_PACKED"
  ],
  "spacing": 341.55523681640625,
  "scale": 0.001,
  "hierarchyStepSize": 5
}
tree data/ | head
data/
└── r
    ├── 00060
    │   ├── r00060.bin
    │   └── r00060.hrc
    ├── 00062
    │   ├── r00062.bin
    │   └── r00062.hrc
    ├── 00064
    │   ├── r00064.bin
tree data/ | tail
    ├── r6642.bin
    ├── r6644.bin
    ├── r6646.bin
    ├── r666.bin
    ├── r6660.bin
    ├── r6662.bin
    ├── r6664.bin
    └── r6666.bin

760 directories, 3368 files
```

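
Judging only from this listing, the file layout appears to follow the node name: names with fewer than `hierarchyStepSize` digits sit directly under `data/r/`, and each complete group of five digits adds one sub-directory. The sketch below encodes that reading. The `node_file_path` helper is hypothetical and inferred from the listing, not taken from the potree documentation.

```python
def node_file_path(node_name, octree_dir="data", step=5, ext=".bin"):
    """Map an octree node name such as 'r00060' to its path in the listing.

    Assumption (inferred from the listing above): one sub-directory is
    created per complete group of `step` digits after the leading 'r'.
    """
    if not node_name.startswith("r"):
        raise ValueError("node names are expected to start with 'r'")
    digits = node_name[1:]
    n_groups = len(digits) // step
    groups = [digits[i * step:(i + 1) * step] for i in range(n_groups)]
    return "/".join([octree_dir, "r", *groups, node_name + ext])


print(node_file_path("r00060"))              # data/r/00060/r00060.bin
print(node_file_path("r00060", ext=".hrc"))  # data/r/00060/r00060.hrc
print(node_file_path("r6642"))               # data/r/r6642.bin
```
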
kevinyamauchi commented 2 years ago

Hey @oeway ! Nice to see you here. Super cool that you're looking into rendering with spatial partitioning.

Indeed, we are currently focusing on storing points in a table and we are not planning to specify the format for spatial indices (for now). I think there are too many different strategies and the best one is likely application dependent, so I think it doesn't make sense to standardize that. I am definitely open to adding specs for some common spatial indices (e.g., octree, rtree) at some point once we have the basic table spec nailed down.

The pattern I would advocate for is that one queries the spatial index (e.g., octree) to look up the rows to fetch from the table for rendering. The table can be chunked along the rows, so this will allow points to be loaded lazily. Of course, the performance will depend on your chunking and the ordering of your table.
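
To make the dependence on ordering concrete, here is a small sketch that computes which row-chunks have to be fetched for a set of row indices returned by a spatial index. The chunk size, the row indices and the `chunks_to_fetch` helper are made up for illustration.

```python
def chunks_to_fetch(row_indices, chunk_rows):
    """Return the sorted ids of the row-chunks that contain the given rows."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    return sorted({row // chunk_rows for row in row_indices})


chunk_rows = 1000  # hypothetical number of table rows per chunk

# Rows returned by the spatial index for one image region, in two scenarios:
clustered = range(2000, 2050)             # table sorted so neighbours are adjacent
scattered = [17, 1200, 2450, 3999, 5100]  # table in arbitrary order

print(chunks_to_fetch(clustered, chunk_rows))  # [2]
print(chunks_to_fetch(scattered, chunk_rows))  # [0, 1, 2, 3, 5]
```

In the first case fifty rows cost a single chunk read; in the second, five rows cost five reads, which is why sorting the table along the spatial index matters.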

What do you think @oeway , does this sound reasonable?

oeway commented 2 years ago

Hey,

@joshmoore Would it make sense to consider the spatial partitioning information as some sort of annotation to the x,y,z columns?

@kevinyamauchi Good to see you here too! Yes, I think it does! It will certainly be useful for the use case I am targeting (i.e. SMLM data), and I can also see it being super useful for storing massive scatter plots, e.g. those generated from scRNA-seq.

I just did a quick read of your existing PR. In practice, if we do want to support octrees (the structure most often used for displaying LiDAR sensor data, and one that has been proven to work with even trillions of points for browser-based visualization), would it mean we just add additional tables to var? I would be happy to work with any of you to make a data loader to bridge with the potree viewer (the one I am using now).

ivirshup commented 2 years ago

> Would it make sense to consider the spatial partitioning information as some sort of annotation to the x,y,z columns?

I think it might make sense to consider spatial indexing as a property of the coordinate array. Especially if you have the same points represented in different coordinate spaces (e.g. slide by itself, slide aligned in stack).